****************************************************************************************
* These notes contain NDSU confidential and proprietary material. Patents are pending *
* on the concepts and applications of bSQ organization and P-tree technology, etc. *
****************************************************************************************
THE DATA
Spatial Data Organizations
First we consider ways of organizing spatial data. Spatial attributes such as
remotely sensed reflectances (R,G,B,NIR,..), ground attributes (yield levels,
soil moisture levels, elevations), etc., are referred to as "bands".
Let R(P, B1,...Bn) be the file or relation containing these data bands as columns
or attributes for a particular space or area, where P is the key (pixel coordinates,
x-y, of the points in the space) and each column, B1,..,Bn, measures the level of
that attribute for each pixel location. If the inverted list model is used rather
than the relational model (so that one can assume an ordering of the tuples), the
raster ordering of coordinates is usually assumed (first row, followed by second row,
followed by third row, ...)
This Relational (REL) organization is the starting point or basic organization.
In Band-SeQuential (BSQ) organization, the REL organization is projected into many
files - a separate file for each band or column. The coordinate ordering is
assumed to be raster order, and thus need not be part of each band file. Each
band file is then a 1-column file of the measurements for that attribute at each
pixel in raster order (eg, TM data from Landsat satellites is organized as BSQ).
In Band-Interleaved-by-Line (BIL) there is just one file in which the
first row (line) of the first band is followed by the
first row of the second band, ..., followed by the
first row of the last band, followed by the
second row of the first band, followed by the
second row of the second band, ... etc. (e.g., SPOT data from French Satellites is BIL)
In Band-Interleaved-by-Pixel (BIP), there is just one file in which the
first pixel-value of the first band is followed by the
first pixel-value of the second band,..., the
first pixel-value of the last band, followed by the
second pixel-value of the first band,... (e.g., tiff images are BIP).
We note BIP is nearly identical to REL except there are no explicit "record"
or row boundary markers (ie, data is not organized into records, but the values
are in the same order as they are in REL).
A new organization at the "interleaving extreme" end of this spectrum of
organizations is Band-Interleaved-by-bit (BIb) in which there is just one file, the
first bit of the first pixel-value of the first band is followed by the
first bit of the first pixel-value of the second band,..., the
first bit of the first pixel-value of the last band, followed by the
second bit of the first pixel-value of the first band,...
Another new organization, at the other end of this organization spectrum, is
bit-SeQuential (bSQ), in which each bit of each band, B11..B18, B21..B28 ... Bn1..Bn8,
is a separate file. We will use bSQ organization later in this course.
We have the following spectrum of Band-oriented organizations:
REL is the basic organization in which there is one file and no interleaving (we say
there is no interleaving since a relation is a "set" of tuples, not a sequence, and
each tuple is a "set" of attribute values, not a sequence; i.e., in a relation,
there is no ordering of values).
more interleaving-- >
bSQ BSQ BIL BIP BIb
< -- more files
A very simple illustrative example (with only 2 bands, each having only 2 rows and 2 columns)
BAND-1 BAND-2
254 127 37 240
(1111 1110) (0111 1111) (0010 0101) (1111 0000)
14 193 200 19
(0000 1110) (1100 0001) (1100 1000) (0001 0011)
REL organization: RRN |x-y | B1 | B2 | (a set of tuples,
|====|====|====| each tuple is a set of attribute values)
0 |0,0 |254 | 37 |
1 |0,1 |127 |240 |
2 |1,0 | 14 |200 |
3 |1,1 |193 | 19 |
bSQ organization: (16 files)
B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B28
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
1 1 1 1 1 1 1 0 0 0 1 0 0 1 0 1
0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0
1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1
BSQ organization: B1 B2 (two separate files, values given in decimal)
---- ----
254 37
127 240
14 200
193 19
BIL organization: (one file, values given in decimal)
254 127 37 240 14 193 200 19
BIP organization: (one file, values given in decimal)
254 37 127 240 14 200 193 19
BIb organization: (one file, values given in binary)
10 10 11 10 10 11 10 01 01 11 11 11 10 10 10 10
01 01 00 00 11 10 10 00 10 10 00 01 00 00 01 11
Through simple offset arithmetic, one can convert among these organizations.
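The conversions can be sketched in a few lines of Python. This is only an illustration on the 2-band example above, assuming each band is held in memory as a raster-ordered list; the function names (to_bil, to_bip, to_bsq_bits, to_bib, bip_offset) are made up for this sketch and are not part of any existing tool.

```python
# Two 2x2 bands from the example above, each stored in raster order.
ROWS, COLS = 2, 2
bands = [
    [254, 127, 14, 193],   # BAND-1
    [37, 240, 200, 19],    # BAND-2
]

def to_bil(bands, rows, cols):
    """One file: row r of band 1, row r of band 2, ..., then row r+1."""
    out = []
    for r in range(rows):
        for band in bands:
            out.extend(band[r * cols:(r + 1) * cols])
    return out

def to_bip(bands):
    """One file: all band values of pixel 0, then of pixel 1, ..."""
    return [band[p] for p in range(len(bands[0])) for band in bands]

def bit(value, i, width=8):
    """Bit i of value, where i=1 is the high-order bit."""
    return (value >> (width - i)) & 1

def to_bsq_bits(bands, width=8):
    """bSQ: one bit vector per (band number, bit position)."""
    return {
        (k + 1, i): [bit(v, i, width) for v in band]
        for k, band in enumerate(bands)
        for i in range(1, width + 1)
    }

def to_bib(bands, width=8):
    """BIb: for each pixel, for each bit position, one bit from each band."""
    return [
        bit(band[p], i, width)
        for p in range(len(bands[0]))
        for i in range(1, width + 1)
        for band in bands
    ]

def bip_offset(pixel, band_index, n_bands):
    """Offset of (pixel, band) in a BIP file; both indices are 0-based."""
    return pixel * n_bands + band_index

print(to_bil(bands, ROWS, COLS))
print(to_bip(bands))
```

The offset helper shows the kind of arithmetic meant here: the value of band 2 at pixel 2 sits at position 2*2 + 1 = 5 of the BIP file.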
**************************************************************
Note that in traditional Market Basket Data Mining each "item"
is treated as a separate column or attribute of REL and the values
are Boolean (1 or 0, for yes or no). Thus, for MBDM, we start with:
B1 B2 B3 B4 B5 B6 B7 ...
REL organization: |trans|hat |shoe|coat|milk|beer|soap|nails|...
|=====|====|====|====|====|====|====|=====|...
(one relation) |tid-1| 1 | 0 | 0 | 1 | 0 | 1 | 0 |...
|tid-2| 0 | 0 | 0 | 0 | 1 | 0 | 0 |...
|tid-3| 0 | 0 | 1 | 0 | 0 | 0 | 0 |...
|tid-4| 0 | 1 | 0 | 1 | 0 | 1 | 0 |...
|tid-5| 0 | 0 | 0 | 0 | 1 | 0 | 1 |...
. . .
bSQ organization: B1 B2 B3 B4 B5 B6 B7 ...
1 0 0 1 0 1 0
(separate 0 0 0 0 1 0 0
file for 0 0 1 0 0 0 0
each item) 0 1 0 1 0 1 0
0 0 0 0 1 0 1
. . .
BSQ organization: B1 B2 B3 B4 B5 B6 B7 ...
1 0 0 1 0 1 0
(identical to 0 0 0 0 1 0 0
bSQ) 0 0 1 0 0 0 0
0 1 0 1 0 1 0
0 0 0 0 1 0 1
. . .
BIL organization: (transactions are not in any natural 2-D arrangement
so we consider the tid's to constitute one big row)
10000...00010...00100...10010...01001...10010...00001...
BIP organization:
10010100000100001000001010100000101...
BIb organization: (each value is a single bit, thus BIb = BIP)
With Boolean data from a Market Basket Database,
in bSQ=BSQ the data is organized into
a separate file for each item, ordered by transaction;
in BIL the data is organized into one file, grouped by item,
with each item's values in transaction order;
in BIP=BIb the data is organized into one file, grouped by transaction,
with each transaction's values in item order.
Note: Market Basket Data Mining is done assuming the REL organization.
**************************************************************
An example of spatial data comes from precision agriculture, where
we subdivide or "grid" a field into "pixels" or points (usually evenly).
0 1 2 3 4 5 6 7 8 9 10 11 12
.---.---.---.---.---.---.---.---.---.---.---.---.---.
0| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
1| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
2| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
3| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
4| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
5| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
6| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
7| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
8| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
.| | | | | | | | | | | | | |
.
.
The reflectance levels within given spectral ranges (e.g., Red, Green, Blue..)
are captured by a sensor and recorded in raster-ordered BANDs
RED-band
pix refl
0,0 24
0,1 26
0,2 49
0,3 68
0,4 93
0,5 119
.
.
.
The key for each band is the x,y coordinates. This attribute is usually
omitted since the raster ordering is taken to be understood.
So a "BAND" is a single attribute file of
the relative reflectance levels (expressed as numbers in [0, 255]) observed
in a particular color range (or non-visible range such as infra-red...) or an
agricultural band (yield levels - e.g., bushels per acre for each pixel).
An association rule example: "At points in a field where the midsummer
Near-Infrared (NIR) reflectance is greater than 47 and
Red reflectance is less than 32, the
Yield will be greater than 128 bu/acre"
The rule is written { NIR>47, R<32 } => { Y>128 }
- the set, { NIR>47, R<32 } is called the "antecedent" of the rule
- the set { Y>128 } is called the "consequent" of the rule
"SUPPORT" of the rule = % (or ratio) of pixels with NIR>47 and R<32 and Y>128.
- as a ratio, it can be expressed |antecedent UNION consequent| / Total,
where |X| is the number of pixels satisfying every condition in the set X
"CONFIDENCE" = % (or ratio) of pixels with NIR>47 and R<32 which also have Y>128.
- as a ratio, it can be expressed |antecedent UNION consequent| / |antecedent|
If the support and confidence of this rule are high, that suggests to the producer
that nitrogen fertilizer should be applied where NIR<=47 and/or R>=32,
so as to maximize the yield in those areas (get it up over 128 bu/acre).
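As a small illustration of the two measures, the sketch below computes support and confidence for the rule { NIR>47, R<32 } => { Y>128 } over a handful of pixels. The pixel values are made up for the example and the helper names are hypothetical.

```python
# Hypothetical pixels as (NIR, R, Y) triples; the values are invented.
pixels = [
    (60, 20, 140),
    (50, 30, 130),
    (55, 25, 100),
    (40, 20, 150),
    (70, 45, 90),
]

def antecedent(p):
    """True when the pixel satisfies NIR>47 and R<32."""
    nir, red, _ = p
    return nir > 47 and red < 32

def consequent(p):
    """True when the pixel satisfies Y>128."""
    return p[2] > 128

n_antecedent = sum(1 for p in pixels if antecedent(p))
n_both = sum(1 for p in pixels if antecedent(p) and consequent(p))

support = n_both / len(pixels)        # |antecedent UNION consequent| / Total
confidence = n_both / n_antecedent    # |antecedent UNION consequent| / |antecedent|
print(support, confidence)
```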
For ARM, we need to formally define the notions of items, itemsets and
transactions in spatial datasets.
The items: I = {(b,v) : b= a band, v= a reflectance value}
The transactions: D = {t : t=(tid, t-itemset)},
tid=(x,y), the pixel row,col and
t-itemset = {(b,v): b ranges over all bands and
v is the reflectance at pixel, t, in band, b}.
Note right away that the sizes are very very large in the ARM sense
(e.g., for TM satellite images (with yield bands), there are ~40,000,000
transactions, 8*256 = 2048 items and 2^(2048) itemsets!)
The number of transactions (pixels) can be reduced by focusing on a
particular small area (e.g., a field).
The number of itemsets can be reduced by noting that a pixel can have only
one reflectance value from a given band. Almost always we are interested
in knowing when the values are in a particular range or interval.
Therefore we can restrict our itemset consideration to those composed of
one interval from each band.
---------------------------
In a given band there are 255 ways to pick the left endpoint of the interval
and for left-endpoint, l, there are 255-l ways to pick a right endpoint.
On the average there will be 127 ways to pick the right endpoint.
Thus, there are really only (255*127)^8 or ~(2^8*2^7)^8 = 2^120 ~ 10^36 =
1,000,000,000,000,000,000,000,000,000,000,000,000 itemsets to consider.
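The arithmetic can be checked directly. The sketch below assumes 8 bands and interval endpoints drawn from 0..255 with left < right; the variable names are chosen only for this illustration.

```python
# Exact number of (left, right) endpoint pairs with 0 <= left < right <= 255.
exact_intervals = sum(255 - left for left in range(255))

estimate = (255 * 127) ** 8      # the estimate used in the text
bound = 2 ** 120                 # the rounded power of two

print(exact_intervals)
print(f"{estimate:.3e}  {bound:.3e}")
```

Both figures are on the order of 10^36, so the rounding does not change the conclusion that the search space is far too large to enumerate.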
- We can reduce the number of items by partitioning the Bands into intervals
and letting each interval correspond to a value.
Partitioning bands into intervals:
Equilength interval partitioning.
By truncating some of the right-most bits of the values (low order or least
significant bits) we can reduce the size of the item set dramatically without
losing too much information (the low order bits show only slight differences).
For example, we can truncate the right-most 6 bits, resulting in 4 intervals,
each of which we consider to be a "value" (e.g., identify each interval
with its midpoint):
[0,64), [64,128), [128,192), [192,256) identified with values, 32, 96, 160, 224
Then there are only 10^8 = 100,000,000 itemsets (10 candidate intervals in
each band: the 4+3+2+1 = 10 runs of adjacent base intervals). That's still a lot!
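A minimal sketch of equilength partitioning by bit truncation follows. The function names are hypothetical, and the reading of "10 intervals in each band" as the number of runs of adjacent base intervals is an assumption made for this illustration.

```python
def interval_of(value, kept_bits, width=8):
    """Return (lo, hi, midpoint) of the equal-length interval [lo, hi) that
    contains value when only the kept_bits high-order bits are retained."""
    shift = width - kept_bits
    lo = (value >> shift) << shift
    hi = lo + (1 << shift)
    return lo, hi, (lo + hi) // 2

def contiguous_ranges(n_base):
    """Number of ranges built from adjacent base intervals: n + (n-1) + ... + 1."""
    return n_base * (n_base + 1) // 2

print(interval_of(193, 2))          # 2 bits kept, 6 bits truncated
print(contiguous_ranges(4) ** 8)    # ranges per band, raised to 8 bands
```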
Further pruning can be done by understanding what kinds of rules are probably
of interest to the user and focusing on those only. For instance:
For a precision farmer, there is probably little interest in rules of the type,
R>48 => G<134.
A physicist might be interested in relationships among colors observed
(both antecedent and consequent from visible bands), but the farmer is
interested only in relationships where the antecedent is from the color
bands and the consequent is from the yield band (he or she wants to know
what observed color combinations predict high yield).
Therefore, for precision agriculture, we could restrict to those rules that have
consequent from the yield band (and then only the particular interval which
indicates "high yield") and antecedent from the others, so 10^7 = 10,000,000
itemsets to consider.
We will refer to restrictions of this type (in the type of itemsets allowed for
antecedent and consequent based on interest) as restricting to rules which
are "of interest" (OI rules), as distinct from the notion of rules that are
"interesting". OI rules can be interesting or not interesting, depending on
such measures as support and confidence, etc.
Slalom analogy:
Each transaction (pixel), t, is like a path down a ski hill, each
item is an interval in one band and therefore like a "gate" on the ski slope:
A transaction (pixel) "contains" an itemset, if it "goes thru" each gate
(has band-i reflectance in interval-i).
So if x is an itemset (set of "gates", one for each band),
s(x) is the proportion of paths passing thru the gates of x.
b1 b2 b3 b4 b5 b6 b7 b8
| | .---. | |
t---. | / | \ | |
`---------------------------' | \ | |
| | | \ | .----
| | | | \_______/ |
Non-equi-length:
In some cases, it is better to let users partition a band into intervals of
uneven length, so that user knowledge can guide the partition.
E.g., band bi can be partitioned into 3 intervals {[0,64), [64,128), [128,256)}
(if there aren't many values between 128 and 255).
Applying the user's domain knowledge can improve the accuracy and efficiency
of association rule mining.
Equi-depth partitioning (each partition has approx. the same number of pixels).
Can be done by setting the endpoints so that there are (approximately) the same
number of values in each interval (e.g., splitting at the median for two intervals), etc.
Sometimes this leads to more reasonable rules.
Whether partitioning is equilength or not, it can be easily characterized as:
For each band, bi, choose interval end-points, e0=0 < e1 < ... < e(n+1)=256;
then the items are ( bi, [ej, e(j+1)) ), j=0,..,n
(in the equilength case there is a common length, e(j+1) - ej = a constant).
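The end-point characterization can be sketched as follows. The rank-based choice of equi-depth end-points is one simple option assumed here, not a method prescribed by these notes, and the sample values and function names are invented for the illustration.

```python
from bisect import bisect_right

def equi_depth_endpoints(values, n_intervals, top=256):
    """End-points e0=0 < ... < top chosen so each interval holds about the
    same number of the given values (a simple rank-based choice)."""
    ordered = sorted(values)
    cuts = [ordered[(j * len(ordered)) // n_intervals] for j in range(1, n_intervals)]
    return [0] + sorted(set(cuts) - {0}) + [top]

def item_for(band, value, endpoints):
    """Map a reflectance value in [0, top) to its item (band, (lo, hi)),
    standing for the half-open interval [lo, hi)."""
    j = bisect_right(endpoints, value) - 1
    return band, (endpoints[j], endpoints[j + 1])

red = [24, 26, 49, 68, 93, 119, 30, 31]   # made-up sample of RED values
ends = equi_depth_endpoints(red, 2)
print(ends, item_for("R", 49, ends))

# A user-chosen, non-equilength partition works with the same lookup.
print(item_for("G", 130, [0, 64, 128, 256]))
```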
We consider a data structure which is particularly well suited for data mining
spatial data. Assume the data is in bSQ organization, B11,..,Bn8
(a separate file for each bit position of each band; that is, each band has been
separated by bit position into 8 "bit-bands" or bit vectors).
It is common practice to reduce data volume by truncating a certain
number of low-order bits of each byte. Thus we will speak of "8-bit values"
(the full byte values), 7-bit values (with the low-order bit truncated),
6-bit values (low-order two bits truncated), ...
Let the ith bit-band of the kth band be denoted Bki.
Each bit-band can be represented using a spatial data structure called a
P-tree (for Peano-tree). These P-trees are formulated to facilitate data
mining of spatial byte-bands.
The Peano-Tree for Bij is a lossless tree representation of the bit band
from which the bit band can be completely reconstructed and which also
contains the 1-bit count for each and every quadrant in the original space.
Example:
Suppose we have a band, Bk, in a 64 pixel space (8 rows by 8 columns):
11110001 10010010 11100011 11010101 10000000 11100101 01111000 00110011
10110001 11010011 11101010 11000001 11100100 00101101 00011110 01010101
11010001 10010010 11100011 11010101 10000000 11100101 01111000 00110011
10010001 11010011 11101010 11000001 11100100 10101101 10011110 01010101
11110001 10010010 11100011 11010101 10001110 11100101 11111000 10110011
10110001 11010011 11101010 11000001 11100110 10101101 10011110 11011101
11010001 10010010 11100011 10011101 10001010 11100101 11111000 10110111
00010001 11010011 11101010 10101001 11101100 10101101 10011110 11010101
Consider the bit-band, Bk1 of the above band file:
Bk1
1111 1100
1111 1000
1111 1100
1111 1110
1111 1111
1111 1111
1111 1111
0111 1111
Which, of course, when saved on disk in raster sequence is:
1111 1100 1111 1000 1111 1100 1111 1110 1111 1111 1111 1111 1111 1111 0111 1111
For this bit-band, the P-tree structure is:
Pk1 55
____________// \\___________
/ __/ \_ \
16 8 15 16
___//|\ /|\\__
/ / | \ / | \ \
3 0 4 1 4 4 3 4
1110 0010 1101
Here is how we arrive at it:
The root holds the count of 1-bits for its quadrant
(which is the entire bit array).
11111100
11111000
11111100
11111110 count=55
11111111
11111111
11111111
01111111
Each inode has the 1-bit count for its quadrant (order inodes at each level
using Peano ordering (recursive raster ordering) or ul, ur, ll, lr).
cnt=16 - > cnt=8
1111 1100
1111 1000
1111 1100
1111 1110
/
/
/
/
/
/
v
cnt=15 - > cnt=16
1111 1111
1111 1111
1111 1111
0111 1111
giving:
55
____________// \\___________
/ __/ \_ \
16 8 15 16
Note that the ul and lr quadrants at this level need no further detailing
since they are entirely 1-bits. Thus, the tree ends here for those quadrants.
For the ur quadrant:
1100
1000
1100
1110
recursively, we count 1-bits for the subquadrants in raster order:
cnt=3 - > cnt=0
11 00
10 00
/
/
/
/
v
cnt=4 - > cnt=1
11 00
11 10
giving: 55
____________// \\___________
/ __/ \_ \
16 8 15 16
___//|\
/ / | \
3 0 4 1
We note that only the ul and lr subquadrants need detailing, and we
detail them by listing the bits in raster order (this is a recursive step
also, since we are now down to 1x1 quadrants and the count is either 1 or 0).
55
____________// \\___________
/ __/ \_ \
16 8 15 16
___//|\
/ / | \
3 0 4 1
1110 0010
For the ll quadrant:
1111
1111
1111
0111
recursively, we count 1-bits for the subquadrants in raster order:
cnt=4 - > cnt=4
11 11
11 11
/
/
/
/
/
v
cnt=3 - > cnt=4
11 11
01 11
giving: 55
____________// \\___________
/ __/ \_ \
16 8 15 16
___//|\ /|\\__
/ / | \ / | \ \
3 0 4 1 4 4 3 4
1110 0010
Finally, only the ll subquadrant needs detailing:
55
____________// \\___________
/ __/ \_ \
16 8 15 16
___//|\ /|\\__
/ / | \ / | \ \
3 0 4 1 4 4 3 4
1110 0010 1101
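The construction just walked through can be sketched as a short recursive function. This is a minimal illustration, assuming the bit-band is given as a list of row strings and representing each node as a (count, children) pair; the representation and the name build are choices made for this sketch only.

```python
BK1 = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11111111",
    "11111111",
    "11111111",
    "01111111",
]

def build(bits, row=0, col=0, size=None):
    """Return (count, children) for the size x size quadrant whose upper-left
    pixel is (row, col). children is None for a pure quadrant or a single
    bit; otherwise it is the list [ul, ur, ll, lr]."""
    if size is None:
        size = len(bits)
    if size == 1:
        return int(bits[row][col]), None
    half = size // 2
    kids = [
        build(bits, row + dr, col + dc, half)
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))
    ]
    count = sum(k[0] for k in kids)
    if count == 0 or count == size * size:
        return count, None            # pure quadrant: the tree ends here
    return count, kids

root = build(BK1)
print(root[0], [k[0] for k in root[1]])
```

The children of a mixed 2x2 quadrant are its four bits in raster order, which is how the leaf strings such as 1110 arise.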
If we complete the counts for all subquadrants, the leaf sequence is
just the well-known Peano ordering sequence for the bit-band.
(thus the terminology, "Peano Count Tree")
(and we can think of the P-tree as a compressed form of the Peano sequence
with the addition of having all quadrant 1-counts as well)
55
_________________/| \\_______________________________
/ | \____________ \
16 8 15 16
_____//\\____ _____//\\____ _____//\\____ _____//\\____
/ / \ \ / / \ \ / / \ \ / / \ \
4 4 4 4 3 0 4 1 4 4 3 4 4 4 4 4
1111 1111 1111 1111 1110 0000 1111 0010 1111 1111 1101 1111 1111 1111 1111 1111
Start_here
|
v
1-1 1-1 /1-1 0-0
/ / / / / / /
1-1/ 1-1 | 1-0/ 0-0
______/ | ______/
/ | /
1-1 1-1 | 1-1 0-0
/ / / / / / /
1-1/ 1-1' 1-1/ 1-0
___________________/
/
1-1 1-1 /1-1 1-1
/ / / / / / /
1-1/ 1-1 | 1-1/ 1-1
______/ | ______/
/ | /
1-1 1-1 | 1-1 1-1
/ / / / / / /
0-1/ 1-1' 1-1/ 1-1 <-End_here
Peano ordering is a "space filling" curve ordering which can be thought of as
"recursive raster ordering" (recursing over ever increasing quadrant sizes).
Hilbert is another "space filling" ordering which preserves distances better
than Peano (every move is to a neighbor). It may result in better compression
but it does not appear to be as useful for our purposes.
Start_here
|
v
1-1 1-1--1 1--0-0
| | | | |
1-1 1-1 1-0 0-0
| | |
1 1--1 1 1-1 0-0
| | | | | | |
1-1 1-1 1 1--1-0
|
|
1-1 1-1 1 1--1-1
| | | | | | |
1 1--1 1 1-1 1-1
| | |
1-1 1-1 1-1 1-1
| | | | |
0-1 1-1--1 1--1-1
^
|
End_here
Here is a discussion of the two orderings:
ab cd ef gh
ij kl mn op
qr st uv wx
yz 01 23 45
AB CD EF GH
IJ KL MN OP
QR ST UV WX
YZ 67 89 -+
Peano ordering follows the pattern:
abij cdkl qryz st01 efmn ghop uv23 wx45 ABIJ CDKL QRYZ ST67 EFMN GHOP UV89 WX-+
Hilbert ordering follows the pattern:
abji qyzr s01t lkcd emnf ghpo wx54 3vu2 EMNF GHPO WX+- 9VU8 76ST LDCK JBAI QRZY
Note that in Hilbert, every move is to a neighbor, so it preserves locality better than Peano ordering.
The bottom up construction is:
_
The pattern is: starting with _| in the upper left 4-value quadrant,
1. rotate along y=-x axis and drag down (completing an 8-value pattern)
2. fold the 8-value pattern to the right (completing a 16-value pattern
- the 4x4 upper left quadrant)
3. rotate the 16-value pattern (along y=-x always) and drag right
4. fold the 32-value pattern down
5. rotate the 64-value pattern and drag down
6. fold the 128-value pattern right
7. rotate the 256-value pattern and drag right
8. fold the 512-value pattern down
. . .
So the 3 bitvectors with the orderings (reorderings) are:
Raster:
1111 1100 1111 1000 1111 1100 1111 1110 1111 1111 1111 1111 1111 1111 0111 1111
Peano:
1111 1111 1111 1111 1110 0000 1111 0010 1111 1111 1101 1111 1111 1111 1111 1111
Hilbert:
1111 1111 1111 1111 1101 0000 0001 1111 1111 1111 1111 1111 1111 1111 1111 1110
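The raster-to-Peano reordering can be sketched by interleaving the bits of the row and column numbers (row bit first at each level). The function names are hypothetical; the Hilbert reordering is not covered by this sketch.

```python
RASTER = ("1111 1100 1111 1000 1111 1100 1111 1110 "
          "1111 1111 1111 1111 1111 1111 0111 1111").replace(" ", "")

def peano_index(row, col, levels):
    """Position of pixel (row, col) in Peano (recursive raster) order:
    interleave the bits of row and col, row bit first, high order first."""
    index = 0
    for b in range(levels - 1, -1, -1):
        index = (index << 2) | (((row >> b) & 1) << 1) | ((col >> b) & 1)
    return index

def raster_to_peano(bits, levels):
    """Reorder a raster-ordered bit string of a 2^levels x 2^levels space."""
    side = 1 << levels
    out = [None] * (side * side)
    for r in range(side):
        for c in range(side):
            out[peano_index(r, c, levels)] = bits[r * side + c]
    return "".join(out)

print(raster_to_peano(RASTER, 3))
```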
To address quadrants we use a Quadrant-ID scheme:
- First assign level numbers to the quadrants (and the P-tree levels)
The root is Level-n if there are 4^n elements (a 2^n X 2^n array) in the space.
Each quadrant at Level-i has 4^i elements (it is a 2^i X 2^i quadrant).
Level-0 is the lowest level - the leaf level (2^0 X 2^0, i.e., 1x1, quadrants)
- Assign 2-bit addresses to the quadrants within each level:
ul=00 ur=01 ll=10 lr=11
- A quadrant is identified by the sequence of its 2-bit addresses
(along the inodes in its path from the root).
- Therefore the ul subquadrant of the ll subquadrant of the ur
subquadrant of the bit-band has QID: 01.10.00
.--------.
| | L3
_____________________/`--------'\___________________
/00 01 / \ 10 \ 11
.------. .------. .------. .------.
/`------'\ /`------'\ /`------'\ /`------'\ L2
/ | | \ / | | \ / | | \ / | | \
/ | | \ / | | \ / | | \ / | | \
00 01 10 11 00 01 10 11 00 01 10 11 00 01 10 11
.--. .--. .--. .--. .--. .--. .--. .--. .--. .--. .--. .--. .--. .--. .--. .--L1
`--' `--' `--' `--' `--' `--' `--' `--' `--' `--' `--' `--' `--' `--' `--' `--'
----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|---L0
^
|
01.10.00 is QID of this 1x1 quadrant (or writing it in non-binary as 1.2.0)
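A QID can be decoded into pixel coordinates by halving the quadrant size at each step. The sketch below assumes dotted binary QIDs and an 8x8 space (Level-3 root); the function name is made up for the illustration.

```python
# 2-bit addresses mapped to (row offset, column offset): ul, ur, ll, lr.
ADDRESS = {"00": (0, 0), "01": (0, 1), "10": (1, 0), "11": (1, 1)}

def qid_to_quadrant(qid, levels):
    """Return (row, col, size) of the quadrant named by a dotted QID in a
    2^levels x 2^levels space; (row, col) is its upper-left pixel."""
    row = col = 0
    size = 1 << levels
    for part in qid.split("."):
        dr, dc = ADDRESS[part]
        size //= 2
        row += dr * size
        col += dc * size
    return row, col, size

print(qid_to_quadrant("01.10.00", 3))
```

For QID 01.10.00 the steps are ur (rows 0-3, columns 4-7), then ll (rows 2-3, columns 4-5), then ul, which is the single pixel at row 2, column 4.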
Now continuing with the definitions of Peano Count Trees:
For the bit band, Bk1:
1111 1100
1111 1000
1111 1100
1111 1110
1111 1111
1111 1111
1111 1111
0111 1111
We have P-tree, Pk1 (assuming the above data comes from band-k
and is the high order bit or bit-1 of that band)
Pk1 55
____________// \\___________
/ __/ \_ \
16 8 15 16
___//|\ /|\\__
/ / | \ / | \ \
3 0 4 1 4 4 3 4
1110 0010 1101
There is also Pk,0 for the other possible 1-bit value, 0.
Usually we don't express it because it is so easily derivable
from Pk,1 as the complement:
Pk0 9
____________// \\___________
/ __/ \_ \
0 8 1 0
___//|\ /|\\__
/ / | \ / | \ \
1 4 0 3 0 0 1 0
0001 1101 0010
Single bit P-trees can be built for the other bit positions as well:
Pk2 (using the second high-order bit instead of the 1st).
...
Pk8 (using the 8th high-order bit, i.e., the lowest-order bit).
11110001 10010010 11100011 11010101 10000000 11100101 01111000 00110011
10110001 11010011 11101010 11000001 11100100 00101101 00011110 01010101
11010001 10010010 11100011 11010101 10000000 11100101 01111000 00110011
10010001 11010011 11101010 11000001 11100100 10101101 10011110 01010101
11110001 10010010 11100011 11010101 10001110 11100101 11111000 10110011
10110001 11010011 11101010 11000001 11100110 10101101 10011110 11011101
11010001 10010010 11100011 10011101 10001010 11100101 11111000 10110111
00010001 11010011 11101010 10101001 11101100 10101101 10011110 11010101
Bk2:
1011 0110
0111 1001
1011 0110
0111 1001
1011 0110
0111 1001
1010 0110
0110 1001
Pk2: 38
_________________// \\__________________
/ __/ \_____ \
____12 8 10 _ 8 ________
/ / |\ ___//\\____ / | \_\__ /\\____ \
/ / | \ / / \ \ / | \ \ / \ \ \
2 4 2 4 2 2 2 2 2 4 2 2 2 2 2 2
1001 1001 0110 1001 0110 1001 1001 1001 1010 0110 1001 0110 1001
Bk3:
1010 0111
1010 1100
0010 0111
0010 1100
1010 0111
1010 1100
0010 0111
0011 1100
Pk3: 33
_________________// \\__________________
/ __/ \_____ \
____ 6 10 7 _ 10 ________
/ / |\ ___//\\____ / | \_\__ /\\____ \
/ / | \ / / \ \ / | \ \ / \ \ \
2 2 0 2 3 2 3 2 2 2 0 3 3 2 3 2
1010 1010 1010 0111 1100 0111 1100 1010 1010 1011 0111 1100 0111 1100
Bk4:
1101 0011
1100 0011
1101 0011
1100 0011
1101 0011
1100 0011
1101 0011
1100 0011
Pk4: 36
_________________// \\___________________
/ __/ \______ \
____ 10 8 _10 8 ________
/ / |\ ___//\\____ / | \_\__ /\\____ \
/ / | \ / / \ \ / | \ \ / \ \ \
4 1 4 1 0 4 0 4 4 1 4 1 0 4 0 4
0100 0100 0100 0100
Bk5:
0000 0010
0010 0110
0000 0010
0010 0110
0000 1010
0010 0111
0001 1010
0011 1110
Pk5: 22
_________________// \\___________________
/ __/ \______ \
____ 2 6 4 10 ________
/ / |\ ___//\\____ / | \_\__ /\\____ \
/ / | \ / / \ \ / | \ \ / \ \ \
0 1 0 1 1 2 1 2 0 1 0 3 2 3 3 2
0010 0010 0001 1010 0001 1010 0010 0111 1001 1011 1011 1010
Bk6:
0001 0100
0000 1111
0001 0100
0000 1111
0001 1100
0000 1111
0001 0101
0000 1111
Pk6: 26
_________________// \\___________________
/ __/ \______ \
____ 2 10 2 12________
/ / |\ ___//\\____ / | \_\__ /\\____ \
/ / | \ / / \ \ / | \ \ / \ \ \
0 1 0 1 3 2 3 2 0 1 0 1 4 2 3 3
0100 0100 0111 0011 0111 0011 0100 0100 0011 0111 0111
Bk7:
0110 0001
0110 0010
0110 0001
0110 0010
0110 1001
0110 1010
0110 1001
0110 0010
Pk7: 27
___________________// \\___________________
/ __/ \______ \
____ 8 4 8 7 ________
/ / |\__ ___//\\____ / | \_\__ /\\____ \
/ / | \ / / \ \ / | \ \ / \ \ \
2 2 2 2 0 2 0 2 2 2 2 2 2 2 1 2
0101 1010 0101 1010 0110 0110 0101 1010 0101 1010 1010 0110 1000 0110
Bk8:
1011 0101
1101 0101
1011 0101
1101 0101
1011 0101
1101 0101
1011 0101
1101 0101
Pk8: 40
___________________// \\________________________
/ __/ \_______ \
____12 8 _ 12 8 ________
/ / |\__ ___/ /\\____ / | \_\____ /\\____ \
/ / | \ / / \ \ / | \ \ / \ \ \
3 3 3 3 2 2 2 2 3 3 3 3 2 2 2 2
1011 1101 1011 1101 0101 0101 0101 0101 1011 1101 1011 1101 0101 0101 0101 0101
These 8 P-trees are called the "basic P-trees"
The basic P-trees can be combined together to produce other
useful P-trees (including the original data again)
There is an "algebra" on the universe of P-trees for a spatial dataset.
It includes unary operator COMP (complement, sometimes denoted by ')
and binary operators, AND, OR, XOR, etc.
COMP is done as follows:
At level-i, replace each count, c, by (4^i - c)
AND is done as follows:
For Pk,v AND Pk,j: Working down in depth-first order from the root until
you reach a pure quadrant, Q (pure0 means all 0's & pure1 means all 1's)
if the Pk,v branch terminates in pure1's at Q, Pk,v AND Pk,j|Q = Pk,j|Q
elseif the Pk,j branch terminates in pure1's at Q, Pk,v AND Pk,j|Q = Pk,v|Q
elseif either terminates in a quadrant of pure0's at Q, Pk,v AND Pk,j|Q = 0|Q
else (both are mixed at Q) descend into the four subquadrants of Q and repeat
OR is done as follows:
For Pk,v OR Pk,j: Working down in depth-first order from the root until
you reach a pure quadrant, Q (pure0 means all 0's & pure1 means all 1's)
if the Pk,v branch terminates in pure0's at Q, Pk,v OR Pk,j|Q = Pk,j|Q
elseif the Pk,j branch terminates in pure0's at Q, Pk,v OR Pk,j|Q = Pk,v|Q
elseif either terminates in a quadrant of pure1's at Q, Pk,v OR Pk,j|Q = 1|Q
else (both are mixed at Q) descend into the four subquadrants of Q and repeat
XOR is done as follows:
For Pk,v XOR Pk,j: Working down in depth-first order from the root until
you reach a pure quadrant, Q (pure0 means all 0's & pure1 means all 1's)
if the Pk,v branch terminates in pure0's at Q, Pk,v XOR Pk,j|Q = Pk,j|Q
elseif the Pk,j branch terminates in pure0's at Q, Pk,v XOR Pk,j|Q = Pk,v|Q
elseif the Pk,v branch terminates in pure1's at Q, Pk,v XOR Pk,j|Q = COMP(Pk,j)|Q
elseif the Pk,j branch terminates in pure1's at Q, Pk,v XOR Pk,j|Q = COMP(Pk,v)|Q
else (both are mixed at Q) descend into the four subquadrants of Q and repeat
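A minimal sketch of COMP, AND and OR on count trees follows. It assumes each node is a (count, children) pair with children set to None for pure quadrants and single bits, and it passes the quadrant side length explicitly; these are choices made for the illustration. OR is obtained here through De Morgan's law rather than by the case analysis above, and XOR is left out.

```python
def pure(node, size):
    """Return 1 or 0 for a pure quadrant of the given side length, else None."""
    count, _ = node
    if count == size * size:
        return 1
    return 0 if count == 0 else None

def comp(node, size):
    """COMP: replace each count c of a 4^i-element quadrant by 4^i - c."""
    count, kids = node
    new_kids = None if kids is None else [comp(k, size // 2) for k in kids]
    return size * size - count, new_kids

def p_and(a, b, size):
    """AND, following the pure1 / pure0 / both-mixed cases."""
    if pure(a, size) == 1:
        return b
    if pure(b, size) == 1:
        return a
    if pure(a, size) == 0 or pure(b, size) == 0:
        return 0, None
    kids = [p_and(x, y, size // 2) for x, y in zip(a[1], b[1])]
    count = sum(k[0] for k in kids)
    return (count, kids) if count else (0, None)   # collapse an all-0 result

def p_or(a, b, size):
    """OR via De Morgan: a OR b = COMP(COMP(a) AND COMP(b))."""
    return comp(p_and(comp(a, size), comp(b, size), size), size)

# Two 2x2 quadrants with bits 1110 and 0110 (raster order).
a = (3, [(1, None), (1, None), (1, None), (0, None)])
b = (2, [(0, None), (1, None), (1, None), (0, None)])
print(p_and(a, b, 2)[0], p_or(a, b, 2)[0], comp(a, 2)[0])
```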
We can construct the P-trees for 2-bit values
(gives all quadrant counts of "hits" on the particular 2-bit value).
- Pk,11 (PCT for 11) = Pk1 AND Pk2
- Pk,01 (PCT for 01) = COMP(Pk1) AND Pk2
- Pk,10 (PCT for 10) = Pk1 AND COMP(Pk2)
- Pk,00 (PCT for 00) = COMP(Pk1) AND COMP(Pk2)
and for 3-bit values, 4-bit values, etc.
Pk,11 = Pk,1 AND Pk,2
Pk,01 = Pk,1' AND Pk,2
Pk,10 = Pk,1 AND Pk,2'
Pk,00 = Pk,1' AND Pk,2'
Pk,111= Pk,1 AND Pk,2 AND Pk,3 = Pk,11 AND Pk,3
Pk,101= Pk,1 AND Pk,2' AND Pk,3 = Pk,10 AND Pk,3
Pk,011= Pk,1' AND Pk,2 AND Pk,3 = Pk,01 AND Pk,3
Pk,001= Pk,1' AND Pk,2' AND Pk,3 = Pk,00 AND Pk,3
Pk,110= Pk,1 AND Pk,2 AND Pk,3' = Pk,11 AND Pk,3'
Pk,100= Pk,1 AND Pk,2' AND Pk,3' = Pk,10 AND Pk,3'
Pk,010= Pk,1' AND Pk,2 AND Pk,3' = Pk,01 AND Pk,3'
Pk,000= Pk,1' AND Pk,2' AND Pk,3' = Pk,00 AND Pk,3'
Pk,1111=Pk,1 AND Pk,2 AND Pk,3 AND Pk,4 = Pk,111 AND Pk,4
Pk,1011=Pk,1 AND Pk,2' AND Pk,3 AND Pk,4 = Pk,101 AND Pk,4
Pk,0111=Pk,1' AND Pk,2 AND Pk,3 AND Pk,4 = Pk,011 AND Pk,4
Pk,0011=Pk,1' AND Pk,2' AND Pk,3 AND Pk,4 = Pk,001 AND Pk,4
Pk,1101=Pk,1 AND Pk,2 AND Pk,3' AND Pk,4 = Pk,110 AND Pk,4
Pk,1001=Pk,1 AND Pk,2' AND Pk,3' AND Pk,4 = Pk,100 AND Pk,4
Pk,0101=Pk,1' AND Pk,2 AND Pk,3' AND Pk,4 = Pk,010 AND Pk,4
Pk,0001=Pk,1' AND Pk,2' AND Pk,3' AND Pk,4 = Pk,000 AND Pk,4
Pk,1110=Pk,1 AND Pk,2 AND Pk,3 AND Pk,4'= Pk,111 AND Pk,4'
Pk,1010=Pk,1 AND Pk,2' AND Pk,3 AND Pk,4'= Pk,101 AND Pk,4'
Pk,0110=Pk,1' AND Pk,2 AND Pk,3 AND Pk,4'= Pk,011 AND Pk,4'
Pk,0010=Pk,1' AND Pk,2' AND Pk,3 AND Pk,4'= Pk,001 AND Pk,4'
Pk,1100=Pk,1 AND Pk,2 AND Pk,3' AND Pk,4'= Pk,110 AND Pk,4'
Pk,1000=Pk,1 AND Pk,2' AND Pk,3' AND Pk,4'= Pk,100 AND Pk,4'
Pk,0100=Pk,1' AND Pk,2 AND Pk,3' AND Pk,4'= Pk,010 AND Pk,4'
Pk,0000=Pk,1' AND Pk,2' AND Pk,3' AND Pk,4'= Pk,000 AND Pk,4'
. . .
Pk,00000000=Pk,1' & Pk,2' & Pk,3' & Pk,4' & Pk,5' & Pk,6' & Pk,7' & Pk,8'
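As a check on the 2-bit formulas, the sketch below computes only the root count of each value P-tree for the example band. It works on flat bit vectors rather than on trees, which is a simplification assumed for this illustration; the helper names are hypothetical.

```python
BAND_K = """
11110001 10010010 11100011 11010101 10000000 11100101 01111000 00110011
10110001 11010011 11101010 11000001 11100100 00101101 00011110 01010101
11010001 10010010 11100011 11010101 10000000 11100101 01111000 00110011
10010001 11010011 11101010 11000001 11100100 10101101 10011110 01010101
11110001 10010010 11100011 11010101 10001110 11100101 11111000 10110011
10110001 11010011 11101010 11000001 11100110 10101101 10011110 11011101
11010001 10010010 11100011 10011101 10001010 11100101 11111000 10110111
00010001 11010011 11101010 10101001 11101100 10101101 10011110 11010101
""".split()

def bit_band(values, i):
    """Bit vector of bit position i (1 = high-order bit), in raster order."""
    return [int(v[i - 1]) for v in values]

def complement(bits):
    return [1 - b for b in bits]

def and_bits(x, y):
    return [p & q for p, q in zip(x, y)]

b1, b2 = bit_band(BAND_K, 1), bit_band(BAND_K, 2)
counts = {
    "11": sum(and_bits(b1, b2)),
    "10": sum(and_bits(b1, complement(b2))),
    "01": sum(and_bits(complement(b1), b2)),
    "00": sum(and_bits(complement(b1), complement(b2))),
}
print(counts)
```

The four counts partition the 64 pixels, and the counts for 11 and 10 together give the root count of Pk1.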
. . .
Actual storage might be done as: (assume Ln = root, ie, 4^n pixels)
Breadth-first layout is a structure with n+1 elements (one for each level).
Ln: a 1+2*n bit field (to hold counts up to 4^n)
L(n-1): a 1+2*(n-1) bit field for each of the L(n-1)-quadrants if the root
is not pure ("mixed" root)
L(n-2): a 1+2*(n-2) bit field for each of the L(n-2)-quadrants whose L(n-1)
. . . parent is not pure ("mixed" parent)
Lk: a 1+2*k bit field for each of the Lk-quadrants whose L(k+1) parent
. . . is not pure ("mixed" parent)
L1: a 1+2*1=3 bit field for each of the L1-quadrants whose L2 parent
is not pure ("mixed" parent)
L0: a 1+2*0=1 bit field for each of the L0-quadrants whose
L1 parent is not pure ("mixed" parent)
Depth-first layout:
a 1+2*n bit field for the root count. If the root is not pure, it is followed by
a 1+2*(n-1) bit field for the 0th L(n-1)-quadrant.
if that quadrant is pure, it is followed by a 1+2*(n-1) bit field for the 1st L(n-1)-quadrant
(and likewise for the 2nd and 3rd L(n-1)-quadrants),
else it is followed by a 1+2*(n-2) bit field for its 0th L(n-2)-quadrant,
and so on recursively: a pure quadrant is followed by its next sibling,
a mixed quadrant by its own 0th subquadrant.
...
We will use breadth-first layout.
Then we actually store:
Pk,1 55 L3
____________// \\___________
/ __/ \_ \
16 8 15 16 L2
___//|\ /|\\__
/ / | \ / | \ \
3 0 4 1 4 4 3 4 L1
1110 0010 1101 L0
as:
0110111
10000 01000 01111 10000
011 000 100 001 100 100 011 100
1110 0010 1101
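The breadth-first layout can be sketched as below, assuming the same (count, children) node representation used for illustration elsewhere and writing each level as one line of bit fields. The names serialize and bits are hypothetical.

```python
def bits(nibble):
    """Four 1x1 leaf nodes from a string such as '1110'."""
    return [(int(c), None) for c in nibble]

# Pk1 as (count, children); children is None for pure quadrants.
PK1 = (55, [
    (16, None),
    (8, [(3, bits("1110")), (0, None), (4, None), (1, bits("0010"))]),
    (15, [(4, None), (4, None), (3, bits("1101")), (4, None)]),
    (16, None),
])

def serialize(root, n):
    """Breadth-first layout: one line per level from Ln down to L0. A level-i
    count uses 1 + 2*i bits, and only children of mixed nodes are stored."""
    lines = []
    current = [root]
    for level in range(n, -1, -1):
        width = 1 + 2 * level
        fields = [format(count, f"0{width}b") for count, _ in current]
        if level == 0:
            # Group the 1-bit leaf fields four at a time, one group per parent.
            fields = ["".join(fields[i:i + 4]) for i in range(0, len(fields), 4)]
        lines.append(" ".join(fields))
        current = [kid for _, kids in current if kids for kid in kids]
        if not current:
            break
    return lines

for line in serialize(PK1, 3):
    print(line)
```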
Next we note that there is an even more concise storage method
using this same breadth-first layout. Instead of storing quadrant 1-counts we
can simply store a "purity indicator". The counts can then be quickly reconstructed
from the purity mask tree (PMT) structure. At each node in the PMT
we use 3-value logic: 11=pure1; 00=pure0; and 01=mixed quadrants,
except at Level-0, where there are no mixed quadrants, so we can use
1=pure1 and 0=pure0.
PMTk1 01 L3
____________// \\___________
/ __/ \_ \
11 01 01 11 L2
___//|\ /|\\__
/ / | \ / | \ \
01 00 11 01 11 11 01 11 L1
1110 0010 1101 L0
store as:
01
11 01 01 11
01 00 11 01 11 11 01 11
1110 0010 1101
(for human understanding we can replace 2-bit symbols with
1-char symbols: 1=pure1; 0=pure0; m=mixed):
PMTk1 m L3
____________// \\___________
/ __/ \_ \
1 m m 1 L2
___//|\ /|\\__
/ / | \ / | \ \
m 0 1 m 1 1 m 1 L1
1110 0010 1101 L0
store as:
m
1 m m 1
m 0 1 m 1 1 m 1
1110 0010 1101
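Converting a count tree to a PMT, and recovering counts from a PMT, can be sketched as follows. The 1-char symbols 1, 0 and m are used, and the (symbol, children) representation and function names are assumptions made for this illustration.

```python
def bits(nibble):
    """Four 1x1 leaf nodes from a string such as '1110'."""
    return [(int(c), None) for c in nibble]

# Pk1 as (count, children); children is None for pure quadrants.
PK1 = (55, [
    (16, None),
    (8, [(3, bits("1110")), (0, None), (4, None), (1, bits("0010"))]),
    (15, [(4, None), (4, None), (3, bits("1101")), (4, None)]),
    (16, None),
])

def to_pmt(node, level):
    """Replace each count by '1' (pure1), '0' (pure0) or 'm' (mixed)."""
    count, kids = node
    if count == 4 ** level:
        return "1", None
    if count == 0:
        return "0", None
    return "m", [to_pmt(k, level - 1) for k in kids]

def count_of(pmt, level):
    """Recover the 1-bit count of a PMT node at the given level."""
    symbol, kids = pmt
    if symbol == "1":
        return 4 ** level
    if symbol == "0":
        return 0
    return sum(count_of(k, level - 1) for k in kids)

pmt = to_pmt(PK1, 3)
print(pmt[0], [k[0] for k in pmt[1]], count_of(pmt, 3))
```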
Before going further we note here how to AND
two of these using a depth-first AND algorithm on these breadth-first layouts:
PMTk1: m
1 m m 1
m 0 1 m 1 1 m 0
1110 0010 1101
PMTk2: m
m 1 m 0
0 0 m 1 m m m 0
1101 0110 0101 1110
then the PMTk1 AND PMTk2: root is m
Next, descend depth-first to:
v PMTk1
1 m m 1
m 0 1 m 1 1 m 0
1110 0010 1101
v PMTk2
m 1 m 0
0 0 m 1 m m m 0
1101 0110 0101 1110
Since quadrant (PMTk1)0 is pure1, (PMTk1 AND PMTk2)0 is
(PMTk2)0 (the part of PMTk2 to the left of the line):
m|1 m 0
|______
0 0 m 1 | m m m 0
____|
1101| 0110 0101 1110
Thus, so far,
PMTk1 AND PMTk2 has root, m, and lower levels:
m
0 0 m 1
1101
Next, descend depth-first to:
v PMTk1
1|m m 1
_|
m 0 1 m 1 1 m 0
1110 0010 1101
v PMTk2
m|1 m 0
|______
0 0 m 1 | m m m 0
____|
1101| 0110 0101 1110
Since quadrant (PMTk2)1 is pure1, (PMTk1 AND PMTk2)1 is
(PMTk1)1 (the part of PMTk1 between the lines):
v PMTk1
1|m|m 1
_| |____
m 0 1 m |1 1 m 0
|_
1110 0010 | 1101
Thus, so far,
PMTk1 AND PMTk2 has root, m, and lower levels:
m m
0 0 m 1 m 0 1 m
1101 1110 0010
Next, descend depth-first to:
v PMTk1
1 m|m 1
|____
m 0 1 m |1 1 m 0
|_
1110 0010 | 1101
v PMTk2
m 1|m 0
|____
0 0 m 1 | m m m 0
____|
1101| 0110 0101 1110
Since both are mixed, install m in PMTk1 AND PMTk2 and then descend another level:
1 m|m 1 PMTk1
|____ v
m 0 1 m |1 1 m 0
|_
1110 0010 | 1101
m 1|m 0 PMTk2
|____ v
0 0 m 1 | m|m m 0
____| |
1101| 0110 |0101 1110
Since quadrant (PMTk1)2.0 is pure1, (PMTk1 AND PMTk2)2.0 is
(PMTk2)2.0 (the part of PMTk2 between the lines):
Thus, so far,
PMTk1 AND PMTk2 has root, m, and lower levels:
m m m
0 0 m 1 m 0 1 m m
1101 1110 0010 0110
Next, descend depth-first:
1 m|m 1 PMTk1
|______ v
m 0 1 m 1|1 m 0
|
1110 0010 | 1101
m 1|m 0 PMTk2
|_______ v
0 0 m 1 m|m|m 0
| |__
1101 0110 |0101| 1110
Since quadrant (PMTk1)2.1 is pure1, (PMTk1 AND PMTk2)2.1 is
(PMTk2)2.1 (the part of PMTk2 between the lines):
Thus, so far,
PMTk1 AND PMTk2 has root, m, and lower levels:
m m m
0 0 m 1 m 0 1 m m m
1101 1110 0010 0110 0101
Next, descend depth-first:
1 m|m 1 PMTk1
|________ v
m 0 1 m 1 1|m 0
|
1110 0010 |1101
m 1|m 0 PMTk2
|_________ v
0 0 m 1 m m|m 0
|__
1101 0110 0101| 1110
Since both are mixed, install m and descend
(which at L0 is just to AND the nibbles):
1 m|m 1 PMTk1
|________
m 0 1 m 1 1|m 0
|v
1110 0010 |1101
m 1|m 0 PMTk2
|_________
0 0 m 1 m m|m 0
|__ v
1101 0110 0101| 1110
Thus, so far,
PMTk1 AND PMTk2 has root, m, and lower levels:
m m m
0 0 m 1 m 0 1 m m m m
1101 1110 0010 0110 0101 1100
Next, descend depth-first:
1 m|m 1 PMTk1
|__________ v
m 0 1 m 1 1 m|0
|___
1110 0010 1101 |
m 1|m 0 PMTk2
|___________ v
0 0 m 1 m m m|0
|______
1101 0110 0101 1110|
Since quadrant (PMTk1)2.3 is pure0, (PMTk1 AND PMTk2)2.3 is pure0
Finally, ascend to quadrant 3: since (PMTk2)3 is pure0, (PMTk1 AND PMTk2)3 is pure0
Thus,
PMTk11 = PMTk1 AND PMTk2 has root, m, and lower levels:
m m m 0
0 0 m 1 m 0 1 m m m m 0
1101 1110 0010 0110 0101 1100
Implementation notes:
- The lines can be replaced by pointers or cursors
- One can view this entirely in terms of shifting cells
from one of the operands to the result:
v
1 m m 1 PMTk1
m 0 1 m 1 1 m 0
1110 0010 1101
v
m 1 m 0 PMTk2
0 0 m 1 m m m 0
1101 0110 0101 1110
Since quadrant (PMTk1)0 is pure1, (PMTk11)0 is
(PMTk2)0 (shift subtree to the left of the line to result)
m|1 m 0
|______
0 0 m 1 | m m m 0
____|
1101| 0110 0101 1110
Thus, so far, PMTk11 is:
m
0 0 m 1
1101
Next, shift from PMTk1.1 to PMTk11.1 (since PMTk2.1 is pure1)
v
m|m 1 PMTk1
|______
m 0 1 m |1 1 m 0
|_
1110 0010 |1101
v
1 m 0 PMTk2
m m m 0
0110 0101 1110
Thus, so far, PMTk11 is:
m m
0 0 m 1 m 0 1 m
1101 1110 0010
Next, shift m to PMTk11.2 from both and descend (since both PMTki.2 are mixed)
v
m 1 PMTk1
1 1 m 0
1101
v
m 0 PMTk2
m m m 0
0110 0101 1110
Thus, so far, PMTk11 is:
m m m
0 0 m 1 m 0 1 m
1101 1110 0010
Next, shift from PMTk2.2.0 to PMTk11.2.0 (since PMTk1.2.0 is pure1)
1 PMTk1
v
1 1 m 0
1101
0 PMTk2
v
m|m m 0
|__
0110|0101 1110
Thus, so far, PMTk11 is:
m m m
0 0 m 1 m 0 1 m m
1101 1110 0010 0110
Next, shift from PMTk2.2.1 to PMTk11.2.1 (since PMTk1.2.1 is pure1)
1 PMTk1
v
1 m 0
1101
0 PMTk2
v
m| m 0
|__
0101|1110
Thus, so far, PMTk11 is:
m m m
0 0 m 1 m 0 1 m m m
1101 1110 0010 0110 0101
Next, shift m from both to PMTk11.2.2 and descend (since both PMTki.2.2 are mixed)
(since the descent is to L0, AND)
1 PMTk1
v
m 0
1101
0 PMTk2
v
m 0
1110
Thus, so far, PMTk11 is:
m m m
0 0 m 1 m 0 1 m m m m
1101 1110 0010 0110 0101 1100
Next, shift 0 from PMTk1.2.3 to PMTk11.2.3 (since PMTk1.2.3 is pure0)
1 PMTk1
v
0
0 PMTk2
v
0
Thus, so far, PMTk11 is:
m m m
0 0 m 1 m 0 1 m m m m 0
1101 1110 0010 0110 0101 1100
Ascend (quadrant 2 is complete). Next, shift 0 from PMTk2.3 to PMTk11.3 (since PMTk2.3 is pure0)
v
1 PMTk1
v
0 PMTk2
Thus, finally, PMTk11 is:
m m m 0
0 0 m 1 m 0 1 m m m m 0
1101 1110 0010 0110 0101 1100
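The AND walk-through above can be sketched in Python. This is an illustration under
stated assumptions, not the implementation used for P-trees: for readability it
first parses each breadth-first layout into a nested tree, ANDs the two trees
depth-first, and then writes the result back in breadth-first layout, rather than
moving cursors over the layouts directly. The names parse, pmt_and and layout are
hypothetical. The sketch also collapses a mixed node to pure0 when all four
children of the result are pure0.

```python
def parse(levels):
    """Breadth-first level lists -> nested tree ('0', '1', or a list of 4 children)."""
    cursor = [0] * len(levels)

    def node(depth):
        symbol = levels[depth][cursor[depth]]
        cursor[depth] += 1
        if symbol != "m":
            return symbol
        if depth == len(levels) - 2:      # mixed L1 node: children are one nibble
            nibble = levels[depth + 1][cursor[depth + 1]]
            cursor[depth + 1] += 1
            return list(nibble)
        return [node(depth + 1) for _ in range(4)]

    return node(0)

def pmt_and(a, b):
    if a == "0" or b == "0":              # pure0 dominates
        return "0"
    if a == "1":                          # pure1: take the other operand's subtree
        return b
    if b == "1":
        return a
    children = [pmt_and(x, y) for x, y in zip(a, b)]   # both mixed: descend
    return "0" if all(c == "0" for c in children) else children

def layout(tree, n_levels):
    """Nested tree -> breadth-first level lists (the last level holds nibbles)."""
    levels = [[] for _ in range(n_levels)]

    def walk(node, depth):
        if isinstance(node, str):
            levels[depth].append(node)
            return
        levels[depth].append("m")
        if depth == n_levels - 2:
            levels[depth + 1].append("".join(node))
        else:
            for child in node:
                walk(child, depth + 1)

    walk(tree, 0)
    return levels

PMTK1 = [["m"], list("1mm1"), list("m01m11m0"), ["1110", "0010", "1101"]]
PMTK2 = [["m"], list("m1m0"), list("00m1mmm0"), ["1101", "0110", "0101", "1110"]]

for level in layout(pmt_and(parse(PMTK1), parse(PMTK2)), 4):
    print(" ".join(level))
```

Returning the other operand's subtree when one side is pure1 corresponds to the
"shift cells from one of the operands to the result" view in the implementation notes.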
*******************************************
A final storage arrangement is uncompressed PMTs using 4-value logic
and breadth-first layout (called PMTbr for "breadth-first and redundant"):
00=pure0 run
11=pure1 run
01=mixed run
10=uncompressed bit segment
For human readability we will enclose runlengths in:
() for pure0 run
[] for pure1 run
{} for mixed run
<> for uncompressed segment
PMT11:
m L5
1mm0 L4
01mm 001m L3
0m10 0010 1m10 L2
01m0 0100 L1
0011 L0
becomes
PMTbr11:
{1}
[1] {2} (1)
[4] (1) [1] {2} (2) [1] {1} (4)
[16] (4) [4] (1) {1} [1] (3) [1] (9) [5] {1} [1] (17)
[64] (16)[16](5) [1]{1} (1)[4] (12)[4] (36) [20](1)[1](2)[4] (68)
[256](64)[64](20)[4](2)[2](4)[16](48)[16](144)[80](4)[4](8)[16](272)
and
PMTbr12:
{1}
{1} (1) {1} [1]
[1] (2) {1} (4) [2] (2) [4]
[4] (9) {2} [1] (16) [8] (8) [16]
[16](38) [1]{3} [6] (64) [32] (32) [64]
[64](152)[6](5)[4](1)[24](256)[128](128)[256]
ANDing these:
{1} & {1} ={1}
[1]{2}(1) & {1}(1){1}[1] ={1}(1){1}(1)
[4](1)[1]{2}(2)[1]{1}(4) & [1](2){1}(4)[2](2)[4] =[1](2){1}(12)
[16](4)[4](1){1}[1](3)[1](9)[5]{1}[1](17) &
[4](9){2}[1](16)[8](8)[16] =[4](9){2}[1](48)
[64](16)[16](5)[1]{1}(1)[4](12)[4](36)[20](1)[1](2)[4](68) &
[16](38)[1]{3}[6](64)[32](32)[64] =[16](38)[1]{3}[6](192)
[256](64)[64](20)[4](2)[2](4)[16](48)[16](144)[80](4)[4](8)[16](272) &
[64](152)[6](5)[4](1)[24](256)[128](128)[256] =[64](152)[6](5)[4](1)[24](768)
ANDing and ORing are as above (except that OR with purei is just the reverse of
AND with purei)
To complement, COMP(PMT): swap () and []
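A level-by-level AND on the run-length notation can be sketched as below. This is
an illustration under assumptions: runs are parsed from the bracket notation into
(symbol, length) pairs, two mixed cells are recorded as mixed (their true value is
only settled at lower levels), and <> uncompressed segments are not handled. The
function names are hypothetical.

```python
import re

SYMBOL = {"(": "0", "[": "1", "{": "m"}
BRACKETS = {"0": "()", "1": "[]", "m": "{}"}

def parse_runs(text):
    """'[4](1){2}' -> [('1', 4), ('0', 1), ('m', 2)]"""
    return [(SYMBOL[b], int(n)) for b, n in re.findall(r"([(\[{])(\d+)[)\]}]", text)]

def format_runs(runs):
    return "".join(f"{BRACKETS[s][0]}{n}{BRACKETS[s][1]}" for s, n in runs)

def and_cell(a, b):
    if a == "0" or b == "0":
        return "0"
    if a == "1":
        return b
    return a if b == "1" else "m"

def and_runs(x, y):
    """AND two run lists that cover the same number of cells."""
    out, i, j = [], 0, 0
    left_x, left_y = x[0][1], y[0][1]
    while i < len(x) and j < len(y):
        step = min(left_x, left_y)            # cells covered by both current runs
        symbol = and_cell(x[i][0], y[j][0])
        if out and out[-1][0] == symbol:      # merge with the previous run
            out[-1] = (symbol, out[-1][1] + step)
        else:
            out.append((symbol, step))
        left_x -= step
        left_y -= step
        if left_x == 0:
            i += 1
            left_x = x[i][1] if i < len(x) else 0
        if left_y == 0:
            j += 1
            left_y = y[j][1] if j < len(y) else 0
    return out

def complement(runs):
    """COMP: swap pure0 and pure1 runs; mixed runs stay mixed."""
    return [({"0": "1", "1": "0"}.get(s, s), n) for s, n in runs]

l3 = and_runs(parse_runs("[4](1)[1]{2}(2)[1]{1}(4)"), parse_runs("[1](2){1}(4)[2](2)[4]"))
print(format_runs(l3))
```

Only run boundaries are visited, so the work is proportional to the number of runs
rather than the number of cells.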
To extract a subquadrant, say, qid 0.0.2.3, from PMT12:
{1} L5
{1} (1) {1} [1] L4
[1] (2) {1} (4) [2] (2) [4] L3
[4] (9) {2} [1] (16) [8] (8) [16] L2
[16](38) [1]{3} [6] (64) [32] (32) [64] L1
[64](152)[6](5)[4](1)[24](256)[128](128)[256] L0
cut final 3 from L4, final 12 from L3, final 48 from L2, final 192 from L1,
final 768 from L0 (due to 0.0 qid segment)
{1} L5
{1} L4
[1] (2) {1} L3
[4] (9) {2} [1] L2
[16](38) [1]{3} [6] L1
[64](152)[6](5)[4](1)[24] L0
cut initial 2 from L3, initial 8 from L2, initial 32 from L1,
initial 128 from L0 and cut final 1 from L3, final 4 from L2,
final 16 from L1, final 64 from L0 (due to 0.0.2 qid segment)
(1) L3
(4) L2
(16) L1
(64) L0
cut initial 3 from L2, initial 12 from L1,
initial 48 from L0 (due to 0.0.2.3 qid segment)
(1) L2
(4) L1
(16) L0
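The cutting steps above amount to keeping, at each lower level, the block of cells
that lies under the chosen node. A minimal sketch, assuming the leading 0 of the qid
names the root (so the path below the root is 0, 2, 3) and using hypothetical helper
names: the node picked by the path has index idx at its own level, and j levels
further down it covers cells idx*4**j up to (idx+1)*4**j.

```python
import re

SYMBOL = {"(": "0", "[": "1", "{": "m"}
BRACKETS = {"0": "()", "1": "[]", "m": "{}"}

def parse_runs(text):
    return [(SYMBOL[b], int(n)) for b, n in re.findall(r"([(\[{])(\d+)[)\]}]", text)]

def format_runs(runs):
    return "".join(f"{BRACKETS[s][0]}{n}{BRACKETS[s][1]}" for s, n in runs)

def slice_runs(runs, start, length):
    """Keep only the cells in [start, start + length)."""
    out, pos = [], 0
    for symbol, n in runs:
        lo, hi = max(pos, start), min(pos + n, start + length)
        if lo < hi:
            out.append((symbol, hi - lo))
        pos += n
    return out

def extract(levels, path):
    """levels[0] is the root level; path lists quadrant numbers below the root."""
    idx = 0
    for quadrant in path:
        idx = idx * 4 + quadrant
    k = len(path)
    return [slice_runs(levels[k + j], idx * 4 ** j, 4 ** j)
            for j in range(len(levels) - k)]

PMTBR12 = [parse_runs(s) for s in (
    "{1}",
    "{1}(1){1}[1]",
    "[1](2){1}(4)[2](2)[4]",
    "[4](9){2}[1](16)[8](8)[16]",
    "[16](38)[1]{3}[6](64)[32](32)[64]",
    "[64](152)[6](5)[4](1)[24](256)[128](128)[256]",
)]

for runs in extract(PMTBR12, [0, 2, 3]):
    print(format_runs(runs))
```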
*****************************************
A simpler example (with only 16 pixels but 4-bit values):
X-Y B1 B2 B3 B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
B11 B12 B13 B14
0000 0011 1111 1111
0000 0011 1111 1111
0011 0001 1111 0001
0111 0011 1111 0011
The "basic" data structures:
P1,1 P1,2 P1,3 P1,4
5 7 16 11
0 0 1 4 0 4 0 3 4 4 0 3
0001 0111 0111
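The basic P-trees for band B1 can be recomputed from the table with a short sketch.
It assumes the first coordinate of X-Y is the image row, that bit 1 is the most
significant bit, and that quadrants are listed upper-left, upper-right, lower-left,
lower-right; the helper names are hypothetical.

```python
# Band B1 of the 16-pixel example, one list of 4-bit values per image row.
B1 = [
    ["0011", "0011", "0111", "0111"],
    ["0011", "0011", "0111", "0111"],
    ["0010", "0010", "1010", "1111"],
    ["0010", "1010", "1111", "1111"],
]

def bit_plane(band, j):
    """Bit plane j (1 = most significant) as a 4x4 grid of 0/1."""
    return [[int(value[j - 1]) for value in row] for row in band]

def quadrant_counts(plane):
    """1-counts of the four 2x2 quadrants in the order UL, UR, LL, LR."""
    counts = []
    for r0 in (0, 2):
        for c0 in (0, 2):
            counts.append(sum(plane[r][c]
                              for r in (r0, r0 + 1)
                              for c in (c0, c0 + 1)))
    return counts

for j in range(1, 5):
    quads = quadrant_counts(bit_plane(B1, j))
    print(f"P1,{j}: root-count {sum(quads)}, quadrant counts {quads}")
```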
P1,00 P1,01 P1,10 P1,11
7 4 2 3
4 0 3 0 0 4 0 0 0 0 1 1 0 0 0 3
1110 0001 1000 0111
P1,000 P1,010 P1,100 P1,110 P1,001 P1,011 P1,101 P1,111
0 0 0 0 7 4 2 3
4 0 3 0 0 4 0 0 0 0 1 1 0 0 0 3
1110 0001 1000 0111
P1,0000 P1,0100 P1,1000 P1,1100 P1,0010 P1,0110 P1,1010 P1,1110
0 0 0 0 3 0 2 0
0 0 3 0 0 0 1 1
1110 0001 1000
P1,0001 P1,0101 P1,1001 P1,1101 P1,0011 P1,0111 P1,1011 P1,1111
0 0 0 0 4 4 0 3
4 0 0 0 0 4 0 0 0 0 0 3
0111
B21 B22 B23 B24
0000 1000 1111 1110
0000 1000 1111 1110
1111 0000 1111 1100
1111 0000 1111 1100
P2,1 P2,2 P2,3 P2,4
8 2 16 10
0 0 4 4 2 0 0 0 4 4 4 4 4 2 4 0
1010 1010
P2,00 P2,01 P2,10 P2,11
6 2 8 0
2 4 0 0 2 0 0 0 0 0 4 4
0101 1010
P2,000 P2,010 P2,100 P2,110 P2,001 P2,011 P2,101 P2,111
0 0 0 0 6 2 8 0
2 4 0 0 2 0 0 0 0 0 4 4
0101 1010
P2,0000 P2,0100 P2,1000 P2,1100 P2,0010 P2,0110 P2,1010 P2,1110
0 0 0 0 2 0 4 0
0 2 0 0 0 0 0 4
0101
P2,0001 P2,0101 P2,1001 P2,1101 P2,0011 P2,0111 P2,1011 P2,1111
0 0 0 0 4 2 4 0
2 2 0 0 2 0 0 0 0 0 4 0
0101 1010 1010
B31 B32 B33 B34
1100 0011 0000 0001
1100 0011 0000 0001
1100 0011 0000 0000
1100 0011 0000 0000
P3,1 P3,2 P3,3 P3,4
8 8 0 2
4 0 4 0 0 4 0 4 0 2 0 0
0101
P3,00 P3,01 P3,10 P3,11
0 8 8 0
0 4 0 4 4 0 4 0
P3,000 P3,010 P3,100 P3,110 P3,001 P3,011 P3,101 P3,111
0 8 8 0 0 0 0 0
0 4 0 4 4 0 4 0
P3,0000 P3,0100 P3,1000 P3,1100 P3,0010 P3,0110 P3,1010 P3,1110
0 6 8 0 0 0 0 0
0 2 0 4 4 0 4 0
1010
P3,0001 P3,0101 P3,1001 P3,1101 P3,0011 P3,0111 P3,1011 P3,1111
0 2 0 0 0 0 0 0
0 2 0 0
0101
B41 B42 B43 B44
1111 0100 1111 1111
1111 0000 1111 1111
1111 1100 1111 1111
1111 1100 1111 1111
P4,1 P4,2 P4,3 P4,4
16 5 16 16
1 0 4 0
0100
P4,00 P4,01 P4,10 P4,11
0 0 11 5
3 4 0 4 1 0 4 0
1011 0100
P4,000 P4,010 P4,100 P4,110 P4,001 P4,011 P4,101 P4,111
0 0 0 0 0 0 11 5
3 4 0 4 1 0 4 0
1011 0100
P4,0000 P4,0100 P4,1000 P4,1100 P4,0010 P4,0110 P4,1010 P4,1110
0 0 0 0 0 0 0 0
P4,0001 P4,0101 P4,1001 P4,1101 P4,0011 P4,0111 P4,1011 P4,1111
0 0 0 0 0 0 11 5
3 4 0 4 1 0 4 0
1011 0100
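The root counts of the value P-trees listed above (P1,00, P1,0011, and so on) can be
checked directly against the 16-pixel table. In this sketch the root count of
Pb,prefix is taken to be the number of pixels whose band-b value starts with that
bit prefix; the name value_count is hypothetical.

```python
# The 16-pixel table in raster order: B1 B2 B3 B4 for each pixel.
TABLE = """
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011
"""
PIXELS = [line.split() for line in TABLE.strip().splitlines()]

def value_count(band, prefix):
    """Root count of P(band, prefix): pixels whose band value starts with prefix."""
    return sum(1 for pixel in PIXELS if pixel[band - 1].startswith(prefix))

for band in range(1, 5):
    counts = [value_count(band, p) for p in ("00", "01", "10", "11")]
    print(f"B{band}: 00, 01, 10, 11 ->", counts)
```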
How can these P-trees be used in Apriori Association Rule Mining?
The Apriori Algorithm for discovering all frequent itemsets:
First determine all frequent itemsets:
start with 1-itemsets,
then only consider unions of frequent 1-itemsets as candidate 2-itemsets, etc.
Next, for each frequent itemset found, search for the high-confidence rules it supports,
by trying all 1-item consequents first (so that the antecedent is maximal size),
then try 2-item consequents, but only those that are the union of 1-item consequents
of high-confidence rules. etc.
Assume that B1 is Yield, B2=blue, B3=green, B4=Red
Spatial Data mining
I={(b,v)|b=band=1..n, v=value=(1-bit or 2-bit or ...or 8 bit)}
T={pixels}
Admissible Itemsets (Asets)= {Int1 x Int2 x ... x Intn}
where Inti is an interval in Band-i (some may be the entire band).
Modeled on Apriori, we first find all frequent itemsets.
- pruned by specifying "restricted interest" (e.g., If Bn=Yield, user
may wish to restrict attention to those Asets for which Intn is not all
of Bn. At the 1-bit value level in the value concept hierarchy,
this means either y<128 or y>=128. Then the user may want to restrict
interest to those rules for which the consequent is Intn only)
For a frequent Aset, B=PROD(i=1..n)[Inti], rules are formed by partitioning
{1..n} into two disjoint sets, {i1..im} and {j1..jq} (note q+m=n)
and forming A=>C, where A=PROD(k=1..m)[Intik] and C=PROD(k=1..q)[Intjk]
As noted above, users may be interested only in rules where {j1..jq}
is a certain subset (such as {n}, above - then there is just one rule
of interest for each frequent set found and it must be checked as to
whether it is highconfidence or not). If this is the case we further
restrict our definition of Asets to only those itemsets.
- For restricted interest above, q=1, and C=Int1 (in the Yield band)
For rule, A=>C above, the support is the support of B and thus is the
count of pixels, p, such that p(i) is in Inti for all i=1..n.
The confidence of a rule, A=>C, is its support divided by the support of A
(supp(A) is the count of pixels, p, such that p(ik) is in Intik for all k=1..m).
- In restricted interest case, with B1=Yield B2=blue, B3=green, B4=red,
we need to calculate supp(B)=supp(Int1xInt2xInt3xInt4); calculate
supp(A)=supp(Int2xInt3xInt4).
If supp(B) >= minsup and supp(B)/supp(A) >= minconf,
then A=>C is a strong rule.
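The support and confidence definitions above can be sketched directly on the
16-pixel table, without P-trees. This is a brute-force illustration under
assumptions: an Aset is a dictionary from band number to an inclusive interval of
integer values, unrestricted bands are simply omitted, and the function names are
hypothetical.

```python
# The 16-pixel table in raster order: B1 B2 B3 B4 for each pixel.
TABLE = """
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011
"""
PIXELS = [tuple(int(v, 2) for v in line.split())
          for line in TABLE.strip().splitlines()]

def support(aset):
    """Count of pixels p with p(i) in Inti for every restricted band i."""
    return sum(all(lo <= p[band - 1] <= hi for band, (lo, hi) in aset.items())
               for p in PIXELS)

def confidence(antecedent, consequent):
    """supp(A x C) / supp(A); assumes supp(A) > 0."""
    return support({**antecedent, **consequent}) / support(antecedent)

# 1-bit intervals on 4-bit values: [0,0] is 0..7 and [1,1] is 8..15.
A = {2: (0, 7), 4: (8, 15)}
C = {1: (0, 7)}
print(support({**A, **C}), support(A), confidence(A, C))
```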
A k-band Aset (kAset) is an Aset in which k of the Inti intervals are non-full
(i.e., in k of the bands the intervals are restricted - i.e., not the fully
unrestricted intervals of all possible values)
We start by finding all frequent 1Asets.
Then the candidate 2Asets are those whose every 1Aset subset is frequent.
Etc., the candidate kAsets are those whose every (k-1)Aset subset is frequent.
That's the main pruning technique.
Next we look for a pruning technique based on the value concept hierarchy.
If we find all 1-bit frequent kAsets first, we can use the fact that a
2-bit kAset cannot be frequent if its "enclosing" 1-bit kAset is infrequent.
A 1-bit Aset "encloses" a 2-bit Aset if when the endpoints of the 2-bit Aset
are shifted right 1-bit position, it is a subset of the 1-bit Aset, etc.
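The enclosure test can be written as a shift and a comparison. A minimal sketch,
assuming intervals are pairs of integer endpoints and an Aset is a dictionary from
band to interval (both assumptions for illustration, as are the function names):

```python
def encloses(coarse, fine, shift=1):
    """True if the fine interval, with endpoints shifted right, lies in the coarse one."""
    lo, hi = fine
    return coarse[0] <= (lo >> shift) and (hi >> shift) <= coarse[1]

def aset_encloses(coarse_aset, fine_aset, shift=1):
    """Band-by-band enclosure for Asets given as {band: (lo, hi)}."""
    return all(encloses(coarse_aset[band], interval, shift)
               for band, interval in fine_aset.items())

print(encloses((0, 0), (0b00, 0b01)))   # 2-bit [00,01] against 1-bit [0,0]
print(encloses((0, 0), (0b01, 0b10)))   # 2-bit [01,10] against 1-bit [0,0]
```

If the enclosing 1-bit Aset is infrequent, the enclosed 2-bit Aset can be pruned
without computing its support.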
Assume minsupp=50% and minconf=50%
1. FIND ALL FREQUENT 1Asets.
for 1-bit values
for B1
2 possibilities for Int1: [1,1] and [0,0]
P1,1
5
0 0 1 4
0001
supp([1,1]1)=5 not frequent
supp([0,0]1)=11 frequent
for B2
2 possibilities for Int2: [1,1] and [0,0]
P2,1
8
0 0 4 4
supp([1,1]2)=8 frequent
supp([0,0]2)=8 frequent
for B3
2 possibilities for Int3: [1,1] and [0,0]
P3,1
8
4 0 4 0
supp([1,1]3)=8 frequent
supp([0,0]3)=8 frequent
for B4
2 possibilities for Int4: [1,1] and [0,0]
P4,1
16
supp([1,1]4)=16 frequent
supp([0,0]4)=0 not frequent
1L1 (1-bit value frequent 1Asets):
supp([0,0]1)=11 frequent
supp([1,1]2)=8 frequent
supp([0,0]2)=8 frequent
supp([1,1]3)=8 frequent
supp([0,0]3)=8 frequent
supp([1,1]4)=16 frequent
1C2 (1-bit-value candidate 2Asets):
[0,0]1x[1,1]2 (support = root-count of P1,0 & P2,1 = 3, no)
[0,0]1x[0,0]2 (support = root-count of P1,0 & P2,0 = 8, yes)
[0,0]1x[1,1]3 (support = root-count of P1,0 & P3,1 = 7, no)
[0,0]1x[0,0]3 (support = root-count of P1,0 & P3,0 = 4, no)
[0,0]1x[1,1]4 (support = root-count of P1,0 & P4,1 = 11, yes)
[0,0]2x[1,1]3 (support = root-count of P2,0 & P3,1 = 4, no)
[0,0]2x[0,0]3 (support = root-count of P2,0 & P3,0 = 4, no)
[0,0]2x[1,1]4 (support = root-count of P2,0 & P4,1 = 8, yes)
[1,1]2x[1,1]3 (support = root-count of P2,1 & P3,1 = 4, no)
[1,1]2x[0,0]3 (support = root-count of P2,1 & P3,0 = 4, no)
[1,1]2x[1,1]4 (support = root-count of P2,1 & P4,1 = 8, yes)
[0,0]3x[1,1]4 (support = root-count of P3,0 & P4,1 = 8, yes)
[1,1]3x[1,1]4 (support = root-count of P3,1 & P4,1 = 8, yes)
from:
P1,0
11
4 4 3 0
1110
P2,1 P2,0
8 8
0 0 4 4 4 4 0 0
P3,1 P3,0
8 8
4 0 4 0 0 4 0 4
P4,1
16
1L2 (1-bit value frequent 2Asets):
[0,0]1x[0,0]2 (support = root-count of P1,0 & P2,0 = 8, yes)
[0,0]1x[1,1]4 (support = root-count of P1,0 & P4,1 = 11, yes)
[0,0]2x[1,1]4 (support = root-count of P2,0 & P4,1 = 8, yes)
[1,1]2x[1,1]4 (support = root-count of P2,1 & P4,1 = 8, yes)
[0,0]3x[1,1]4 (support = root-count of P3,0 & P4,1 = 8, yes)
[1,1]3x[1,1]4 (support = root-count of P3,1 & P4,1 = 8, yes)
1C3 (1-bit-value candidate 3Asets):
[0,0]1x[0,0]2x[1,1]4 (support = rc of P1,0 & P2,0 & P4,1 = 8, yes)
1L3 (1-bit-value frequent 3Asets):
[0,0]1x[0,0]2x[1,1]4 (support = rc of P1,0 & P2,0 & P4,1 = 8, yes)
Only the following frequent sets involve Yield:
[0,0]1x[0,0]2 (support = root-count of P1,0 & P2,0 = 8, yes)
[0,0]1x[1,1]4 (support = root-count of P1,0 & P4,1 = 11, yes)
[0,0]1x[0,0]2x[1,1]4 (support = rc of P1,0 & P2,0 & P4,1 = 8, yes)
and the rules which can be formed with yield as the consequent are:
[0,0]2 => [0,0]1 (support = 8)
[1,1]4 => [0,0]1 (support = 11)
[0,0]2x[1,1]4 => [0,0]1 (support = 8)
The supports of the antecedents are:
supp([0,0]2) = 8
supp([1,1]4) = 16
supp([0,0]2x[1,1]4) = 8
The confidences are:
conf( [0,0]2 => [0,0]1 ) = 1
conf( [1,1]4 => [0,0]1 ) = 11/16
conf( [0,0]2x[1,1]4 => [0,0]1 ) = 1
All are strong rules.
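The 1-bit walk-through above can be reproduced with a short sketch. As a stated
simplification, each P-tree is replaced here by a flat 16-bit mask of its bit plane
in raster order, so that the P-tree AND becomes a bitwise AND and the root count
becomes a bit count; the function names are hypothetical and the minimum support is
given as a pixel count.

```python
from fractions import Fraction
from itertools import combinations

# High-order bit planes B11, B21, B31, B41 in raster order (from the tables above).
PLANES = {1: "0000000000110111", 2: "0000000011111111",
          3: "1100110011001100", 4: "1111111111111111"}
FULL = (1 << 16) - 1

def mask(item):
    band, bit = item   # (band, 1) stands for [1,1]; (band, 0) stands for [0,0]
    plane = int(PLANES[band], 2)
    return plane if bit else FULL ^ plane

def support(itemset):
    result = FULL
    for item in itemset:
        result &= mask(item)
    return bin(result).count("1")

def frequent_asets(minsup):
    """Return [L1, L2, ...], each a set of frozensets of (band, bit) items."""
    level = {frozenset([(b, v)]) for b in PLANES for v in (0, 1)
             if support([(b, v)]) >= minsup}
    levels = []
    while level:
        levels.append(level)
        k = len(levels) + 1
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k
                      and len({band for band, _ in a | b}) == k}
        level = {c for c in candidates
                 if all(frozenset(s) in levels[-1] for s in combinations(c, k - 1))
                 and support(c) >= minsup}
    return levels

def yield_rules(levels):
    """Confidence of each rule A => C whose consequent C is the band-1 item."""
    rules = {}
    for level in levels[1:]:
        for aset in level:
            consequent = {item for item in aset if item[0] == 1}
            if consequent:
                antecedent = aset - consequent
                rules[antecedent] = Fraction(support(aset), support(antecedent))
    return rules

levels = frequent_asets(minsup=8)   # 50% of 16 pixels
print([len(level) for level in levels])
for antecedent, conf in yield_rules(levels).items():
    print(sorted(antecedent), "=> [0,0]1 with confidence", conf)
```

Calling frequent_asets(minsup=10) corresponds to a 60% threshold, since 60% of 16
pixels is 9.6 and supports are whole pixel counts.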
Assume minsupp=60% and minconf=60%
1. FIND ALL FREQUENT 1Asets.
for 1-bit values
for B1
2 possibilities for Int1: [1,1] and [0,0]
P1,1
5
0 0 1 4
0001
supp([1,1]1)=5 not frequent
supp([0,0]1)=11 frequent
for B2
2 possibilities for Int2: [1,1] and [0,0]
P2,1
8
0 0 4 4
supp([1,1]2)=8 not frequent
supp([0,0]2)=8 not frequent
for B3
2 possibilities for Int3: [1,1] and [0,0]
P3,1
8
4 0 4 0
supp([1,1]3)=8 not frequent
supp([0,0]3)=8 not frequent
for B4
2 possibilities for Int4: [1,1] and [0,0]
P4,1
16
supp([1,1]4)=16 frequent
supp([0,0]4)=0 not frequent
1L1 (1-bit value frequent 1Asets):
[0,0]1 =11
[1,1]4 =16
1C2 (1-bit-value candidate 2Asets):
[0,0]1x[1,1]4 (support = root-count of P1,0 & P4,1 = 11, yes)
from:
P1,0
11
4 4 3 0
1110
P2,1 P2,0
8 8
0 0 4 4 4 4 0 0
P3,1 P3,0
8 8
4 0 4 0 0 4 0 4
P4,1
16
1L2 (1-bit value frequent 2Asets):
[0,0]1x[1,1]4 (support = root-count of P1,0 & P4,1 = 11, yes)
1C3 (1-bit-value candidate 3Asets): empty
The supports of the antecedent:
supp([1,1]4) = 16
conf( [1,1]4 => [0,0]1 ) = 11/16
This one rule is a strong rule.
The frequent 1-bit 1Asets were:
[0,0]1
[1,1]4
Therefore the infrequent 1-bit 1Asets are:
[1,1]1
[0,0]2
[1,1]2
[0,0]3
[1,1]3
[0,0]4
Which means all enclosed 2-bit subintervals are infrequent:
0 1
00 01 10 11
000 001 010 011 100 101 110 111
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
The candidate 2-bit band-1 intervals have left endpt in the 0-subtree.
([00,01] is [0,0] and [00,10] is a superset, so both are frequent)
[00,01] [00,10] are already known to be frequent
Others to consider are:
[00,00] [01,01] [01,10] [01,11]
For [00,00] we use P1,00, count=7, not frequent.
For [01,01] we use P1,01, count=4, not frequent.
For [01,10] we use P1,01 OR P1,10; if it is frequent, so is [01,11].
Otherwise, for [01,11] we use P1,01 OR P1,10 OR P1,11
P1,01 OR P1,10:
6
0 4 1 1
0001 1000
So for [01,10], P1,01 OR P1,10 gives count=6, not frequent.
P1,01 OR P1,10 OR P1,11
9
0 4 1 4
0001
So for [01,11], P1,01 OR P1,10 OR P1,11 gives count=9, not frequent.
Therefore the only new frequent 2-bit band-1 1Aset is:
[00,10]
etc.
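The 2-bit interval supports for band 1 can be checked with a small sketch. As a
simplification, each value P-tree P1,v is replaced by a flat bit mask over the 16
pixels, so the OR of P-trees becomes a bitwise OR and the root count a bit count.
The threshold of 10 pixels stands for 60% of 16 rounded up; the function names are
hypothetical.

```python
# Band B1 values of the 16 pixels in raster order.
B1_VALUES = ("0011 0011 0111 0111 0011 0011 0111 0111 "
             "0010 0010 1010 1111 0010 1010 1111 1111").split()
MINSUP = 10   # 60% of 16 pixels, rounded up

def value_mask(prefix):
    """Bit i is set when pixel i has a B1 value starting with prefix (stand-in for P1,prefix)."""
    mask = 0
    for i, value in enumerate(B1_VALUES):
        if value.startswith(prefix):
            mask |= 1 << i
    return mask

def interval_support(lo, hi):
    """Support of the 2-bit interval [lo, hi]: count of the OR of P1,lo .. P1,hi."""
    mask = 0
    for v in range(lo, hi + 1):
        mask |= value_mask(format(v, "02b"))
    return bin(mask).count("1")

for lo in range(4):
    for hi in range(lo, 4):
        s = interval_support(lo, hi)
        verdict = "frequent" if s >= MINSUP else "not frequent"
        print(f"[{lo:02b},{hi:02b}] support={s} {verdict}")
```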