unstructured query processing (data mining)
The whole point of a database is to persistently store (residualize) relationships among
data items for enterprise use.
Relationships, as studied by ER diagrams and other tools for modeling data
(Data Engineering), are described using relations (or tables).
The relation, R(A1,...,An), defined on domains D1,...,Dn is a degree=n
relationship among values in the n domains.
Any relationship can be diagrammed (or pictured) using a graph.
The graph of the relation(ship), R(A1,...,An) is an n-partite undirected graph
in which the n-way hyper-edges interconnect values from D1,...,Dn.
This hypergraph is difficult to draw usefully on a 2-D plane
(sheet of paper), and therefore is seldom attempted.
However, for a degree=2 relationship, R(A1,A2), drawing the bipartite
graph is very helpful in understanding and studying the relationship.
The bipartite graph is often called an x-y scatter plot, where
x is a variable on D1 and y is a variable on D2 and
each plot point represents a related pair (edge in the graph).
The relational model is a horizontal model in which the focus is on the edges
(horizontal data structure listing the nodes involved in that edge plus,
possibly, node labels and/or edge labels).
A Vertical Model focuses on the nodes.
E.g., for each D1-entity-node, the D1-centric vertical model
associates (or bit maps) the set of D2-entity-nodes
that are related to that D1-node.
And, for each D2-entity-node, the D2-centric vertical model
associates (or bit maps) the set of D1-entity-nodes
that are related to that D2-node.
Market basket data (MBR)
Data is organized vertically as a
TRANSACTION TABLE with 2 attributes: T(Tid, Itemset)
- A transaction is a customer transaction at a cash register.
- Each is given an identifier, Tid.
- Itemset is the set of items in the customer's "basket".
Note that tuples in T are not "flat" (each associated itemset is a "set")
That can be problematic for analysis, so typically
a transformation is made to the dual, the Boolean or Bitmap model:
Market Basket Data, the Boolean or Bitmapped Model:
Boolean Transaction Table: BT(Tid, Item-1, Item-2,... Item-n)
Tid is a transaction identifier
Each Item-i column is a Boolean column. Read across a row, the Item columns
form a bit vector which indicates which items that transaction relates to,
by turning on (to 1) only those bit positions corresponding to related items.
Clearly, we don't want to have to specify the correspondence (map)
between bit positions and items anew for every bit vector. Therefore we
do this mapping once and for all using a Domain Vector Table or DVT.
The DVT need not map the entire domain but only the
Extant Domain (of all currently existing values)
So a 1-bit means that item is in the market basket and a
0-bit means that item is not in the market basket.
Again, in any bipartite relationship between two entities, T and I
(eg, Customer Transactions and purchasable items in Market Basket Research),
there are always two vertical models,
one in which we focus on D1-entity nodes
(and list or map, for each, the set of related D2-entity-nodes)
the other in which we focus on D2-entity nodes
(and list or map, for each, the set of related D1-entity-nodes)
Thus in MBR, one has the dual vertical model, I(Iid, TransSet)
which, for each item, lists or bitmaps the transactions involving it.
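A minimal sketch of the two vertical (bitmap) views, assuming a tiny in-memory
transaction table. The sample baskets and the Python names dvt, tids, BT and BI
are illustrative assumptions, not part of any particular system:

```python
# Horizontal, non-flat table T(Tid, Itemset); the data is made up for illustration.
T = {
    100: {"a", "c", "d"},
    200: {"b", "c", "e"},
    300: {"a", "b", "c", "e"},
    400: {"b", "e"},
}

# Domain Vector Table: fixes, once, the bit position of every extant item.
dvt = sorted(set().union(*T.values()))   # ['a', 'b', 'c', 'd', 'e']
tids = sorted(T)                         # fixes the bit position of every Tid

# BT: one bit vector per transaction (bit i is 1 iff item dvt[i] is in the basket).
BT = {tid: [int(item in T[tid]) for item in dvt] for tid in tids}

# BI: the dual view, one bit vector per item (bit j is 1 iff tids[j] contains it).
BI = {item: [int(item in T[tid]) for tid in tids] for item in dvt}

print(BT[300])   # [1, 1, 1, 0, 1]
print(BI["c"])   # [1, 1, 1, 0]
```

Both views hold the same relationship; BT is indexed by transaction and BI by item.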
Note that in MBR, T, BT, I and BI usually only record existence/non-existence
of each item in market baskets, not the number of particular items.
We can think of it this way: An item id is a UPC (Universal Product Code)
or barcode only (identifying the "type" not the instances of that item).
With nano-RFID tags, ePC will be used (electronic Product Code) which not
only distinguishes type but also instance of an item (like VINs for autos).
When RFID item identification becomes ubiquitous, we will need to analyze
by ePC not just UPC.
Much research still needs to be done on analyzing data
where the number of each item (the counts) is important (e.g., ePC tagged items).
We can treat that situation by identifying instances of items and using T or BT
above, or by using
COUNT TRANSACTION TABLE: CT(Tid, Item-1, Item-2, ..., Item-n),
where values are the counts of each (UPC id-ed) item in that trans.
Note: T(Tid,ItemSet) and BT(Tid,I1..In)
don't take account of hierarchical structures of items
- e.g., in T, milk is an item at one level,
- milk breaks down into skim, whole, 2%.. at a finer level...
- Work on hierarchical MBR is ongoing. New ideas are welcome!
In the Market Basket Research (MBR) models, these Vertical Boolean Tables are
- extremely wide (many many columns)
- sometimes very shallow (not many trans e.g., Cancer data - few samples)
- extremely sparse (mostly 0's - i.e., no customer buys most of the
items in the store in one shopping trip!)
Bioinformatics/genetics data is remarkably similar.
Microarray Data Analysis (MDA) is the analysis of the gene expression levels
of genes spotted on glass slides (Microarrays or Gene Chips) and subjected to
"treatments" or experiments that record the gene expression level
before (using red dye) and
after (using green dye).
For each experiment and each spotted gene, the logarithm of the ratio of
red/green is recorded in a Microarray Data Analysis Table.
MDA is usually stored as an Excel spreadsheet:
GeneTable: GT(Gid,E1...En)
row = gene
column = experiment (plus other columns)
value = log ratio of r/g (a Real Number)
BinaryGeneTable: BGT(Gid,E1...En)
is the table you get by setting a threshold
expression ratio and recording 1 iff it is exceeded.
Note: sometimes 3-value logic is used, in which there is an expression
threshold and a repression threshold. We will call that the
TernaryGeneTable: TGT(Gid, E1,...,En)
BinaryExperimentTable: BET(Eid, Gene1,...,GeneN) is similar to BT in MBR
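As a rough sketch of the thresholding just described, the snippet below derives
binary and ternary gene tables from a small gene table and then transposes the
binary one into an experiment-centric table. The log-ratio values, both
threshold values, and the -1/0/1 encoding of the ternary case are assumptions
made only for this example:

```python
# Hypothetical log ratios: one row per gene, one column per experiment.
GT = {
    "g1": [1.8, -0.2, -2.1],
    "g2": [0.1, 1.2, 0.0],
}
EXPRESSION_THRESHOLD = 1.0    # assumed value
REPRESSION_THRESHOLD = -1.0   # assumed value


def binary(v):
    """Return 1 iff the expression threshold is exceeded."""
    return 1 if v > EXPRESSION_THRESHOLD else 0


def ternary(v):
    """1 = expressed, -1 = repressed, 0 = neither (encoding is an assumption)."""
    if v > EXPRESSION_THRESHOLD:
        return 1
    if v < REPRESSION_THRESHOLD:
        return -1
    return 0


BGT = {gid: [binary(v) for v in row] for gid, row in GT.items()}
TGT = {gid: [ternary(v) for v in row] for gid, row in GT.items()}

# Transposing BGT gives one row per experiment and one column per gene.
genes = list(BGT)
BET = {f"E{j + 1}": [BGT[g][j] for g in genes] for j in range(len(GT["g1"]))}

print(BGT)   # {'g1': [1, 0, 0], 'g2': [0, 1, 0]}
print(TGT)   # {'g1': [1, 0, -1], 'g2': [0, 1, 0]}
print(BET)   # {'E1': [1, 0], 'E2': [0, 1], 'E3': [0, 0]}
```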
****************************************************************
Formally in MBR, the Boolean Transaction Table (BT) is defined as follows:
I={i1..im} is the set of items.
=====
- eg, an item for purchase in a store
- Each item in a store is an attribute, Ai,
- with Boolean values (1 = in a customer's "market basket
or shopping cart" and 0=not in it).
An itemset is a subset of I, (eg, set of items in a store)
========
A k-itemset is an itemset of cardinality (size) k
=========
D={t1..tn} are transactions (eg, customer transaction at checkout)
============
Each ti has an identifier and an itemset, ti = (t-id, t-itemset)
A transaction,t,
SUPPORTS an itemset,A, if A IS CONTAINED IN t-itemset.
========
An Association rule is an implication A => C,
================ where A and C are disjoint itemsets.
( A = antecedent and C = consequent)
- rules have quality or interestingness measures,
two of which are support and confidence:
SUPPORT OF ITEMSET B is the ratio, s, of transactions containing B
==================
SUPPORT OF RULE A=>C is the support of A u C
===============
CONFIDENCE OF A=>C = fraction of those trans supporting A that
========== also support C.
- conf(A=>C) = supp(AuC) / supp(A)
- The confidence measures the strength of the implication.
- Both support and confidence can be measured as %'s.
- As a %, confidence is the conditional probability, P(C|A)
FREQUENT ITEMSETs are those with support >= a threshold, minsupp.
================
- The set of frequent k-itemsets is denoted Lk.
CONFIDENT RULEs are those with confidence >= a threshold, minconf.
==============
STRONG RULEs are confident rules with frequent support sets.
===========
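The definitions above translate almost directly into code. The following is a
minimal sketch, assuming transactions are held as Python sets in a dict; the
four sample transactions and the function names supp, conf and is_strong are
illustrative assumptions:

```python
# Sample transaction database D (illustrative data): Tid -> itemset.
D = {
    100: {"a", "c", "d"},
    200: {"b", "c", "e"},
    300: {"a", "b", "c", "e"},
    400: {"b", "e"},
}


def supp(itemset, db=D):
    """Fraction of transactions whose itemset contains `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in db.values()) / len(db)


def conf(antecedent, consequent, db=D):
    """conf(A=>C) = supp(A u C) / supp(A)."""
    return supp(set(antecedent) | set(consequent), db) / supp(antecedent, db)


def is_strong(antecedent, consequent, minsupp, minconf, db=D):
    """A rule is strong if its support set is frequent and it is confident."""
    both = set(antecedent) | set(consequent)
    return supp(both, db) >= minsupp and conf(antecedent, consequent, db) >= minconf


print(supp({"b", "c", "e"}))                    # 0.5
print(conf({"b", "c"}, {"e"}))                  # 1.0
print(is_strong({"b", "e"}, {"c"}, 0.5, 0.8))   # False (confidence is 2/3)
```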
Given a user specified minsupp and minconf, our first task is to
find ALL strong rules, called Association Rule Mining, ARM, using:
=======================
1. Find all frequent itemsets, Lk. (for each k >= 1)
2. For each frequent itemset, B,
find all strong rules supported by that frequent itemset
(find all antecedent subsets, A s.t. A==>B-A is strong)
- the performance of ARM is largely determined by 1.
APRIORI ALGORITHM
=================
Based on the following pruning principle:
Any subset of a frequent itemset must also be frequent.
FINDING FREQUENT ITEMSETS:
Start by finding all frequent 1-itemsets, L1.
Then candidates for L2 consist of joins of sets from L1,
where 2 itemsets "join" if they're identical except for 1 member.
Let Ck = set of Candidate k-itemsets ( Lk-1 JOIN Lk-1 )
1st Iteration: Scan D for L1.
Kth Iteration: Create Ck as Lk-1 JOIN Lk-1
Scan D to count the support of each candidate in Ck; the frequent ones form Lk
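The level-wise search can be sketched as follows. This is a small in-memory
illustration rather than a tuned implementation: the function name apriori,
the use of support counts instead of ratios, and the sample database are all
assumptions made for the example.

```python
from itertools import combinations


def apriori(db, minsupp_count):
    """Return {frozenset: support count} for every frequent itemset in db."""
    baskets = [frozenset(b) for b in db]

    # 1st iteration: scan D for L1.
    counts = {}
    for b in baskets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= minsupp_count}
    frequent = dict(level)

    k = 2
    while level:
        prev = list(level)
        # Join: two (k-1)-itemsets that differ in one member give a k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Scan D to count the surviving candidates.
        counts = {c: sum(c <= b for b in baskets) for c in candidates}
        level = {s: n for s, n in counts.items() if n >= minsupp_count}
        frequent.update(level)
        k += 1
    return frequent


db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
freq = apriori(db, minsupp_count=2)
print(sorted("".join(sorted(s)) for s in freq))
# ['a', 'ac', 'b', 'bc', 'bce', 'be', 'c', 'ce', 'e']
```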
GENERATING STRONG RULES:
For each B in Lk, find all strong rules, A => B-A.
If A' is a subset of A, then supp(A') >= supp(A), so conf(A'=>B-A') <= conf(A=>B-A).
Hence, if A is not a strong-rule antecedent in B, then no subset A' of A is either.
So, for each B in Lk,
1. start with largest antecedent sets (k-1 item subsets)
2. next consider only (k-2)-item antecedents for which
every (k-1)-item SuperAntecedent produced a strong rule
Said another way (better way?),
Consider only those 2-item consequents for which both
1-item subsets were strong rule consequents
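The consequent-growing formulation can be sketched as below. The function name
strong_rules, the dict of support counts, and the sample numbers are
assumptions for illustration; the supports are those of one 3-itemset and its
subsets in a four-transaction database.

```python
def strong_rules(itemset, support, minconf):
    """Return (antecedent, consequent, confidence) for rules drawn from `itemset`.

    Consequents are grown level by level: a consequent is tried only if every
    one of its immediate sub-consequents already produced a confident rule.
    `support` maps frozensets to support counts.
    """
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]   # 1-item consequents
    while consequents:
        kept = []
        for c in consequents:
            a = itemset - c
            if not a:          # the antecedent must be non-empty
                continue
            confidence = support[itemset] / support[a]
            if confidence >= minconf:
                rules.append((a, c, confidence))
                kept.append(c)
        # Build the next level only from consequents that survived.
        size = len(kept[0]) + 1 if kept else 0
        nxt = {x | y for x in kept for y in kept if len(x | y) == size}
        consequents = [c for c in nxt if all(c - {i} in kept for i in c)]
    return rules


support = {
    frozenset("bce"): 2,
    frozenset("bc"): 2, frozenset("be"): 3, frozenset("ce"): 2,
    frozenset("b"): 3, frozenset("c"): 3, frozenset("e"): 3,
}
for a, c, cf in strong_rules("bce", support, minconf=0.8):
    print(sorted(a), "=>", sorted(c), round(cf, 3))
```

With minconf=0.8 only two rules survive (antecedents {b,c} and {c,e}), and the
single 2-item consequent {b,e} is tried and rejected; with minconf=0.6 all six
rules from this itemset are returned.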
SUMMARY:
1. supp(B) = |{t: B is a subset of t-itemset}| / |D|
2. supp(A=>C) = supp(AuC)
3. conf(A=>C) = supp(AuC)/supp(A)
APRIORI
4. Scan D to find L1
5. Form candidate 2-itemsets, C2, as L1 JOIN L1
6. Scan D to count the candidates in C2, giving L2;
...
7. For each Lk,
find strong minimal consequents;
find strong minimal superset consequents of those, etc.
EXAMPLE 1:
I = {a,b,c,d,e}; D = {100,200,300,400}
Sample transaction database:
TID ItemLists
--- ------------------------------
100 a c d
200 b c e
300 a b c e
400 b e
minsupp=50% (itemset is frequent if >= 2 transactions support it)
minconf=60% (rule is confident if conditional prob >= .6)
The process of finding frequent itemsets:
      C1               C2                 C3               C4
Iset  Sup Freq   Iset   Sup Freq   Iset     Sup Freq    (empty, so frequent
{a}    2   y     {a,b}   1         {b,c,e}   2   y       itemset generation ends)
{b}    3   y     {a,c}   2   y
{c}    3   y     {a,e}   1
{d}    1         {b,c}   2   y
{e}    3   y     {b,e}   3   y
                 {c,e}   2   y
Derive association rules.
For frequent 3-itemsets, start with 1-item consequents: conf?
Rule1: b^c ==> e, confidence = 100%. =Sup{b,c,e}/Sup{b,c} y
Rule2: b^e ==> c, confidence = 66.7%. =Sup{b,c,e}/Sup{b,e} y
Rule3: c^e ==> b, confidence = 100%. =Sup{b,c,e}/Sup{c,e} y
Form all 2-item consequents from high-conf 1-item consequents:
Rule4: b ==> c^e, confidence = 66.7%. =Sup{b,c,e}/Sup{b} y
Rule5: c ==> b^e, confidence = 66.7%. =Sup{b,c,e}/Sup{c} y
Rule6: e ==> b^c, confidence = 66.7%. =Sup{b,c,e}/Sup{e} y
For each frequent 2-Isets, start with 1-item consequents:
For {a,c}
Rule7: a ==> c, confidence = 100% = Sup{a,c}/Sup{a} y
Rule8: c ==> a, confidence = 66.7% = Sup{a,c}/Sup{c} y
For {b,c}
Rule9: b ==> c, confidence = 66.7% = Sup{b,c}/Sup{b} y
Rule10: c ==> b, confidence = 66.7% = Sup{b,c}/Sup{c} y
For {b,e}
Rule11: b ==> e, confidence = 100% = Sup{b,e}/Sup{b} y
Rule12: e ==> b, confidence = 100% = Sup{b,e}/Sup{e} y
For {c,e}
Rule13: c ==> e, confidence = 66.7% = Sup{c,e}/Sup{c} y
Rule14: e ==> c, confidence = 66.7% = Sup{c,e}/Sup{e} y
All 14 rules are high-confidence.
EXAMPLE 2:
minconf=80%, minsupp=50%:
We get the same frequent itemsets (since we have the same minsupp)
      C1               C2                 C3               C4
Iset  Sup Freq   Iset   Sup Freq   Iset     Sup Freq    (empty, so frequent
{a}    2   y     {a,b}   1         {b,c,e}   2   y       itemset generation ends)
{b}    3   y     {a,c}   2   y
{c}    3   y     {a,e}   1
{d}    1         {b,c}   2   y
{e}    3   y     {b,e}   3   y
                 {c,e}   2   y
Derive association rules.
For frequent 3-itemsets, start with 1-item consequents: conf?
Rule1: b^c ==> e, confidence = 100%. =Sup{b,c,e}/Sup{b,c} y
Rule2: b^e ==> c, confidence = 66.7%. =Sup{b,c,e}/Sup{b,e}
Rule3: c^e ==> b, confidence = 100%. =Sup{b,c,e}/Sup{c,e} y
then all 2-item consequents from high-conf 1-item consequents:
Rule5: c ==> b^e, confidence = 66.7%. =Sup{b,c,e}/Sup{c}
For each frequent 2-Isets, start with 1-item consequents:
For {a,c}
Rule7: a ==> c, confidence = 100% = Sup{a,c}/Sup{a} y
Rule8: c ==> a, confidence = 66.7% = Sup{a,c}/Sup{c}
For {b,c}
Rule9: b ==> c, confidence = 66.7% = Sup{b,c}/Sup{b}
Rule10: c ==> b, confidence = 66.7% = Sup{b,c}/Sup{c}
For {b,e}
Rule11: b ==> e, confidence = 100% = Sup{b,e}/Sup{b} y
Rule12: e ==> b, confidence = 100% = Sup{b,e}/Sup{e} y
For {c,e}
Rule13: c ==> e, confidence = 66.7% = Sup{c,e}/Sup{c}
Rule14: e ==> c, confidence = 66.7% = Sup{c,e}/Sup{e}
Only Rules 1,3,7,11,12 are high-confidence.
EXAMPLE 3: mconf=80%, msup=70%
TId Items
--- -----------------------------
100 a c d
200 b c e
300 a b c e
400 b e
We get new frequent itemsets.
Cand_1-Isets      Cand_2-Isets      Cand_3-Isets is empty, so
Iset  Sup Freq    Iset   Sup Freq   frequent itemset generation ends.
{a}    2          {b,c}   2
{b}    3   y      {b,e}   3   y
{c}    3   y      {c,e}   2
{d}    1
{e}    3   y
Derive association rules.
For frequent 2-itemsets, start with 1-item consequents: conf?
Rule1: b => e, confidence = 100%. =Sup{b,e}/Sup{b} y
Rule2: e => b, confidence = 100%. =Sup{b,e}/Sup{e} y
Rules 1,2 are high-confidence.
EXAMPLE 4: mconf=80%, msup=80%
We get new frequent itemsets.
TId Items
--- -----------------------------
100 a c d
200 b c e
300 a b c e
400 b e
Cand_1-Isets      Cand_2-Isets is empty, so
Iset  Sup Freq    frequent itemset generation ends.
{a}    2
{b}    3
{c}    3
{d}    1
{e}    3
Derive association rules: there are no frequent itemsets, so there are no rules. Done.
These examples should demonstrate how much the pruning rules
simplify the cases with higher support and confidence.
HASH-BASED techniques (hashing itemset counts)
=====================
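- To reduce the size of the next candidate set, C(k+1) (especially C2): while
scanning D to determine which itemsets in Ck are to be in Lk, also
create a hash table of counts of the (k+1)-itemsets in each transaction.
A (k+1)-itemset whose bucket count is below minsupp cannot be frequent.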
Example:
Take an item universe, I = {I1,I2,I3,I4,I5}, and a transaction database, D, as follows:
Tid T-itemset
---- --------------
T100 | I1 I2 I5
T200 | I2 I4
T300 | I2 I3
T400 | I1 I2 I4
T500 | I1 I3
T600 | I2 I3
T700 | I1 I3
T800 | I1 I2 I3 I5
T900 | I1 I2 I3
T1000| I1 I4
minsupp_cnt = 6
While finding the frequent 1-itemsets, create a count histogram
of the form:
Itemset Support
------- -------
{I1}
{I2}
{I3}
{I4}
{I5}
Also create a hash table by hashing each item pair (Ix,Iy) of a transaction
(smaller index first) using
H2(x,y)= ( x*5 + y )MOD7
.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |     |     |     |     |     |     |     |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|     |     |     |     |     |     |     |
(Note that buck_content is included only to aid the reader.)
Starting scan at: T100 | I1 I2 I5
Itemset Support
------- -------
{I1} 1
{I2} 1
{I3}
{I4}
{I5} 1
H2(1,2)= (1*5+2=7)MOD7 = 0
H2(1,5)= (1*5+5=10)MOD7= 3
H2(2,5)= (2*5+5=15)MOD7= 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 1 | 1 | | 1 | | | |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5| |I1 I5| | | |
| | | | | | | |
Continuing scan at: T200 | I2 I4
Itemset Support
------- -------
{I1} 1
{I2} 2
{I3}
{I4} 1
{I5} 1
H2(2,4)= (2*5+4=14)MOD7= 0
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 2 | 1 | | 1 | | | |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5| |I1 I5| | | |
|I2 I4| | | | | | |
Continuing scan at: T300 | I2 I3
H2(2,3)= (2*5+3=13)MOD7= 6
Itemset Support
------- -------
{I1} 1
{I2} 3
{I3} 1
{I4} 1
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 2 | 1 | | 1 | | | 1 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5| |I1 I5| | |I2 I3|
|I2 I4| | | | | | |
Continuing scan at: T400 | I1 I2 I4
H2(1,2)= (1*5+2=7)MOD7= 0
H2(1,4)= (1*5+4=9)MOD7= 2
H2(2,4)= (2*5+4=14)MOD7=0
Itemset Support
------- -------
{I1} 2
{I2} 4
{I3} 1
{I4} 2
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 4 | 1 | 1 | 1 | | | 1 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5|I1 I4|I1 I5| | |I2 I3|
|I2 I4| | | | | | |
|I1 I2| | | | | | |
|I2 I4| | | | | | |
Continuing scan at: T500 | I1 I3
H2(1,3)= (1*5+3=8)MOD7= 1
Itemset Support
------- -------
{I1} 3
{I2} 4
{I3} 2
{I4} 2
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 4 | 2 | 1 | 1 | | | 1 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5|I1 I4|I1 I5| | |I2 I3|
|I2 I4|I1 I3| | | | | |
|I1 I2| | | | | | |
|I2 I4| | | | | | |
Continuing scan at: T600 | I2 I3
H2(2,3)= (2*5+3=13)MOD7= 6
Itemset Support
------- -------
{I1} 3
{I2} 5
{I3} 3
{I4} 2
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 4 | 2 | 1 | 1 | | | 2 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5|I1 I4|I1 I5| | |I2 I3|
|I2 I4|I1 I3| | | | |I2 I3|
|I1 I2| | | | | | |
|I2 I4| | | | | | |
Continuing scan at: T700 | I1 I3
H2(1,3)= (1*5+3=8)MOD7= 1
Itemset Support
------- -------
{I1} 4
{I2} 5
{I3} 4
{I4} 2
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 4 | 3 | 1 | 1 | | | 2 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5|I1 I4|I1 I5| | |I2 I3|
(discontinue|I2 I4|I1 I3| | | | |I2 I3|
showing |I1 I2|I1 I3| | | | | |
buck_content)I2 I4
Continuing scan at: T800 | I1 I2 I3 I5
H2(1,2)= (1*5+2=7)MOD7= 0
H2(1,3)= (1*5+3=8)MOD7= 1
H2(1,5)= (1*5+5=10)MOD7=3
H2(2,3)= (2*5+3=13)MOD7=6
H2(2,5)= (2*5+5=15)MOD7=1
H2(3,5)= (3*5+5=20)MOD7=6
Itemset Support
------- -------
{I1} 5
{I2} 6
{I3} 5
{I4} 2
{I5} 2
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 5 | 5 | 1 | 2 | | | 4 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
Continuing scan at: T900 | I1 I2 I3
H2(1,2)= (1*5+2=7)MOD7= 0
H2(1,3)= (1*5+3=8)MOD7= 1
H2(2,3)= (2*5+3=13)MOD7=6
Itemset Support
------- -------
{I1} 6
{I2} 7
{I3} 6
{I4} 2
{I5} 2
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 6 | 6 | 1 | 2 | | | 5 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
Continuing scan at: T1000| I1 I4
H2(1,4)= (1*5+4=9)MOD7= 2
Itemset Support
------- -------
{I1} 7
{I2} 7
{I3} 6
{I4} 3
{I5} 2
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 6 | 6 | 2 | 2 | | | 5 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
Since Minsup_cnt = 6, L1 = {I1,I2,I3}
The usual C2 would be { {I1,I2}, {I1,I3}, {I2,I3} }
but by first applying H2 we see that C2 can be pruned
to { {I1,I2} {I1,I3} } since H2({I2,I3})=6 and that bucket
count is only 5.
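The scan traced above can be reproduced with a few lines of code. This is a
sketch only: items are represented by their integer indices, and the variable
names are assumptions; the hash function, the ten transactions and the minimum
support count are the ones used in the example.

```python
from itertools import combinations

# Items are written by index, so I1 is 1, I2 is 2, and so on.
D = {
    "T100": [1, 2, 5], "T200": [2, 4], "T300": [2, 3], "T400": [1, 2, 4],
    "T500": [1, 3], "T600": [2, 3], "T700": [1, 3], "T800": [1, 2, 3, 5],
    "T900": [1, 2, 3], "T1000": [1, 4],
}
MINSUP_COUNT = 6


def h2(x, y):
    """Hash function for 2-itemsets used in the example."""
    return (x * 5 + y) % 7


item_count = {}
bucket_count = [0] * 7
for items in D.values():                       # a single scan of D
    for i in items:
        item_count[i] = item_count.get(i, 0) + 1
    for x, y in combinations(sorted(items), 2):
        bucket_count[h2(x, y)] += 1

L1 = sorted(i for i, c in item_count.items() if c >= MINSUP_COUNT)
C2 = list(combinations(L1, 2))
# A pair can be frequent only if its whole bucket reaches the threshold.
C2_pruned = [p for p in C2 if bucket_count[h2(*p)] >= MINSUP_COUNT]

print(item_count)     # {1: 7, 2: 7, 5: 2, 4: 3, 3: 6}
print(bucket_count)   # [6, 6, 2, 2, 0, 0, 5]
print(C2_pruned)      # [(1, 2), (1, 3)]
```

Because several pairs share a bucket, a bucket count is only an upper bound on
each pair's support: surviving pairs must still be counted exactly.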
DHP:
===
The above is an introduction to the DHP
(Direct Hashing and Pruning) methods.
DIC:
===
Dynamic Itemset Counting method begins to count
cand 3-itemsets before completing the count of cand 2-itemsets,
cand 4-itemsets before completing the count of cand 3-itemsets,
etc. This reduces the number of database scans required.
TRANSACTION REDUCTION
=====================
(2nd method for improving the efficiency of Apriori)
- A trans that does not contain any frequent k-itemsets
cannot contain any frequent (k+1)-itemsets.
Therefore it can be removed.
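A minimal sketch of this reduction step, assuming transactions are stored as
sets in a dict; the function name reduce_transactions and the sample data are
illustrative assumptions:

```python
def reduce_transactions(db, frequent_k):
    """Keep only the transactions that contain at least one frequent k-itemset."""
    return {
        tid: items
        for tid, items in db.items()
        if any(f <= items for f in frequent_k)
    }


db = {
    100: {"a", "c", "d"},
    200: {"b", "c", "e"},
    300: {"a", "b", "c", "e"},
    400: {"b", "e"},
}
L2 = [frozenset({"b", "e"})]          # suppose this is the only frequent 2-itemset
reduced = reduce_transactions(db, L2)
print(sorted(reduced))                # [200, 300, 400]
```

Transaction 100 contains no frequent 2-itemset, so later scans (for k = 3 and
beyond) can skip it.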
PARTITIONING (partitioning the data to find candidate itemsets)
============
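- partition D into n partitions (each with a minsup' of minsup/n)
- throw out those itemsets that don't achieve minsup' in any
partition; an itemset frequent in D must be frequent in at least
one partition, so the survivors are the only global candidates.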
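The idea can be sketched as a two-phase procedure. In this illustration the
threshold is expressed as a ratio, so the same ratio applies inside each
partition (equivalent to dividing a support count by n for equal-sized
partitions); the brute-force local miner, the function names and the sample
data are assumptions chosen to keep the example short.

```python
from itertools import combinations


def local_frequent(part, ratio):
    """Brute force: every itemset whose support ratio within `part` is >= ratio."""
    items = sorted(set().union(*part))
    found = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(set(c) <= b for b in part) / len(part) >= ratio:
                found.add(frozenset(c))
    return found


def partition_mine(db, ratio, n_parts):
    """Phase 1: local frequent itemsets per partition. Phase 2: one global count."""
    size = -(-len(db) // n_parts)                       # ceiling division
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    candidates = set().union(*(local_frequent(p, ratio) for p in parts))
    return {c for c in candidates if sum(c <= b for b in db) / len(db) >= ratio}


db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = partition_mine(db, ratio=0.5, n_parts=2)
print(sorted("".join(sorted(s)) for s in result))
# ['a', 'ac', 'b', 'bc', 'bce', 'be', 'c', 'ce', 'e']
```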
SAMPLING (mining on a subset of the given data)
========
- Pick a sample, S;
Look for frequent itemsets in S (may miss some)
FP-growth:
=========
Another method for improving the efficiency of finding
frequent itemsets is the FP-growth method in which a
complex data structure is constructed from which all
frequent itemsets can be determined without doing
additional database scans.
In this method we try to reduce time required to find frequent
item sets by going right to the isolation of frequent itemsets
without first generating candidate frequent item sets. This
will reduce the size and number of database scans required.
Assume a minimum support count of 2.
TID Items
T100 | I1 I2 I5
T200 | I2 I4
T300 | I2 I3
T400 | I1 I2 I4
T500 | I1 I3
T600 | I2 I3
T700 | I1 I3
T800 | I1 I2 I3 I5
T900 | I1 I2 I3
First scan the database for frequent 1-itemsets and
sort them in order of descending support count:
L-order: I2:7, I1:6, I3:6, I4:2, I5:2
Then create the root of the FP-tree and label it null:
(_) null
Scan the database processing items in L-order.
Create a branch in L-order for each trans (label nodes item:count)
T100 | I1 I2 I5
Item_Header_Table ....(_) null
Item Cnt Link I2:1_/
----- --- ---- .---- > (_)
{I2} 1-------' I1:1_/
{I1} 1----------- > (_)
{I3} 0 /
{I4} 0 I5:1_/
{I5} 1------- > (_)
To facilitate tree traversal, an Item_Header_Table (IHT) is built
so each item is linked to its occurrences in the tree.
Continuing: T200 | I2 I4
Since we already have an I2:1 node linked from the root we share it
and increment its count (always share prefixes with existing paths).
Item_Header_Table ....(_) null
Item Cnt Link I2:2_/
----- --- ---- .---- > (_)
{I2} 2-------' I1:1_/ \ _I4:1
{I1} 1----------- > (_) (_)
{I3} 0 / ^
{I4} 1-. I5:1_/ :
{I5} 1------- > (_) :
: :
`------------------'
Continuing: T300 | I2 I3
____
Item_Header_Table ...(null)
Item Cnt _/__
----- --- .--------(I2:3)
{I2} 3-------' ____/ _|__ \
{I1} 1----------(I1:1) (I4:1) (I3:1)
{I3} 1-------------/------:----'
{I4} 1-. ____/ :
{I5} 1-------(I5:1) :
: :
`------------------'
Continuing: T400 | I2 I1 I4
____
Item_Header_Table ...(null)
Item Cnt _/__
----- --- .--------(I2:4)
{I2} 4-------' ____/ _|__ \
{I1} 2----------(I1:2) (I4:1) (I3:1)
{I3} 1-------------/-|--:--:---'
{I4} 2-. ____/ | . .
{I5} 1-.-----(I5:1) (I4:1) .
. .
`-------------------'
Continuing: T500 | I1 I3
.........................
Item_Header_Table : ...(null).... :
Item Cnt : _/__ \_:__
----- -- ..........:.....(I2:4) (I1:1)
{I2} 4.: :___/ _\__ \ _\__
{I1} 3..........(I1:2) (I4:1) (I3:1)---(I3:1)
{I3} 2............/.\...:..:...'
{I4} 2.. ___/ \__: :
{I5} 1.:.....(I5:1) (I4:1) :
:...................:
Continuing: T600 | I2 I3
.........................
Item_Header_Table : ...(null).... :
Item Cnt : _/__ \_:__
----- -- ..........:.....(I2:5) (I1:1)
{I2} 5.: :___/ _\__ \ _\__
{I1} 3..........(I1:2) (I4:1) (I3:2)---(I3:1)
{I3} 3............/.\...:..:...'
{I4} 2.. ___/ \__: :
{I5} 1.:.....(I5:1) (I4:1) :
:...................:
Continuing: T700 | I1 I3
.........................
Item_Header_Table : ...(null).... :
Item Cnt : _/__ \_:__
----- -- ..........:.....(I2:5) (I1:2)
{I2} 5.: :___/ _\__ \ _\__
{I1} 4..........(I1:2) (I4:1) (I3:2)---(I3:2)
{I3} 4............/.\...:..:...'
{I4} 2.. ___/ \__: :
{I5} 1.:.....(I5:1) (I4:1) :
:...................:
Continuing: T800 | I2 I1 I3 I5
.........................
Item_Header_Table : ...(null).... :
Item Cnt : __/____ \_:__
----- -- ..........:....(__I2:6_) (I1:2)
{I2} 6.: :___/ _\____ \ _\__
{I1} 5..........(I1:3) (_I4:1_) (I3:2) (I3:2)
{I3} 5............/.\.\:......:...' `.....' :
{I4} 2.. ___/ __\_:\ : :
{I5} 2.:....(I5:1)(I4:1)(I3:1): :
:.....:............\..:: :
: \ :...............:
: __\_
: (I5:1)
:.............:
Continuing: T900 | I2 I1 I3
.........................
Item_Header_Table : ...(null).... :
Item Cnt : __/____ \_:__
----- -- ..........:....(__I2:7_) (I1:2)
{I2} 7.: :___/ _\____ \ _\__
{I1} 6..........(I1:4) (_I4:1_) (I3:2) (I3:2)
{I3} 6............/.\.\:......:...' `.....' :
{I4} 2.. ___/ __\_:\ : :
{I5} 2.:....(I5:1)(I4:1)(I3:2): :
:.....:............\..:: :
: \ :...............:
: __\_
: (I5:1)
:.............:
This gives us all the frequent patterns of any length
and therefore no further database scans are necessary.
- tremendous time savings!
- only two database scans necessary and then extensive
processing of the FP-tree.
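The tree construction walked through above can be sketched in code. This covers
only the build step (two scans), not the mining of the tree. The class and
function names, the tie-breaking rule for items of equal support, and the use
of a list of nodes per item in place of the linked Item_Header_Table are
assumptions of this sketch; the data is the nine transactions T100 to T900
processed in the walk-through.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_fp_tree(db, minsup_count):
    # Scan 1: count the items and fix the L-order (descending support;
    # ties are broken by item name, which is an assumption).
    counts = Counter(i for t in db for i in t)
    order = [i for i, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
             if c >= minsup_count]
    rank = {i: r for r, i in enumerate(order)}

    # Scan 2: insert each transaction in L-order, sharing existing prefixes.
    root = Node(None, None)
    header = {i: [] for i in order}          # item -> its nodes in the tree
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order


db = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]
root, header, order = build_fp_tree(db, minsup_count=2)
print(order)                                                  # ['I2', 'I1', 'I3', 'I4', 'I5']
print({i: sum(n.count for n in header[i]) for i in order})
# {'I2': 7, 'I1': 6, 'I3': 6, 'I4': 2, 'I5': 2}
```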
Mining Distance-based Association Rules
=======================================
- Previous section described QARs where quantitative attribs
are discretized initially by binning,
then the intervals are combined.
- Such an approach may not capture semantics since it ignores
distances
- A distance-based AR method captures the semantics of interval
data while allowing for approximation in data values.
- a 2-phase algorithm can be used to mine distance based ARs.
- The FIRST PHASE employs clustering to find intervals or clusters
- a density threshold and a frequency threshold are required of
a cluster (must be close and numerous)
- The SECOND PHASE obtains distance-based ARs by searching for
groups of clusters that occur frequently together.
- To conclude that Ac => Cc, we want the antecedent-cluster, Ac,
when projected onto the consequent-space to be within the
consequent cluster, Cc.
ARM and Correlation Analysis
============================
- most ARM methods employ a support-confidence framework to
weed out uninteresting or misleading rules.
- even strong ARs can be uninteresting or misleading.
- methods of statistical independence and correlation analysis
can help weed them out.
- Example:
MISLEADING RULES and REDUNDANT RULES
====================================
In MBR basket case, consider T={tea} and C={coffee} |D|=100 trans.
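MISLEADING:
        coffee  NOTcoffee | total
       .------------------|------
tea    |  20        5     |   25
NOTtea |  70        5     |   75
-------|------------------|------
total  |  90       10     |  100
Conf(T=>C) = |TuC|/|T| = 20/25 = .8  (|TuC| = number of trans with both tea and coffee)
Conf(D=>C) = supp(C) = |C|/|D| = 90/100 = .9
So the rule T=>C is misleading: it looks strong, yet buying tea
lowers the chance of buying coffee from .9 to .8.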
REDUNDANT:
        coffee  NOTcoffee | total
       .------------------|------
tea    |  20        2     |   22
NOTtea |  70        8     |   78
-------|------------------|------
total  |  90       10     |  100
Conf(T=>C) = 20/22  = .9090
Conf(D=>C) = 90/100 = .9000
The two are within .0090 of each other, so they are redundant rules.
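One way to automate the comparison made in both tables is to report a rule's
confidence next to the baseline supp(C) and their ratio. The ratio is commonly
called lift; that term, the function name, and the argument names are
additions for this sketch and do not appear in the tables above.

```python
def rule_vs_baseline(both, antecedent_total, consequent_total, n):
    """Return (confidence, baseline supp(C), lift) from contingency-table counts."""
    confidence = both / antecedent_total
    baseline = consequent_total / n
    return confidence, baseline, confidence / baseline


# Misleading case: tea row total 25, of which 20 include coffee.
c1, b1, lift1 = rule_vs_baseline(20, 25, 90, 100)
# Redundant case: tea row total 22, of which 20 include coffee.
c2, b2, lift2 = rule_vs_baseline(20, 22, 90, 100)

print(round(c1, 4), round(b1, 4), round(lift1, 4))   # 0.8 0.9 0.8889
print(round(c2, 4), round(b2, 4), round(lift2, 4))   # 0.9091 0.9 1.0101
```

A lift below 1 flags the misleading case (the antecedent makes the consequent
less likely), while a lift very close to 1 flags the redundant case.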
Text Association Rule Mining:
----------------------------
Given an alphabet, A, the
nWordUniverse, Wn = A u AA u AAA u ... u AA..A  (the last term has n factors of A)
where, eg, AA = {ab | a and b are distinct and belong to A}.
Let W be a subset of U(i=1..n)Wi for some n (the DICTIONARY)
Let S be a subset of U(i=1..m)Wi for some m (the SENTENCES)
- m is usually bigger than n
We can do ARM on the Universe of "Items", W, and
"Transactions", S, as above.
- A HIGH CONFIDENCE RULE, A => F (A,F are disjoint WordSets)
tells us: if all the A-words occur in a sentence then, with
high confidence, all the F-words will also.
- If the rule, A => F, has HIGH SUPPORT, that means all the
A-words and all the F-words occur in most of the sentences.
There are alternate ways to deal with such text situations.
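As a small illustration of treating sentences as transactions and dictionary
words as items, consider the sketch below. The dictionary, the three sentences,
and the choice to split on whitespace are all assumptions made for the example:

```python
# Hypothetical dictionary W and sentences S.
DICTIONARY = {"data", "mining", "rule", "association"}
SENTENCES = [
    "association rule mining finds rules in data",
    "data mining includes association rule mining",
    "a rule has an antecedent",
]

# Each sentence becomes the set of dictionary words that occur in it.
transactions = [set(s.split()) & DICTIONARY for s in SENTENCES]


def supp(words):
    """Fraction of sentences containing every word in `words`."""
    return sum(set(words) <= t for t in transactions) / len(transactions)


def conf(a_words, f_words):
    """Confidence of the rule A => F over the sentences."""
    return supp(set(a_words) | set(f_words)) / supp(a_words)


print(conf({"mining"}, {"rule"}))             # 1.0
print(round(conf({"rule"}, {"mining"}), 3))   # 0.667
```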
APPENDIX 1
----------
(ARM for spatial data)
THE DATA
Spatial Data Organizations
First we consider ways of organizing spatial data. Spatial attributes such as
remotely sensed reflectances (R,G,B,NIR,..), ground attributes (yield levels,
soil moisture levels, elevations),etc., are referred to as "bands".
Let R(P, B1,...Bn) be the file or relation containing these data bands as columns
or attributes for a particular space or area, where P is the key (pixel coordinates,
x-y, of the points in the space) and each column, B1,..,Bn, measures the level of
that attribute for each pixel location. If the inverted list model is used rather
than the relational model (so that one can assume an ordering of the tuples), the
raster ordering of coordinates is usually assumed (first row, followed by second row,
followed by third row, ...)
This Relational (REL) organization is the starting point or basic organization.
In Band-SeQuential (BSQ) organization, the REL organization is projected into many
files - a separate file for each band or column. The coordinate ordering is
assumed to be raster order, and thus need not be part of each band file. Each
band file is then a 1-column file of the measurements for that attribute at each
pixel in raster order (eg, TM data from Landsat satellites is organized as BSQ).
In Band-Interleaved-by-Line (BIL) there is just one file in which the
first row (line) of the first band is followed by the
first row of the second band, ..., followed by the
first row of the last band, followed by the
second row of the first band, followed by the
second row of the second band, ... etc. (e.g., SPOT data from French Satellites is BIL)
In Band-Interleaved-by-Pixel (BIP), there is just one file in which the
first pixel-value of the first band is followed by the
first pixel-value of the second band,..., the
first pixel-value of the last band, followed by the
second pixel-value of the first band,... (e.g., tiff images are BIP).
We note BIP is nearly identical to REL except there are no explicit "record"
or row boundary markers (ie, data is not organized into records, but the values
are in the same order as they are in REL).
A new organization at the "interleaving extreme" end of this spectrum of
organizations is Band-Interleaved-by-bit (BIb) in which there is just one file, the
first bit of the first pixel-value of the first band is followed by the
first bit of the first pixel-value of the second band,..., the
first bit of the first pixel-value of the last band, followed by the
second bit of the first pixel-value of the first band,...
Another new organization, at the other end of this organization spectrum, is
bit-SeQuential (bSQ), in which each bit position of each band, B11..B18, B21..B28, ..., Bn1..Bn8,
is a separate file. We will use bSQ organization later in this course.
We have the following spectrum of Band-oriented organizations:
REL is the basic organization in which there is one file and no interleaving (we say
there is no interleaving since a relation is a "set" of tuples, not a sequence and
each tuple is a "set" of attribute values, not a sequence. i.e., in a relation,
there is no ordering of values).
more interleaving-- >
bSQ BSQ BIL BIP BIb
< -- more files
A very simple illustrative example (with only 2 bands, each having only 2 rows and 2 columns)
BAND-1 BAND-2
254 127 37 240
(1111 1110) (0111 1111) (0010 0101) (1111 0000)
14 193 200 19
(0000 1110) (1100 0001) (1100 1000) (0001 0011)
REL organization: RRN |x-y | B1 | B2 | (a set of tuples,
|====|====|====| each tuple is a set of attribute values)
0 |0,0 |254 | 37 |
1 |0,1 |127 |240 |
2 |1,0 | 14 |200 |
3 |1,1 |193 | 19 |
bSQ organization: (16 files)
B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B28
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
1 1 1 1 1 1 1 0 0 0 1 0 0 1 0 1
0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0
1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1
BSQ organization: B1 B2 (two separate files, values given in decimal)
---- ----
254 37
127 240
14 200
193 19
BIL organization: (one file, values given in decimal)
254 127 37 240 14 193 200 19
BIP organization: (one file, values given in decimal)
254 37 127 240 14 200 193 19
BIb organization: (one file, values given in binary)
10 10 11 10 10 11 10 01 01 11 11 11 10 10 10 10
01 01 00 00 11 10 10 00 10 10 00 01 00 00 01 11
Through simple offset arithmetic, one can convert among these organizations.
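The conversions for the 2-band, 2x2 example can be sketched as follows. The
nested-list representation bands[b][row][col] and the variable names are
assumptions of this sketch; values are taken to be 8 bits wide, as in the
example.

```python
# bands[b][row][col], using the 2-band, 2x2 example values.
bands = [
    [[254, 127], [14, 193]],   # BAND-1
    [[37, 240], [200, 19]],    # BAND-2
]
n_rows, n_cols, n_bits = 2, 2, 8

# BSQ: one raster-ordered list ("file") per band.
BSQ = [[v for row in band for v in row] for band in bands]

# BIL: row r of every band, then row r+1 of every band, ...
BIL = [v for r in range(n_rows) for band in bands for v in band[r]]

# BIP: pixel by pixel, all bands of one pixel together.
BIP = [band[r][c] for r in range(n_rows) for c in range(n_cols) for band in bands]

# bSQ: one bit list per (band, bit position); bit 1 is the high-order bit.
bSQ = {
    f"B{b + 1}{j + 1}": [(v >> (n_bits - 1 - j)) & 1 for v in BSQ[b]]
    for b in range(len(bands))
    for j in range(n_bits)
}

# BIb: for each pixel, bit j of every band, then bit j+1 of every band, ...
BIb = "".join(
    str((band[r][c] >> (n_bits - 1 - j)) & 1)
    for r in range(n_rows) for c in range(n_cols)
    for j in range(n_bits) for band in bands
)

print(BIL)          # [254, 127, 37, 240, 14, 193, 200, 19]
print(BIP)          # [254, 37, 127, 240, 14, 200, 193, 19]
print(bSQ["B11"])   # [1, 0, 0, 1]
print(BIb[:16])     # 1010111010111001
```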
**************************************************************
Note that in traditional Market Basket Data Mining each "item"
is treated as a separate column or attribute of REL and the values
are Boolean (1 or 0, for yes or no). Thus, for MBDM, we start with:
B1 B2 B3 B4 B5 B6 B7 ...
REL organization: |trans|hat |shoe|coat|milk|beer|soap|nails|...
|=====|====|====|====|====|====|====|=====|...
(one relation) |tid-1| 1 | 0 | 0 | 1 | 0 | 1 | 0 |...
|tid-2| 0 | 0 | 0 | 0 | 1 | 0 | 0 |...
|tid-3| 0 | 0 | 1 | 0 | 0 | 0 | 0 |...
|tid-4| 0 | 1 | 0 | 1 | 0 | 1 | 0 |...
|tid-5| 0 | 0 | 0 | 0 | 1 | 0 | 1 |...
. . .
bSQ organization:  B1  B2  B3  B4  B5  B6  B7 ...
                    1   0   0   1   0   1   0
(separate           0   0   0   0   1   0   0
 file for           0   0   1   0   0   0   0
 each item)         0   1   0   1   0   1   0
                    0   0   0   0   1   0   1
                    . . .
BSQ organization: B1 B2 B3 B4 B5 B6 B7 ...
1 0 0 1 0 1 0
(identical to 0 0 0 0 1 0 0
bSQ) 0 0 1 0 0 0 0
0 1 0 1 0 1 0
0 0 0 0 1 0 1
. . .
BIL organization: (transactions are not in any natural 2-D arrangement
so we consider the tid's to constitute one big row)
10000...00010...00100...10010...01001...10010...00001...
BIP organization:
10010100000100001000001010100000101...
BIb organization: (each pixel value is a single bit, thus BIb = BIP)
With Boolean data from a Market Basket Database,
in bSQ=BSQ the data is organized into
a separate file for each item, ordered by transaction;
in BIL the data is organized into one file, item by item
(all transactions for item 1, then all transactions for item 2, ...);
in BIP=BIb the data is organized into one file, transaction by transaction
(all items for tid-1, then all items for tid-2, ...).
Note: Market Basket Data Mining is done assuming the REL organization.
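As a minimal sketch (not a prescribed implementation), the transformation from the non-flat transaction table T(Tid, Itemset) to the Boolean REL table and then to the other organizations might look as follows. The baskets are read off the five sample rows above; the variable names are illustrative only.

```python
items = ["hat", "shoe", "coat", "milk", "beer", "soap", "nails"]

# T(Tid, Itemset): the non-flat form, one set of items per transaction.
baskets = {
    "tid-1": {"hat", "milk", "soap"},
    "tid-2": {"beer"},
    "tid-3": {"coat"},
    "tid-4": {"shoe", "milk", "soap"},
    "tid-5": {"beer", "nails"},
}

# REL: one Boolean row per transaction, one column per item.
rel = [[int(item in basket) for item in items] for basket in baskets.values()]

# bSQ (= BSQ for Boolean data): one bit file per item, ordered by transaction.
bsq = {item: [row[j] for row in rel] for j, item in enumerate(items)}

# BIL: the item files written one after another into a single file.
bil = "".join(str(bit) for item in items for bit in bsq[item])

# BIP (= BIb for Boolean data): the REL rows written one after another.
bip = "".join(str(bit) for row in rel for bit in row)

print(bil)
print(bip)
```

The per-item files in bsq are the vertical (bitmap) view: the number of transactions containing both milk and soap, for example, is the number of positions where both bit files hold a 1.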
**************************************************************
An example of spatial data comes from precision agriculture, where
we subdivide (or "grid") a field into "pixels" or points (usually evenly).
0 1 2 3 4 5 6 7 8 9 10 11 12
.---.---.---.---.---.---.---.---.---.---.---.---.---.
0| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
1| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
2| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
3| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
4| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
5| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
6| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
7| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
8| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
.| | | | | | | | | | | | | |
.
.
The reflectance levels within given spectral ranges (e.g., Red, Green, Blue..)
are captured by a sensor and recorded in raster-ordered BANDs
RED-band
pix refl
0,0 24
0,1 26
0,2 49
0,3 68
0,4 93
0,5 119
.
.
.
The key for each band is the x,y coordinates. This attribute is usually
omitted since the raster ordering is taken to be understood.
So a "BAND" is a single attribute file of
the relative reflectance levels (expressed as numbers in [0, 255]) observed
in a particular color range (or non-visible range such as infra-red...) or an
agricultural band (yield levels - e.g., bushels per acre for each pixel).
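Why the x,y key can be omitted: in raster order the position of a value in the band file determines its pixel, and vice versa. A small sketch, assuming a field 13 pixels wide (columns 0..12 as in the grid drawn above) and the first coordinate being the row; the function names are hypothetical.

```python
N_COLS = 13  # assumed width of the field: columns 0..12

def raster_offset(row, col, n_cols=N_COLS):
    """Position of pixel (row, col) in a raster-ordered band file."""
    return row * n_cols + col

def raster_coords(offset, n_cols=N_COLS):
    """Inverse mapping: the (row, col) of the value stored at a given position."""
    return divmod(offset, n_cols)

# First six values of the RED band listed above (pixels 0,0 through 0,5).
red_band = [24, 26, 49, 68, 93, 119]

print(red_band[raster_offset(0, 4)])
print(raster_coords(raster_offset(3, 7)))
```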
An association rule example: "At points in a field where the midsummer
Near-Infrared (NIR) reflectance is greater than 47 and
Red reflectance is less than 32, the
Yield will be greater than 128 bu/acre"
The rule is written { NIR>47, R<32 } => { Y>128 }
- the set, { NIR>47, R<32 } is called the "antecedent" of the rule
- the set { Y>128 } is called the "consequent" of the rule
"SUPPORT" of the rule = % (or ratio) of pixels with NIR>47 and R<32 and Y>128.
- as a ratio, it can be expressed |antecedent UNION consequent| / Total,
  where |X| is the number of pixels that satisfy every item in the set X
"CONFIDENCE" = % (or ratio) of pixels with NIR>47 and R<32 which also have Y>128.
- as a ratio, it can be expressed |antecedent UNION consequent| / |antecedent|
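A minimal sketch of the two ratios for the rule { NIR>47, R<32 } => { Y>128 }. The six (NIR, Red, Yield) readings are made up for illustration and are not real field data.

```python
# Hypothetical (NIR, Red, Yield) readings for six pixels.
pixels = [
    (60, 20, 140),
    (55, 25, 130),
    (50, 30, 120),
    (40, 20, 150),
    (70, 40, 135),
    (30, 50, 90),
]

def antecedent(p):
    """True if the pixel satisfies NIR>47 and R<32."""
    nir, red, _ = p
    return nir > 47 and red < 32

def consequent(p):
    """True if the pixel satisfies Y>128."""
    return p[2] > 128

n_ante = sum(1 for p in pixels if antecedent(p))
n_both = sum(1 for p in pixels if antecedent(p) and consequent(p))

support = n_both / len(pixels)   # |antecedent UNION consequent| / Total
confidence = n_both / n_ante     # |antecedent UNION consequent| / |antecedent|

print(support, confidence)
```

Here three pixels satisfy the antecedent and two of those also satisfy the consequent, so the support is 2/6 and the confidence is 2/3.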
If the support and confidence of this rule are high, that suggests to the producer
that nitrogen fertilizer should be applied where NIR<=47 and/or R>=32,
so as to maximize the yield in those areas (get it up over 128 bu/acre).
For ARM, we need to formally define the notions of items, itemsets and
transactions in spatial datasets.
The items: I = {(b,v) : b= a band, v= a reflectance value}
The transactions: D = {t : t=(tid,t-itemset)},
tid=(x,y), the pixel row,col and
t-itemset = {(b,v): b ranges over all bands and
v is the reflectance at pixel, t, in band, b.}
Note right away that the sizes are very large in the ARM sense
(e.g., for TM satellite images (with yield bands), there are ~40,000,000
transactions, 8*256 = 2048 items and 2^(2048) itemsets!)
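The definitions of I and D can be sketched directly. This is an illustration under assumptions: the band names and the 2x2 two-band image are hypothetical stand-ins (the values are the B1/B2 values used earlier), and transaction is not a library function.

```python
N_BANDS, N_LEVELS = 8, 256   # 8 bands of 8-bit values, as in the TM example

# I = {(b, v)}: every (band, value) pair is an item.
n_items = N_BANDS * N_LEVELS

# Without further restriction, every subset of I is an itemset.
n_itemsets = 2 ** n_items

def transaction(tid, bands):
    """Build t = (tid, t-itemset) for the pixel at tid = (row, col)."""
    row, col = tid
    return tid, {(b, band[row][col]) for b, band in bands.items()}

# Hypothetical two-band image.
bands = {"B1": [[254, 127], [14, 193]], "B2": [[37, 240], [200, 19]]}
t = transaction((1, 0), bands)

print(n_items)
print(t)
```

Each t-itemset has exactly one item per band, which is the observation used below to cut down the number of itemsets worth considering.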
The number of transactions (pixels) can be reduced by focusing on a
particular small area (e.g., a field).
The number of itemsets can be reduced by noting that a pixel can have only
one reflectance value from a given band. Almost always we are interested
in knowing when the values are in a particular range or interval.
Therefore we can restrict our itemset consideration to those composed of
one interval from each band.
---------------------------
In a given band there are 255 ways to pick the left endpoint of the interval
and for left-endpoint, l, there are 255-l ways to pick a right endpoint.
On the average there will be 127 ways to pick the right endpoint.
Thus, there are really only (255*127)^8 or ~(2^8*2^7)^8 = 2^120 ~ 10^36 =
1,000,000,000,000,000,000,000,000,000,000,000,000 itemsets to consider.
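The estimate can be checked with exact integer arithmetic. A short sketch, assuming integer endpoints 0 <= l < r <= 255 and 8 bands:

```python
import math

N_BANDS = 8

# Exact number of intervals per band: for left endpoint l there are 255-l
# choices of right endpoint, l = 0..254.
exact_per_band = sum(255 - l for l in range(255))

# The estimate used in the text: 255 left endpoints, ~127 right endpoints each.
approx_per_band = 255 * 127

approx_itemsets = approx_per_band ** N_BANDS
exact_itemsets = exact_per_band ** N_BANDS

print(exact_per_band, approx_per_band)
print(math.log10(approx_itemsets), math.log10(exact_itemsets), math.log10(2 ** 120))
```

The exact per-band count is C(256,2) = 32,640 (an average of 128 right endpoints per left endpoint), so the 255*127 figure is a slight underestimate; either way the total is on the order of 10^36.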
- We can reduce the number of items by partitioning the Bands into intervals
and letting each interval correspond to a value.
Partitioning bands into intervals:
Equilength interval partitioning.
By truncating some of the right-most bits of the values (low order or least
significant bits) we can reduce the number of itemsets dramatically without
losing too much information (the low order bits show only slight differences).
For example, we can truncate the right-most 6 bits, resulting in 4 intervals,
each of which we consider to be a "value" (e.g., identify each interval
with its midpoint):
[0,64), [64,128), [128,192), [192,256) identified with values, 32, 96, 160, 224
Then there are only 10^8 = 100,000,000 itemsets (the 5 endpoints 0, 64, 128,
192, 256 give 5*4/2 = 10 possible intervals in each band, counting unions of
adjacent base intervals). That's still a lot!
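Truncating the low-order bits is just a right shift. A minimal sketch for 8-bit values, with 6 bits dropped by default; the function names are illustrative.

```python
from math import comb

def truncate(value, dropped_bits=6):
    """Index of the interval containing value after dropping low-order bits."""
    return value >> dropped_bits

def interval_of(value, dropped_bits=6):
    """Half-open interval [lo, hi) that contains value."""
    width = 1 << dropped_bits
    lo = truncate(value, dropped_bits) * width
    return lo, lo + width

def midpoint(value, dropped_bits=6):
    """The representative "value" (interval midpoint) for a raw reading."""
    lo, hi = interval_of(value, dropped_bits)
    return (lo + hi) // 2

# Intervals with endpoints chosen among 0, 64, 128, 192, 256.
ranges_per_band = comb(5, 2)

print([midpoint(v) for v in (14, 127, 193, 254)])
print(ranges_per_band ** 8)
```

Dropping fewer bits gives more, narrower intervals: with dropped_bits=4 there are 16 intervals of width 16 per band.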
Further pruning can be done by understanding what kinds of rules are probably
of interest to the user and focusing on those only. For instance:
For a precision farmer, there is probably little interest in rules of the type,
R>48 => G<134.
A physicist might be interested in relationships among colors observed
(both antecedent and consequent from visible bands), but the farmer is
interested only in relationships where the antecedent is from the color
bands and the consequent is from the yield band (he or she wants to know
what observed color combinations predict high yield).
Therefore, for precision agriculture, we could restrict to those rules that have
consequent from the yield band (and then only the particular interval which
indicates "high yield") and antecedent from the others, so 10^7 = 10,000,000
itemsets to consider.
We will refer to restrictions of this type (in the type of itemsets allowed for
antecedent and consequent based on interest) as restricting to rules which
are "of interest" (OI rules), as distinct from the notion of rules that are
"interesting". OI rules can be interesting or not interesting, depending on
such measures as support and confidence, etc.
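The "of interest" restriction is a filter on the shape of a rule, applied before support or confidence is ever computed. A hedged sketch: the rule representation (a pair of sets of (band, interval) items), the band names and the particular high-yield interval are all assumptions made for illustration.

```python
YIELD = "Y"                 # hypothetical name of the yield band
HIGH_YIELD = (128, 256)     # hypothetical interval standing for "high yield"

def is_of_interest(rule):
    """OI test for the precision farmer: the consequent is exactly the
    high-yield item and the antecedent uses only non-yield bands."""
    antecedent, consequent = rule
    return (
        bool(antecedent)
        and consequent == {(YIELD, HIGH_YIELD)}
        and all(band != YIELD for band, _ in antecedent)
    )

farmer_rule = ({("NIR", (48, 256)), ("R", (0, 32))}, {(YIELD, HIGH_YIELD)})
physicist_rule = ({("R", (49, 256))}, {("G", (0, 134))})

print(is_of_interest(farmer_rule), is_of_interest(physicist_rule))
```

Whether an OI rule is also "interesting" is a separate question, decided afterwards by measures such as support and confidence.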
Slalom analogy:
Each transaction (pixel), t, is like a path down a ski hill, each
item is an interval in one band and therefore like a "gate" on the ski slope:
A transaction (pixel) "contains" an itemset, if it "goes thru" each gate
(has band-i reflectance in interval-i).
So if x is an itemset (set of "gates", one for each band),
s(x) is the proportion of paths passing thru the gates of x.
b1 b2 b3 b4 b5 b6 b7 b8
| | .---. | |
t---. | / | \ | |
`---------------------------' | \ | |
| | | \ | .----
| | | | \_______/ |
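In code, the slalom picture amounts to an interval test per band. A small sketch with three hypothetical bands and made-up pixel values; contains and s are illustrative names.

```python
def contains(pixel, gates):
    """True if the pixel "goes thru" every gate: its value in each gated
    band lies in that band's half-open interval [lo, hi)."""
    return all(lo <= pixel[band] < hi for band, (lo, hi) in gates.items())

def s(gates, pixels):
    """Support of the itemset: proportion of pixels passing through all gates."""
    return sum(contains(p, gates) for p in pixels) / len(pixels)

# Hypothetical pixels (paths) and one itemset x (one gate per band).
pixels = [
    {"b1": 30, "b2": 100, "b3": 200},
    {"b1": 70, "b2": 90, "b3": 210},
    {"b1": 10, "b2": 130, "b3": 250},
    {"b1": 50, "b2": 70, "b3": 100},
]
x = {"b1": (0, 64), "b2": (64, 128), "b3": (192, 256)}

print(s(x, pixels))
```

Only the first pixel passes all three gates, so s(x) = 1/4 here; each of the other pixels misses one gate.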
Non-equi-length:
In some cases, it would be better to allow users to partition a band into
intervals of uneven length, so that user knowledge can be applied in choosing them.
E.g., band bi can be partitioned into 3 intervals {[0,64), [64,128), [128,256)}
(if there aren't many values between 128 and 255).
Applying the user's domain knowledge can increase the accuracy and efficiency of the assoc rules.
Equi-depth partitioning (each partition has approx. the same number of pixels).
Can be done by setting the endpoints so that there are (approximately) the same
number of values in each interval (e.g., at the median value for two intervals), etc.
Sometimes this leads to more reasonable rules.
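One simple way to get equi-depth endpoints is to sort the band's values and cut at evenly spaced ranks. This is a sketch of that idea only, not the only possible method; the sample values are made up and the function names are illustrative.

```python
from bisect import bisect_right

def equi_depth_endpoints(values, n_intervals, upper=256):
    """Endpoints 0 = e0 <= e1 <= ... <= upper such that each interval
    [e_i, e_(i+1)) holds roughly the same number of values."""
    ordered = sorted(values)
    cuts = [ordered[(k * len(ordered)) // n_intervals] for k in range(1, n_intervals)]
    return [0] + cuts + [upper]

def interval_index(value, endpoints):
    """Index i with endpoints[i] <= value < endpoints[i+1]."""
    return bisect_right(endpoints, value) - 1

# Hypothetical skewed band: most reflectances are low.
values = [3, 5, 8, 10, 12, 15, 20, 40, 90, 200, 220, 250]
endpoints = equi_depth_endpoints(values, 3)

counts = [0] * 3
for v in values:
    counts[interval_index(v, endpoints)] += 1

print(endpoints, counts)
```

On this sample the cuts fall at 12 and 90, giving four values per interval, whereas an equilength split into three would put most values in the first interval. With many repeated values, neighboring cut points can coincide, so the counts are only approximately equal in general.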
Whether partitioning is equilength or not, it can be easily characterized as:
For each band, b, choose interval end-points, e0=0, e1, ..., en+1=256,
then the items are ( b, [ei,ei+1) ), i=0,...,n
(in the equilength case there is a common length, ei - e(i-1) = a constant),