DATA MINING
is unstructured querying.
The whole point of a database is to residualize relationships among
data items for enterprise use.
Relationships, as studied by ER diagrams and other tools for modeling data
(Data Engineering) are described using relations (or tables).
The relation, R(A1,..An) defined on domains D1,...,Dn is a degree=n
relationship among values in the n domains
Any relationship can be diagrammed (or pictured) using a graph.
The graph of the relation(ship), R(A1,...,An) is an n-partite undirected graph
in which the n-way hyper-edges interconnect values from D1,...,Dn.
This hypergraph is difficult to draw usefully on a 2-D plane
(sheet of paper), and therefore is seldom attempted.
However, for a degree=2 relationship, R(A1,A2), drawing the bipartite
graph is very helpful in understanding and studying the relationship.
The bipartite graph is often called an x-y scatter plot, where
x is a variable on D1 and y is a variable on D2 and
each plot point represents a related pair (edge in the graph)
The relational model is a horizontal model in which the focus is on the edges
(horizontal data structure listing the nodes involved in that edge plus,
possibly, node labels and/or edge labels).
A Vertical Model focuses on the nodes.
E.g., for each D1-entity-node, the D1-centric vertical model
associates (or bit maps) the set of D2-entity-nodes
that are related to that D1-node.
And, for each D2-entity-node, the D2-centric vertical model
associates (or bit maps) the set of D1-entity-nodes
that are related to that D2-node.
Market basket data (MBR)
Data is organized vertically as a
TRANSACTION TABLE with 2 attributes: T(Tid, Itemset)
- A transaction is a customer transaction at a cash register.
- Each is given an identifier, Tid.
- Itemset is the set of items in the customer's "basket".
Note that tuples in T are not "flat" (each associated itemset is a "set")
That can be problematic for analysis, so typically
a transformation is made to the dual, the Boolean or Bitmap model:
Market Basket Data, the Boolean or Bitmapped Model:
Boolean Transaction Table: BT(Tid, Item-1, Item-2,... Item-n)
Tid is a transaction identifier
Each Item-i column is a Boolean column, so each row is a bit vector which indicates
which items that transaction relates to, by turning on (to 1) only
those bit positions corresponding to related items.
Clearly, we don't want to have to specify the correspondence (map)
between bit positions and items anew for every column. Therefore we
do this mapping once and for all using a Domain Vector Table or DVT.
The DVT need not map the entire domain but only the
Extant Domain (of all currently existing values)
So a 1-bit means that item is in the market basket and a
0-bit means that item is not in the market basket.
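The T-to-BT transformation can be sketched in a few lines of Python. This is only an illustration: the names `to_boolean_table` and `dvt` are made up here, and the data is the four-transaction sample database used in the Apriori examples of these notes.

```python
# Transaction table T(Tid, Itemset): each itemset is a set, so T is not "flat".
T = {
    100: {"a", "c", "d"},
    200: {"b", "c", "e"},
    300: {"a", "b", "c", "e"},
    400: {"b", "e"},
}

def to_boolean_table(transactions):
    """Return (dvt, bt): the item order and one bit row per Tid."""
    # Domain Vector Table: fix the bit-position-to-item map once,
    # covering only the items that actually occur (the extant domain).
    dvt = sorted(set().union(*transactions.values()))
    bt = {
        tid: [1 if item in itemset else 0 for item in dvt]
        for tid, itemset in transactions.items()
    }
    return dvt, bt

dvt, BT = to_boolean_table(T)
print("Tid", *dvt)
for tid, bits in BT.items():
    print(tid, *bits)
```

Sorting the items is just one convenient way to fix the bit positions; any fixed order recorded in the DVT would do.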
Again, in any bipartite relationship between two entities, T and I
(eg, Customer Transactions and purchasable items in Market Basket Research),
there are always two vertical models,
one in which we focus on D1-entity nodes
(and list or map, for each, the set of related D2-entity-nodes)
the other in which we focus on D2-entity nodes
(and list or map, for each, the set of related D1-entity-nodes)
Thus in MBR, one has the dual vertical model, I(Iid, TransSet)
which, for each item, lists or bitmaps the transactions involving it.
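A rough Python sketch of the item-centric dual, I(Iid, TransSet), together with a bitmapped version. The names `item_centric`, `BI` and `support_count` are assumptions for illustration, and storing each bitmap as a Python integer is just one possible encoding.

```python
# Same sample database as the Apriori examples in these notes.
T = {
    100: {"a", "c", "d"},
    200: {"b", "c", "e"},
    300: {"a", "b", "c", "e"},
    400: {"b", "e"},
}

def item_centric(transactions):
    """Return (tids, I, BI): Tid order, item -> Tid set, item -> bitmap."""
    tids = sorted(transactions)          # bit position p corresponds to tids[p]
    I, BI = {}, {}
    for pos, tid in enumerate(tids):
        for item in transactions[tid]:
            I.setdefault(item, set()).add(tid)
            BI[item] = BI.get(item, 0) | (1 << pos)
    return tids, I, BI

tids, I, BI = item_centric(T)

def support_count(itemset):
    """Count transactions containing every item: AND the bitmaps, count 1-bits."""
    bits = (1 << len(tids)) - 1          # start with all transactions
    for item in itemset:
        bits &= BI.get(item, 0)
    return bin(bits).count("1")

print(I["b"], support_count({"b", "e"}))
```

One reason the vertical form is attractive: counting the transactions that contain a whole itemset reduces to ANDing a few bitmaps.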
Note that in MBR T, BT, I and BI usually only record existence/non-existence
of each item in market baskets, not the number of particular items
We can think of it this way: An item id is a UPC (Universal Product Code)
or barcode only (identifying the "type" not the instances of that item).
With nano-RFID tags, ePC will be used (electronic Product Code) which not
only distinguishes type but also instance of an item (like VINs for autos)
When RFID item identification becomes ubiquitous, we will need to analyze
by ePC not just UPC.
Much research still needs to be done on analyzing data
where the number of each item (the count) is important (e.g., ePC tagged items)
We can treat that situation by identifying instances of items and using T or BT
above, or by using
COUNT TRANSACTION TABLE: CT(Tid, Item-1, Item-2, ..., Item-n),
where values are the counts of each (UPC id-ed) item in that trans.
Note: T(Tid,ItemSet) and BT(Tid,I1..In)
don't take account of hierarchical structures of items
- e.g., in T, milk is an item at one level,
- milk breaks down into skim, whole, 2%.. at a finer level...
- Work on hierarchical MBR is ongoing. New ideas are welcome!
In the Market Basket Research (MBR) models , these Vertical Boolean Tables are
- extremely wide (many many columns)
- sometimes very shallow (not many trans e.g., Cancer data - few samples)
- extremely sparse (mostly 0's - i.e., no customer buys most of the
items in the store in one shopping trip!)
Bioinformatics/genetics data is remarkably similar.
Microarray Data Analysis (MDA) is the analysis of the gene expression levels
of genes spotted on glass slides (Microarrays or Gene Chips) and subjected to
"treatments" or experiments that record the gene expression level
before (using red dye) and
after (using green dye).
For each experiment and each spotted gene, the logarithm of the ratio of
red/green is recorded in a Microarray Data Analysis Table.
MDA is usually stored as an Excel spreadsheet:
GeneTable: GT(Gid,E1...En)
row = gene
column = experiment (plus other columns)
value = log ratio of r/g (a Real Number)
BinaryGeneTable: BGT(Gid,E1...En)
is the table you get by setting a threshold
expression ratio and recording 1 iff it is exceeded:
Note: sometimes 3-value logic is used, in which there is an expression
threshold and a repression threshold. We will call that the
TernaryGeneTable: TGT(Gid, E1,...,En)
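A small Python sketch of the thresholding step. The log-ratio values in `GT`, the thresholds, and the 1/0/-1 encoding of the ternary case are all made up for illustration; only the idea (record 1 iff the threshold is exceeded) comes from the notes.

```python
def binarize(log_ratio, threshold):
    """1 iff the expression threshold is exceeded, else 0."""
    return 1 if log_ratio > threshold else 0

def ternarize(log_ratio, expression_threshold, repression_threshold):
    """1 = expressed, -1 = repressed, 0 = neither (encoding is an assumption)."""
    if log_ratio > expression_threshold:
        return 1
    if log_ratio < repression_threshold:
        return -1
    return 0

# Hypothetical GeneTable GT(Gid, E1, E2, E3): rows are genes, values are log ratios.
GT = {
    "g1": [1.2, -0.3, 0.1],
    "g2": [-1.5, 0.9, 2.0],
}

BGT = {g: [binarize(v, 1.0) for v in row] for g, row in GT.items()}
TGT = {g: [ternarize(v, 1.0, -1.0) for v in row] for g, row in GT.items()}
print(BGT)
print(TGT)
```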
BinaryExperimentTable: BET(Eid, Gene1,...,GeneN) is similar to BT in MBR
****************************************************************
Formally in MBR, the Boolean Transaction Table, BT, is defined as follows:
I={i1..im} is the set of items.
=====
- eg, an item for purchase in a store
- Each item in a store is an attribute, Ai,
- with Boolean values (1 = in a customer's "market basket
or shopping cart" and 0=not in it).
An itemset is a subset of I, (eg, set of items in a store)
========
A k-itemset is an itemset of cardinality (size) k
=========
D={t1..tn} are transactions (eg, customer transaction at checkout)
============
Each ti has an identifier and an itemset, ti = (t-id, t-itemset)
A transaction,t,
SUPPORTS an itemset,A, if A IS CONTAINED IN t-itemset.
========
An Association rule is an implication A => C,
================ where A and C are disjoint itemsets.
( A = antecedent and C = consequent)
- rules have quality or interestingness measures,
two of which are support and confidence:
SUPPORT OF ITEMSET B is the ratio, s, of transactions containing B
==================
SUPPORT OF RULE A=>C is the support of A u C
===============
CONFIDENCE OF A=>C = fraction of those trans supporting A that
========== also support C.
- conf(A=>C) = supp(AuC) / supp(A)
- The confidence measures the strength of the implication.
- Both support and confidence can be measured as %'s.
- As a %, confidence is the conditional probability, P(C|A)
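The two measures can be written directly from their definitions. A minimal Python sketch, using the four-transaction sample database from the Apriori examples in these notes; the function names `supp` and `conf` are just illustrative.

```python
# Each transaction's itemset; Tids are not needed for the measures.
D = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
]

def supp(itemset, D):
    """Fraction of transactions whose itemset contains `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in D if itemset <= t) / len(D)

def conf(A, C, D):
    """conf(A=>C) = supp(A u C) / supp(A), i.e. an estimate of P(C|A)."""
    return supp(set(A) | set(C), D) / supp(A, D)

print(supp({"b", "e"}, D))          # support of the itemset {b,e}
print(conf({"b", "c"}, {"e"}, D))   # confidence of b^c => e
print(conf({"b", "e"}, {"c"}, D))   # confidence of b^e => c
```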
FREQUENT ITEMSETs are those with support >= a threshold, minsupp.
================
- The set of frequent k-itemsets is denoted Lk.
CONFIDENT RULEs are those with confidence >= a threshold, minconf.
==============
STRONG RULEs are confident rules with frequent support sets.
===========
Given a user specified minsupp and minconf, our first task is to
find ALL strong rules, called Association Rule Mining, ARM, using:
=======================
1. Find all frequent itemsets, Lk. (for each k >= 1)
2. For each frequent itemset, B,
find all strong rules supported by that frequent itemset
(find all antecedent subsets, A s.t. A==>B-A is strong)
- the performance of ARM is largely determined by 1.
APRIORI ALGORITHM
=================
Based on the Apriori pruning property:
Any subset of a frequent itemset must also be frequent.
FINDING FREQUENT ITEMSETS:
Start by finding all frequent 1-itemsets, L1.
Then candidates for L2 consist of joins of sets from L1,
where 2 itemsets "join" if they're identical except for 1 member.
Let Ck = set of Candidate k-itemsets ( Lk-1 JOIN Lk-1 )
1st Iteration: Scan D for L1.
Kth Iteration: Create Ck as Lk-1 JOIN Lk-1
Scan D to count the support of each candidate in Ck; the frequent ones form Lk
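The level-wise search above can be sketched as follows. This is a simple, unoptimized Python rendering (it rescans D for every level and compares all pairs in the join), not a tuned implementation; `apriori` and `minsupp_count` are names chosen here, and the data is the sample database of the examples in these notes.

```python
from itertools import combinations

D = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
]

def apriori(D, minsupp_count):
    """Return {frozenset: support count} for every frequent itemset."""
    def count(candidates):
        # One scan of D per level: count the transactions supporting each candidate.
        return {c: sum(1 for t in D if c <= t) for c in candidates}

    items = {i for t in D for i in t}
    Lk = {c: n for c, n in count({frozenset([i]) for i in items}).items()
          if n >= minsupp_count}
    frequent = dict(Lk)
    k = 2
    while Lk:
        prev = set(Lk)
        # Join: union two (k-1)-itemsets that are identical except for 1 member.
        Ck = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        Ck = {c for c in Ck
              if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        Lk = {c: n for c, n in count(Ck).items() if n >= minsupp_count}
        frequent.update(Lk)
        k += 1
    return frequent

L = apriori(D, 2)   # minsupp = 50% of 4 transactions
for itemset, n in sorted(L.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), n)
```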
GENERATING STRONG RULES:
For each B in Lk, find all strong rules, A => B-A.
A' < A -> supp(A') >= supp(A) -> conf(A'=>B-A') <= conf(A=>B-A)
If A is not a strong-rule-antecedent in B, then A' isn't either.
So, for each B in Lk,
1. start with largest antecedent sets (k-1 item subsets)
2. next consider only (k-2)-item antecedents for which
every (k-1)-item SuperAntecedent produced a strong rule
Said another way (better way?),
Consider only those 2-item consequents for which both
1-item subsets were strong rule consequents
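The "grow the consequents" formulation can be sketched in Python as below. The support counts in `SUPP` are copied from the frequent-itemset table of the examples in these notes (minsupp = 50%); the function names are illustrative, and this is a sketch rather than a reference implementation.

```python
from itertools import combinations

# Support counts of the frequent itemsets for the sample database, minsupp = 50%.
SUPP = {frozenset(s): n for s, n in [
    ("a", 2), ("b", 3), ("c", 3), ("e", 3),
    ("ac", 2), ("bc", 2), ("be", 3), ("ce", 2), ("bce", 2),
]}

def strong_rules(B, supp_count, minconf):
    """Return [(antecedent, consequent, confidence)] for strong rules A => B-A."""
    B = frozenset(B)
    rules = []
    consequents = [frozenset([i]) for i in B]      # start with 1-item consequents
    m = 1
    while consequents and m < len(B):
        kept = set()
        for C in consequents:
            A = B - C
            c = supp_count[B] / supp_count[A]
            if c >= minconf:
                rules.append((A, C, c))
                kept.add(C)
        # Consider an (m+1)-item consequent only if every one of its
        # m-item subsets was itself a strong-rule consequent.
        grown = {a | b for a in kept for b in kept if len(a | b) == m + 1}
        consequents = [c for c in grown
                       if all(frozenset(s) in kept for s in combinations(c, m))]
        m += 1
    return rules

def all_strong_rules(minconf):
    return [r for B in SUPP if len(B) >= 2
            for r in strong_rules(B, SUPP, minconf)]

for A, C, c in all_strong_rules(0.8):
    print(sorted(A), "=>", sorted(C), round(c, 3))
```

With minconf = 0.8 the only 2-item consequent ever tested for {b,c,e} is {b,e}, because only b and e survive as 1-item consequents.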
SUMMARY:
1. supp(B) = |{t: B is a subset of t-itemset}| / |D|
2. supp(A=>C) = supp(AuC)
3. conf(A=>C) = supp(AuC)/supp(A)
APRIORI
4. Scan D to find L1
5. Form candidate 2-itemsets, C2, as L1 JOIN L1
6. Scan C2 for L2;
...
7. For each Lk,
find strong minimal consequents;
find strong minimal superset consequents of those, etc.
EXAMPLE 1:
I = {a,b,c,d,e}; D = {100,200,300,400}
Sample transaction database:
TID ItemLists
--- ------------------------------
100 a c d
200 b c e
300 a b c e
400 b e
minsupp=50% (itemset is frequent if >= 2 transactions support it)
minconf=60% (rule is confident if its conditional prob >= .6)
The process of finding frequent itemsets:
C1 C2 C3 C4
Iset Sup Freq Iset Sup Freq Iset Sup Freq Freq Iset gen ends.
{a} 2 y {a,b} 1 {b,c,e} 2 y
{b} 3 y {a,c} 2 y
{c} 3 y {a,e} 1
{d} 1 {b,c} 2 y
{e} 3 y {b,e} 3 y
{c,e} 2 y
Derive association rules.
For frequent 3-itemsets, start with 1-item consequents: conf?
Rule1: b^c ==> e, confidence = 100%. =Sup{b,c,e}/Sup{b,c} y
Rule2: b^e ==> c, confidence = 66.7%. =Sup{b,c,e}/Sup{b,e} y
Rule3: c^e ==> b, confidence = 100%. =Sup{b,c,e}/Sup{c,e} y
Form all 2-item consequents from high-conf 1-item consequents:
Rule4: b ==> c^e, confidence = 66.7%. =Sup{b,c,e}/Sup{b} y
Rule5: c ==> b^e, confidence = 66.7%. =Sup{b,c,e}/Sup{c} y
Rule6: e ==> b^c, confidence = 66.7%. =Sup{b,c,e}/Sup{e} y
For each frequent 2-Isets, start with 1-item consequents:
For {a,c}
Rule7: a ==> c, confidence = 100% = Sup{a,c}/Sup{a} y
Rule8: c ==> a, confidence = 66.7% = Sup{a,c}/Sup{c} y
For {b,c}
Rule9: b ==> c, confidence = 66.7% = Sup{b,c}/Sup{b} y
Rule10: c ==> b, confidence = 66.7% = Sup{b,c}/Sup{c} y
For {b,e}
Rule11: b ==> e, confidence = 100% = Sup{b,e}/Sup{b} y
Rule12: e ==> b, confidence = 100% = Sup{b,e}/Sup{e} y
For {c,e}
Rule13: c ==> e, confidence = 66.7% = Sup{c,e}/Sup{c} y
Rule14: e ==> c, confidence = 66.7% = Sup{c,e}/Sup{e} y
All 14 rules are high-confidence.
EXAMPLE 2:
minconf=80%, minsupp=50%:
We get the same frequent itemsets (since we have the same minsupp)
C1 C2 C3 C4
Iset Sup Freq Iset Sup Freq Iset Sup Freq Iset gen ends.
{a} 2 y {a,b} 1 {b,c,e} 2 y
{b} 3 y {a,c} 2 y
{c} 3 y {a,e} 1
{d} 1 {b,c} 2 y
{e} 3 y {b,e} 3 y
{c,e} 2 y
Derive association rules.
For frequent 3-itemsets, start with 1-item consequents: conf?
Rule1: b^c ==> e, confidence = 100%. =Sup{b,c,e}/Sup{b,c} y
Rule2: b^e ==> c, confidence = 66.7%. =Sup{b,c,e}/Sup{b,e}
Rule3: c^e ==> b, confidence = 100%. =Sup{b,c,e}/Sup{c,e} y
then all 2-item consequents from high-conf 1-item consequents:
Rule5: c ==> b^e, confidence = 66.7%. =Sup{b,c,e}/Sup{c}
For each frequent 2-Isets, start with 1-item consequents:
For {a,c}
Rule7: a ==> c, confidence = 100% = Sup{a,c}/Sup{a} y
Rule8: c ==> a, confidence = 66.7% = Sup{a,c}/Sup{c}
For {b,c}
Rule9: b ==> c, confidence = 66.7% = Sup{b,c}/Sup{b}
Rule10: c ==> b, confidence = 66.7% = Sup{b,c}/Sup{c}
For {b,e}
Rule11: b ==> e, confidence = 100% = Sup{b,e}/Sup{b} y
Rule12: e ==> b, confidence = 100% = Sup{b,e}/Sup{e} y
For {c,e}
Rule13: c ==> e, confidence = 66.7% = Sup{c,e}/Sup{c}
Rule14: e ==> c, confidence = 66.7% = Sup{c,e}/Sup{e}
Only Rules 1,3,7,11,12 are high-confidence.
EXAMPLE 3: mconf=80%, msup=70%
TId Items
--- -----------------------------
100 a c d
200 b c e
300 a b c e
400 b e
We get new frequent itemsets.
Cand_1-Isets Cand_2-Isets Cand_3-Isets is empty.
Iset Sup Freq Iset Sup Freq Freq Iset generation
{a} 2 {b,c} 2 ends.
{b} 3 y {b,e} 3 y
{c} 3 y {c,e} 2
{d} 1
{e} 3 y
Derive association rules.
For frequent 2-itemsets, start with 1-item consequents: conf?
Rule1: b => e, confidence = 100%. =Sup{b,e}/Sup{b} y
Rule2: e => b, confidence = 100%. =Sup{b,e}/Sup{e} y
Rules 1,2 are high-confidence.
EXAMPLE 4: mconf=80%, msup=80%
We get new frequent itemsets.
TId Items
--- -----------------------------
100 a c d
200 b c e
300 a b c e
400 b e
Cand_1-Isets Cand_2-Isets is empty.
Iset Sup Freq Freq Iset generation
{a} 2 ends.
{b} 3
{c} 3
{d} 1
{e} 3
derive association rules. There are no frequent itemsets. done.
These examples should demonstrate how much the pruning rules
simplify the cases with higher support and confidence.
HASH-BASED techniques (hashing itemset counts)
=====================
- To reduce the size of Ck for k > 1 (especially k=2), while
scanning D to determine which itemsets in Ck are to be in Lk
create a hash table of counts of (k+1)-itemsets
Example:
Take an item universe, I = {I1,I2,I3,I4,I5}, and a transaction database, D, as follows:
Tid T-itemset
---- --------------
T100 | I1 I2 I5
T200 | I2 I4
T300 | I2 I3
T400 | I1 I2 I4
T500 | I1 I3
T600 | I2 I3
T700 | I1 I3
T800 | I1 I2 I3 I5
T900 | I1 I2 I3
T1000| I1 I4
minsupp_cnt = 6
While finding frequent 1-itemsets, create a count histogram
of the form:
Itemset Support
------- -------
{I1}
{I2}
{I3}
{I4}
{I5}
Also create hash table by hashing (Ix,Iy) using
H2(x,y)= ( x*5 + y )MOD7
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
-----------|-----|-----|-----|-----|-----|-----|-----|
bucket_cnt | | | | | | | |
-----------|-----|-----|-----|-----|-----|-----|-----|
buck_content | | | | | | |
(Note that the bucket_content is just included to aid the reader)
Starting scan at: T100 | I1 I2 I5
Itemset Support
------- -------
{I1} 1
{I2} 1
{I3}
{I4}
{I5} 1
H2(1,2)= (1*5+2=7)MOD7 = 0
H2(1,5)= (1*5+5=10)MOD7= 3
H2(2,5)= (2*5+5=15)MOD7= 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 1 | 1 | | 1 | | | |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5| |I1 I5| | | |
| | | | | | | |
Continuing scan at: T200 | I2 I4
Itemset Support
------- -------
{I1} 1
{I2} 2
{I3}
{I4} 1
{I5} 1
H2(2,4)= (2*5+4=14)MOD7= 0
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 2 | 1 | | 1 | | | |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5| |I1 I5| | | |
|I2 I4| | | | | | |
Continuing scan at: T300 | I2 I3
H2(2,3)= (2*5+3=13)MOD7= 6
Itemset Support
------- -------
{I1} 1
{I2} 3
{I3} 1
{I4} 1
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 2 | 1 | | 1 | | | 1 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5| |I1 I5| | |I2 I3|
|I2 I4| | | | | | |
Continuing scan at: T400 | I1 I2 I4
H2(1,2)= (1*5+2=7)MOD7= 0
H2(1,4)= (1*5+4=9)MOD7= 2
H2(2,4)= (2*5+4=14)MOD7=0
Itemset Support
------- -------
{I1} 2
{I2} 4
{I3} 1
{I4} 2
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 4 | 1 | 1 | 1 | | | 1 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5|I1 I4|I1 I5| | |I2 I3|
|I2 I4| | | | | | |
|I1 I2| | | | | | |
|I2 I4| | | | | | |
Continuing scan at: T500 | I1 I3
H2(1,3)= (1*5+3=8)MOD7= 1
Itemset Support
------- -------
{I1} 3
{I2} 4
{I3} 2
{I4} 2
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 4 | 2 | 1 | 1 | | | 1 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5|I1 I4|I1 I5| | |I2 I3|
|I2 I4|I1 I3| | | | | |
|I1 I2| | | | | | |
|I2 I4| | | | | | |
Continuing scan at: T600 | I2 I3
H2(2,3)= (2*5+3=13)MOD7= 6
Itemset Support
------- -------
{I1} 3
{I2} 5
{I3} 3
{I4} 2
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 4 | 2 | 1 | 1 | | | 2 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5|I1 I4|I1 I5| | |I2 I3|
|I2 I4|I1 I3| | | | |I2 I3|
|I1 I2| | | | | | |
|I2 I4| | | | | | |
Continuing scan at: T700 | I1 I3
H2(1,3)= (1*5+3=8)MOD7= 1
Itemset Support
------- -------
{I1} 4
{I2} 5
{I3} 4
{I4} 2
{I5} 1
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 4 | 3 | 1 | 1 | | | 2 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
buck_content|I1 I2|I2 I5|I1 I4|I1 I5| | |I2 I3|
(discontinue|I2 I4|I1 I3| | | | |I2 I3|
showing |I1 I2|I1 I3| | | | | |
buck_content)I2 I4
Continuing scan at: T800 | I1 I2 I3 I5
H2(1,2)= (1*5+2=7)MOD7= 0
H2(1,3)= (1*5+3=8)MOD7= 1
H2(1,5)= (1*5+5=10)MOD7=3
H2(2,3)= (2*5+3=13)MOD7=6
H2(2,5)= (2*5+5=15)MOD7=1
H2(3,5)= (3*5+5=20)MOD7=6
Itemset Support
------- -------
{I1} 5
{I2} 6
{I3} 5
{I4} 2
{I5} 2
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 5 | 5 | 1 | 2 | | | 4 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
Continuing scan at: T900 | I1 I2 I3
H2(1,2)= (1*5+2=7)MOD7= 0
H2(1,3)= (1*5+3=8)MOD7= 1
H2(2,3)= (2*5+3=13)MOD7=6
Itemset Support
------- -------
{I1} 6
{I2} 7
{I3} 6
{I4} 2
{I5} 2
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 6 | 6 | 1 | 2 | | | 5 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
Continuing scan at: T1000| I1 I4
H2(1,4)= (1*5+4=9)MOD7= 2
Itemset Support
------- -------
{I1} 7
{I2} 7
{I3} 6
{I4} 3
{I5} 2
.-----------------------------------------------------.
|bucket_addr| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt | 6 | 6 | 2 | 2 | | | 5 |
`-----------+-----+-----+-----+-----+-----+-----+-----'
Since Minsup_cnt = 6, L1 = {I1,I2,I3}
The usual C2 would be { {I1,I2}, {I1,I3}, {I2,I3} }
but by first applying H2 we see that C2 can be pruned
to { {I1,I2} {I1,I3} } since H2({I2,I3})=6 and that bucket
count is only 5.
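The whole scan can be replayed in a few lines of Python. This sketch uses the ten transactions and the hash function H2 of the example (items are written as the integers 1..5 instead of I1..I5); the variable names are illustrative.

```python
from itertools import combinations

# T100 .. T1000, items I1..I5 written as 1..5.
D = [
    [1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3],
    [2, 3], [1, 3], [1, 2, 3, 5], [1, 2, 3], [1, 4],
]
MINSUPP_CNT = 6

def h2(x, y):
    """Hash function of the example: H2(x,y) = (x*5 + y) MOD 7."""
    return (x * 5 + y) % 7

item_count = {}
bucket_cnt = [0] * 7
for t in D:
    # The same scan counts 1-itemsets and hashes every 2-itemset of t.
    for i in t:
        item_count[i] = item_count.get(i, 0) + 1
    for x, y in combinations(sorted(t), 2):
        bucket_cnt[h2(x, y)] += 1

L1 = sorted(i for i, n in item_count.items() if n >= MINSUPP_CNT)
# A pair can only be frequent if its whole bucket reaches the threshold.
C2 = [p for p in combinations(L1, 2) if bucket_cnt[h2(*p)] >= MINSUPP_CNT]

print(item_count)
print(bucket_cnt)
print(L1, C2)
```

A bucket count is an upper bound on the count of each pair hashed into it, so a pair in a light bucket can be dropped safely; a pair in a heavy bucket still has to be counted exactly.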
DHP:
===
The above is an introduction to the DHP
(Direct Hashing and Pruning) methods.
DIC:
===
Dynamic Itemset Counting method begins to count
cand 3-itemsets before completing the count of cand 2-itemsets,
cand 4-itemsets before completing the count of cand 3-itemsets,
etc. This reduces the number of database scans required.
TRANSACTION REDUCTION
=====================
(2nd method for improving the efficiency of Apriori)
- A trans that does not contain any frequent k-itemsets
cannot contain any frequent (k+1)-itemsets.
Therefore it can be removed.
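A short Python sketch of this reduction step; `reduce_transactions` is a name chosen here. The data is the sample database of the Apriori examples, with L2 = {{b,e}} as obtained there for minsupp = 70%.

```python
D = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
]

def reduce_transactions(D, Lk):
    """Keep only transactions that contain at least one frequent k-itemset."""
    return [t for t in D if any(s <= t for s in Lk)]

L2 = [frozenset({"b", "e"})]
reduced = reduce_transactions(D, L2)
print(len(D), "->", len(reduced))   # the transaction {a,c,d} is dropped
```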
PARTITIONING (partitioning the data to find candidate itemsets)
============
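- partition D into n partitions (each with a local support count threshold,
minsup', of minsup/n for equal-sized partitions, i.e., the same % threshold)
- throw out those itemsets that don't achieve minsup' in any
partition (an itemset that is frequent in D must be frequent
in at least one partition).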
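A rough Python sketch of the partitioning idea, relying on the fact that an itemset frequent in all of D must be frequent (at the same percentage) in at least one partition. The brute-force `local_frequent` helper is only workable for tiny partitions and stands in for whatever in-memory miner would really be used; all names here are assumptions.

```python
from itertools import combinations

D = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
]

def needed(minsupp_pct, size):
    """Smallest count that is >= minsupp_pct percent of `size` (integer math)."""
    return (minsupp_pct * size + 99) // 100

def local_frequent(part, minsupp_pct):
    """Brute force: count every subset of every transaction in one partition."""
    counts = {}
    for t in part:
        for k in range(1, len(t) + 1):
            for s in combinations(sorted(t), k):
                counts[frozenset(s)] = counts.get(frozenset(s), 0) + 1
    need = needed(minsupp_pct, len(part))
    return {s for s, n in counts.items() if n >= need}

def partitioned_frequent(D, n, minsupp_pct):
    size = -(-len(D) // n)                       # ceiling division
    parts = [D[i:i + size] for i in range(0, len(D), size)]
    # Phase 1: locally frequent itemsets become the global candidates.
    candidates = set().union(*(local_frequent(p, minsupp_pct) for p in parts))
    # Phase 2: one more scan of D counts the candidates exactly.
    need = needed(minsupp_pct, len(D))
    return {c for c in candidates if sum(1 for t in D if c <= t) >= need}

result = partitioned_frequent(D, n=2, minsupp_pct=50)
print(sorted(sorted(s) for s in result))
```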
SAMPLING (mining on a subset of the given data)
========
- Pick a sample, S;
Look for frequent itemsets in S (may miss some)
FP-growth:
=========
Another method for improving the efficiency of finding
frequent itemsets is the FP-growth method in which a
complex data structure is constructed from which all
frequent itemsets can be determined without doing
additional database scans.
In this method we try to reduce time required to find frequent
item sets by going right to the isolation of frequent itemsets
without first generating candidate frequent item sets. This
will reduce the size and number of database scans required.
Assume a minimum support count of 2.
TID Items
T100 | I1 I2 I5
T200 | I2 I4
T300 | I2 I3
T400 | I1 I2 I4
T500 | I1 I3
T600 | I2 I3
T700 | I1 I3
T800 | I1 I2 I3 I5
T900 | I1 I2 I3
T1000| I3 I4
First scan the database for frequent 1-itemsets and
sort them in order of descending support count:
L-order: I2:7, I1:6, I3:6, I4:2, I5:2
(counts taken over T100..T900, the transactions inserted into the tree below)
Then create the root of the FP-tree and label it null:
(_) null
Scan the database processing items in L-order.
Create a branch in L-order for each trans (label nodes item:count)
T100 | I2 I1 I5 (the items I1 I2 I5 taken in L-order)
Item_Header_Table ....(_) null
Item Cnt Link I2:1_/
----- --- ---- .---- > (_)
{I2} 1-------' I1:1_/
{I1} 1----------- > (_)
{I3} 0 /
{I4} 0 I5:1_/
{I5} 1------- > (_)
To facilitate tree traversal, an Item_Header_Table (IHT) is built
so each item is linked to its occurrences in the tree.
Continuing: T200 | I2 I4
Since we already have an I2:1 node linked from the root, we share it
and increment its count (always share prefixes with existing paths).
Item_Header_Table ....(_) null
Item Cnt Link I2:2_/
----- --- ---- .---- > (_)
{I2} 2-------' I1:1_/ \ _I4:1
{I1} 1----------- > (_) (_)
{I3} 0 / ^
{I4} 1-. I5:1_/ :
{I5} 1------- > (_) :
: :
`------------------'
Continuing: T300 | I2 I3
____
Item_Header_Table ...(null)
Item Cnt _/__
----- --- .--------(I2:3)
{I2} 3-------' ____/ _|__ \
{I1} 1----------(I1:1) (I4:1) (I3:1)
{I3} 1-------------/------:----'
{I4} 1-. ____/ :
{I5} 1-------(I5:1) :
: :
`------------------'
Continuing: T400 | I2 I1 I4
____
Item_Header_Table ...(null)
Item Cnt _/__
----- --- .--------(I2:4)
{I2} 4-------' ____/ _|__ \
{I1} 2----------(I1:2) (I4:1) (I3:1)
{I3} 1-------------/-|--:--:---'
{I4} 2-. ____/ | . .
{I5} 1-.-----(I5:1) (I4:1) .
. .
`-------------------'
Continuing: T500 | I1 I3
.........................
Item_Header_Table : ...(null).... :
Item Cnt : _/__ \_:__
----- -- ..........:.....(I2:4) (I1:1)
{I2} 4.: :___/ _\__ \ _\__
{I1} 3..........(I1:2) (I4:1) (I3:1)---(I3:1)
{I3} 2............/.\...:..:...'
{I4} 2.. ___/ \__: :
{I5} 1.:.....(I5:1) (I4:1) :
:...................:
Continuing: T600 | I2 I3
.........................
Item_Header_Table : ...(null).... :
Item Cnt : _/__ \_:__
----- -- ..........:.....(I2:5) (I1:1)
{I2} 5.: :___/ _\__ \ _\__
{I1} 3..........(I1:2) (I4:1) (I3:2)---(I3:1)
{I3} 3............/.\...:..:...'
{I4} 2.. ___/ \__: :
{I5} 1.:.....(I5:1) (I4:1) :
:...................:
Continuing: T700 | I1 I3
.........................
Item_Header_Table : ...(null).... :
Item Cnt : _/__ \_:__
----- -- ..........:.....(I2:5) (I1:2)
{I2} 5.: :___/ _\__ \ _\__
{I1} 4..........(I1:2) (I4:1) (I3:2)---(I3:2)
{I3} 4............/.\...:..:...'
{I4} 2.. ___/ \__: :
{I5} 1.:.....(I5:1) (I4:1) :
:...................:
Continuing: T800 | I2 I1 I3 I5
.........................
Item_Header_Table : ...(null).... :
Item Cnt : __/____ \_:__
----- -- ..........:....(__I2:6_) (I1:2)
{I2} 6.: :___/ _\____ \ _\__
{I1} 5..........(I1:3) (_I4:1_) (I3:2) (I3:2)
{I3} 5............/.\.\:......:...' `.....' :
{I4} 2.. ___/ __\_:\ : :
{I5} 2.:....(I5:1)(I4:1)(I3:1): :
:.....:............\..:: :
: \ :...............:
: __\_
: (I5:1)
:.............:
Continuing: T900 | I2 I1 I3
.........................
Item_Header_Table : ...(null).... :
Item Cnt : __/____ \_:__
----- -- ..........:....(__I2:7_) (I1:2)
{I2} 7.: :___/ _\____ \ _\__
{I1} 6..........(I1:4) (_I4:1_) (I3:2) (I3:2)
{I3} 6............/.\.\:......:...' `.....' :
{I4} 2.. ___/ __\_:\ : :
{I5} 2.:....(I5:1)(I4:1)(I3:2): :
:.....:............\..:: :
: \ :...............:
: __\_
: (I5:1)
:.............:
This gives us all the frequent patterns of any length
and therefore no further database scans are necessary.
- tremendous time savings!
- only two database scans necessary and then extensive
processing of the FP-tree.
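The tree construction walked through above can be sketched in Python as follows. Only the construction of the FP-tree and the Item_Header_Table is shown (the mining of the tree is not), over the nine transactions T100..T900; the `Node` class, `build_fp_tree`, and the tie-breaking by item name are assumptions made for this sketch.

```python
from collections import Counter

D = [  # T100 .. T900
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> child Node

def build_fp_tree(D, min_count):
    # Scan 1: support counts, then L-order (descending count, ties by name).
    freq = Counter(i for t in D for i in t)
    order = sorted((i for i in freq if freq[i] >= min_count),
                   key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)         # the "null" root
    header = {item: [] for item in order}   # item -> its nodes in the tree
    # Scan 2: insert each transaction in L-order, sharing existing prefixes.
    for t in D:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order

root, header, order = build_fp_tree(D, min_count=2)
print(order)
for item in order:
    print(item, [n.count for n in header[item]])
```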
Mining Distance-based Association Rules
=======================================
- Previous section described QARs where quantitative attribs
are discretized initially by binning,
then the intervals are combined.
- Such an approach may not capture semantics since it ignores
distances
- A distance-based AR method captures the semantics of interval
data while allowing for approximation in data values.
- a 2-phase algorithm can be used to mine distance based ARs.
- The FIRST PHASE employs clustering to find intervals or clusters
- a density threshold and a frequency threshold are required of
a cluster (must be close and numerous)
- The SECOND PHASE obtains distance-based ARs by searching for
groups of clusters that occur frequently together.
- To conclude that Ac => Cc, we want the antecedent-cluster, Ac,
when projected onto the consequent-space to be within the
consequent cluster, Cc.
ARM and Correlation Analysis
============================
- most ARM methods employ a support-confidence framework to
weed out uninteresting or misleading rules.
- even strong ARs can be uninteresting or misleading.
- methods of statistical independence and correlation analysis
can help weed them out.
- Example:
MISLEADING RULES and REDUNDANT RULES
====================================
In MBR basket case, consider T={tea} and C={coffee} |D|=100 trans.
MISLEADING:
coffee NOTcoffee|total
.---------------|---- Conf(T=>C)=|TUC|/|T|= 20/25= .8
tea | 20 5 | 25 Conf(D=>C)=supp(C)=
NOTtea | 70 5 | 75 |C|/|D|=90/100= .9
------|---------------|---- So the rule T=>C is misleading.
total | 90 10 | 100
REDUNDANT:
coffee NOTcoffee| tot
.-----------------|---- C(T=>C)=20/22= .9091
tea | 20 2 | 22 C(D=>C)=90/100= .9000
NOTtea | 70 8 | 78
------|-----------------|---- Within .0091 of each other
total | 90 10 | 100 so they are redundant rules.
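Both tables come down to comparing conf(T=>C) with the unconditional supp(C). A small Python sketch of that comparison follows; the ratio conf/supp(C), commonly called lift, is a standard correlation measure that these notes do not name, and the function name is made up here.

```python
def confidence_and_lift(n_both, n_antecedent, n_consequent, n_total):
    """Return (conf(A=>C), supp(C), lift) from contingency-table counts."""
    conf = n_both / n_antecedent
    prior = n_consequent / n_total      # conf(D=>C) = supp(C)
    return conf, prior, conf / prior

# MISLEADING table: tea=25, tea&coffee=20, coffee=90, |D|=100
conf1, prior1, lift1 = confidence_and_lift(20, 25, 90, 100)
# REDUNDANT table: tea=22, tea&coffee=20, coffee=90, |D|=100
conf2, prior2, lift2 = confidence_and_lift(20, 22, 90, 100)

print(round(conf1, 4), prior1, round(lift1, 4))   # lift below 1: tea lowers P(coffee)
print(round(conf2, 4), prior2, round(lift2, 4))   # lift near 1: tea adds almost nothing
```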
Text Association Rule Mining:
----------------------------
Given an alphabet, A, the
nWordUniverse, Wn = A u AA u AAA u ... u AA..A
(i.e., Wn = A^1 u A^2 u A^3 u ... u A^n, where A^k is AA..A with k factors)
where, eg, AA = {ab | a and b are distinct and belong to A}.
Let W be a subset of U(i=1..n)Wi for some n (the DICTIONARY)
Let S be a subset of U(i=1..m)Wi some m (the SENTENCES)
- m is usually bigger than n
We can do ARM on the Universe of "Items", W, and
"Transactions", S, as above.
- A HIGH CONFIDENCE RULE, A => F (A,F are disjoint WordSets)
tells us: if all the A-words occur in a sentence then, with
high confidence, all the F-words will also.
- If the rule, A => F, has HIGH SUPPORT, that means all the
A-words and all the F-words occur in most of the sentences.
There are alternate ways to deal with such text situations.
Machine Learning (ML)
is an older term for Data Mining; it included two (CLASSIFICATION and CLUSTERING)
of the three Data Mining areas: Association Rule Mining, Classification and Clustering.
A still older term, Artificial Intelligence (AI), included all of these and much more.
CLASSIFICATION is the central area of the three!
Given a (large) TRAINING SET T(A1, ..., An, C) with CLASS C
and FEATURES A ≡ (A1,...,An)
C-CLASSIFICATION of an unclassified
sample, (a1,...,an) is just:
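SELECT T.C
FROM T
WHERE T.A1 = a1
AND T.A2 = a2
...
AND T.An = an
GROUP BY T.C
ORDER BY COUNT(*) DESC -- most frequent class first
LIMIT 1; -- dialect-specific (standard SQL: FETCH FIRST 1 ROW ONLY)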
i.e., just a SELECTION, since C-Classification is assigning to (a1..an)
the most frequent C-value among the T-tuples with A=(a1..an).
But, if the EQUALITY SELECTION is empty,
then we need a FUZZY QUERY to find NEAR NEIGHBORs (NNs)
instead of exact matches.
That's Nearest Neighbor Classification (NNC).
If SQL had a good Nearest Neighbor Set operator, we would be done.
But it doesn't, so NNC is essentially building a good NEAR NEIGHBOR operator.
E.g.,
Medical Expert System (Ask a Nurse): Symptoms plus past diagnoses
are collected into a table called CASES
For each undiagnosed new_symptoms,
CASES is searched for matches: SELECT DIAGNOSIS
FROM CASES
WHERE CASES.SYMPTOMS = new_symptoms;
If there is a predominant DIAGNOSIS,
Then report it,
ElseIf there's no predominant DIAGNOSIS,
Then Classify instead of Query, i.e.,
find fuzzy matches (near nbrs) SELECT DIAGNOSIS
FROM CASES
WHERE CASES.SYMPTOMS ≅ new_symptoms
Else call your doctor in the morning
That's exactly (Nearest Neighbor) Classification!!
CAR TALK radio show: Click and Clack the Tappet brothers have a vast
TRAINING SET of car problems and solutions built from experience.
They search that TRAINING SET for close matches to predict solutions
based on previous successful cases.
That's exactly (Nearest Neighbor) Classification!!
We all perform Nearest Neighbor Classification every day of our lives.
E.g., We learn when to apply specific programming/debugging techniques so that
we can apply them to similar situations thereafter.
COMPUTERIZED NNC ≡ MACHINE LEARNING
(most clustering (which is just partitioning) is done as
a simplifying prelude to classification).
Again, given a TRAINING SET, R(A1,..,An,C), with C=CLASSES and (A1..An)=FEATURES
Nearest Neighbor Classification (NNC) ≡
selecting a set of R-tuples with similar features (to the unclassified sample)
and then letting the corresponding class values vote.
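A minimal Python sketch of "select near neighbors, then let their classes vote". The training tuples, the choice of Euclidean distance and k=3 are all hypothetical, chosen only to make the sketch runnable; defining "near" well is exactly the hard part in practice.

```python
from collections import Counter

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def knn_classify(training, sample, k, distance=euclidean):
    """training: list of (features, class). Return (winning class, vote counts)."""
    # The "fuzzy selection": the k training tuples nearest to the sample.
    neighbors = sorted(training, key=lambda row: distance(row[0], sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], votes

# Hypothetical training set R(A1, A2, C).
R = [
    ((1.0, 1.0), "x"), ((1.2, 0.8), "x"), ((0.9, 1.1), "x"),
    ((5.0, 5.0), "y"), ((5.5, 4.5), "y"),
]

label, votes = knn_classify(R, (1.1, 1.0), k=3)
print(label, dict(votes))
```

Returning the vote counts alongside the winner makes it easy to see when the vote is close to a tie.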
Nearest Neighbor Classification won't work very well if
the vote is inconclusive (close to a tie)
or if similar (near) is not well defined. In those cases we
build a MODEL of the TRAINING SET
(at, possibly, great 1-time expense?)
When a MODEL is built first the technique is called Eager classification,
whereas
model-less methods like Nearest Neighbor are called Lazy or Sample-based.
Eager classifier models can be:
decision trees,
probabilistic models (Bayesian Classifier),
Neural Networks,
Support Vector Machines, ...
How do you decide when an EAGER model is good enough to use?
How do you decide if a Nearest Neighbor Classifier is working well enough?
We have a TEST PHASE.
typically, we set aside some training tuples as a Test Set.
(then, of course, those Test tuples cannot be used in model building
and cannot be used as nearest neighbors)
If the classifier passes the test
(a high enough % of Test tuples are correctly
classified by the classifier) it is accepted.
EXAMPLE 1:
Computer Ownership TRAINING SET for predicting who owns a computer:
Customer Age Salary Job Owns Computer
1 | 24 | 55,000 | Programmer | yes
2 | 58 | 94,000 | Doctor | no
3 | 48 | 14,000 | Laborer | no
4 | 58 | 19,000 | Domestic | no
5 | 28 | 18,000 | Construction| no
A classifier might be built from this TRAINING SET (e.g., a decision tree) as follows:
              Age < 30
              /      \
             T        F
            /          \
     Salary > 50K       No
        /      \
       T        F
      /          \
    Yes           No
It is easy to determine a pattern in this small dataset, however for large
datasets it is impossible to construct a decision tree model by "sight".
Therefore we need a Model Building Algorithm or training algorithm
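Written as code, the hand-built tree is just two nested tests. The Python sketch below encodes it and checks it against the five training tuples; the function name `owns_computer` is illustrative, and the tree is the one read off "by sight" above, not the output of a training algorithm.

```python
def owns_computer(age, salary):
    """The decision tree drawn above: Age < 30, then Salary > 50K."""
    if age < 30:
        return "yes" if salary > 50_000 else "no"
    return "no"

# TRAINING SET: (Age, Salary, Owns Computer); Job is not used by this tree.
TRAINING = [
    (24, 55_000, "yes"),
    (58, 94_000, "no"),
    (48, 14_000, "no"),
    (58, 19_000, "no"),
    (28, 18_000, "no"),
]

correct = sum(owns_computer(age, sal) == label for age, sal, label in TRAINING)
print(correct, "of", len(TRAINING), "training tuples reproduced")
```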
EXAMPLE 2:
PRECISION AG YIELD CLASSIFIER predicts YIELD of a field grid cell
based on mid-year Blue, Green, Red, NearInfraRed reflectances from that cell.
The TRAINING SET is R(CELL, Blue, Green, Red, NIR, YIELD) from previous year.
1st Separate out a Test Set.
2nd Build a CLASSIFIER MODEL (decision tree) from remaining TRAINING SET
3rd Test MODEL accuracy using the Test Set. If it passes the test,
then when an aerial photo is taken during the growing season,
predict where low yield can be expected using the MODEL
(then apply additional nutrients to those cells?)
TRAINING SET
X Y Blue_____ Green____ Red_____ NIR_____ YIELD_
0 0 | 0000 1001 | 1010 1111 | 0000 0110 | 1111 0101 | medium
0 1 | 0000 1011 | 1011 0100 | 0000 0101 | 1111 0111 | medium
0 2 | 0000 1011 | 1011 0101 | 0000 0100 | 1111 0111 | high
0 3 | 0000 0111 | 1011 0111 | 0000 0011 | 1111 1000 | high
0 4 | 0000 0111 | 1011 1011 | 0000 0001 | 1111 1001 | high
0 6 | 0000 1000 | 1011 1111 | 0000 0000 | 1111 1011 | high
1 0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium
2 1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium
3 2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium
4 3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high
5 4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high
6 6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high
Separate out as TEST SET
X Y Blue_____ Green____ Red_____ NIR_____ YIELD_
1 0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium
2 1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium
3 2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium
4 3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high
5 4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high
6 6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high
TRAIN a Classifier with the remainder (a decision tree)
REMAINDER of the TRAINING SET
X Y Blue_____ Green____ Red_____ NIR_____ YIELD_
0 0 | 0000 1001 | 1010 1111 | 0000 0110 | 1111 0101 | medium
0 1 | 0000 1011 | 1011 0100 | 0000 0101 | 1111 0111 | medium
0 2 | 0000 1011 | 1011 0101 | 0000 0100 | 1111 0111 | high
0 3 | 0000 0111 | 1011 0111 | 0000 0011 | 1111 1000 | high
0 4 | 0000 0111 | 1011 1011 | 0000 0001 | 1111 1001 | high
0 6 | 0000 1000 | 1011 1111 | 0000 0000 | 1111 1011 | high
____________________________________
/ | \
/ | \
/ | \
/ | \
NIR ≤ 01000000 01000000 < NIR ≤ 11110111 11110111 < NIR
^ Red ≥ 00100000 ^ 00100000 > Red ≥ 00000101 ^ 00000101 > Red
/ | \
/ | \
YIELD = low YIELD = medium YIELD = high
TEST Classifier
TEST SET
X Y Blue_____ Green____ Red_____ NIR_____ YIELD_ PREDICTED YIELD
1 0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium | medium
2 1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium | medium
3 2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium | medium
4 3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high | high
5 4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high | high
6 6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high | high
Tests out to be 100% correct (Gets an A grade!).
USE Classifier Model (decision tree) to classify:
New Data: R,G,B,NIR from an aerial image taken on ~4th of July:
X Y Blue_____ Green____ Red______ NIR______
0 6 | 0001 1100 | 1011 1110 | 0000 0001 | 1111 1110
___________________ ===================
/ | \\
/ | \\
/ | \\
/ | \\
NIR ≤ 01000000 01000000 < NIR ≤ 11110111 1111 0111 < NIR
^ Red ≥ 00100000 ^ 00100000 > Red ≥ 00000101 ^ 0000 0101 > Red
/ | \\
/ | \\
YIELD = low YIELD = medium YIELD = high
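The three branch tests of the yield tree can be sketched as a function of the Red and NIR bytes. The name predict_yield is illustrative, and returning None for inputs that match none of the three tests is an assumption of this sketch (the tree as drawn does not cover every Red/NIR pair).

```python
def predict_yield(red: int, nir: int):
    """Apply the three branch tests of the yield decision tree."""
    if nir <= 0b01000000 and red >= 0b00100000:
        return "low"
    if 0b01000000 < nir <= 0b11110111 and 0b00000101 <= red < 0b00100000:
        return "medium"
    if nir > 0b11110111 and red < 0b00000101:
        return "high"
    return None  # assumption: no branch of the drawn tree applies


# (Red, NIR, actual YIELD) for the six TEST SET cells
TEST_SET = [
    (0b00000111, 0b11110100, "medium"),
    (0b00000110, 0b11110110, "medium"),
    (0b00000101, 0b11110110, "medium"),
    (0b00000010, 0b11111000, "high"),
    (0b00000010, 0b11111000, "high"),
    (0b00000001, 0b11111010, "high"),
]

for red, nir, actual in TEST_SET:
    print(f"{red:08b} {nir:08b} actual={actual} predicted={predict_yield(red, nir)}")

# New cell (0,6) from the July image: Red = 0000 0001, NIR = 1111 1110
print(predict_yield(0b00000001, 0b11111110))
```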
Preparing Data for Classification
Data Cleaning (of noise and missing values)
Remove Noise (or reduce noise) by "smoothing"
Fill in missing values (with most common or some statistical value)
NOTE: Even Noise and Missing Value management can be done by a
Nearest Neighbor Vote! (called interpolation)
Feature Extraction to eliminate irrelevant attributes (e.g., in the PA example,
eliminate Blue, Green since they're irrelevant to the decision).
Ways of Comparing Different Classification Methods
Predictive Accuracy (predicting the class label of new data)
Speed (computation costs for generating and using the model)
Robustness (does it give almost the same predictions when
the Training Sets are almost the same?)
Scalability (Model construction efficiency - massive datasets)
More Detail on Some Classification Methods:
K-Nearest-Neighbor Classification
Decision Tree Models for EAGER CLASSIFICATION:
each inode is a test on a feature attribute (composite?),
each test outcome is assigned a link to the next level
(outcome=a value or range of values or...)
each leaf represents a class (or distribution of classes)
Unknown samples are classified by testing their feature attributes against the tree.
The leaf arrived at, holds the class prediction for that sample.
Some branches may represent noise or outliers (and should be pruned?)
The ID3 algorithm for inducing a decision tree from training tuples is:
1. The tree starts as a single node containing the entire TRAINING SET.
2. If all TRAINING TUPLES have the same class, this node is a leaf. DONE.
3. Otherwise, use a measure, information gain, as a heuristic for
selecting the best decision attribute for that node
4. Branch is created for each value [interval of values] of that test attribute
and the TRAINING SET is partitioned accordingly.
5. Recurses on 2,3,4, until The Stopping Condition is true.
Possible Stopping Conditions:
All samples for a given node belong to the same class (label with that class)
∃ no remaining candidate decision attributes (label with plurality class).
Some other stopping rule.
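The steps above can be sketched as a short recursive function. This is a minimal illustration for categorical attributes, not a full ID3 implementation: the names entropy, info_gain and id3, the nested-dict tree layout, and the five-row toy dataset are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information needed to classify a sample with these labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                 # step 2: single class -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> plurality class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # step 3
    node = {"attr": best, "branches": {}}
    for value in sorted(set(r[best] for r in rows)):              # step 4
        keep = [i for i, r in enumerate(rows) if r[best] == value]
        node["branches"][value] = id3(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attrs if a != best],  # step 5: recurse without 'best'
        )
    return node


# Hypothetical toy data, used only to exercise the sketch.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "stay"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```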
Information Gain as an Attribute Selection Measure
Minimizes expected number of tests needed to classify an object
and guarantees simple tree (not necessarily the simplest)
At any stage, let
S be the current TRAINING SUBSET, with s = |S| samples.
C1,...,Cm be the distinct classes in S, and si = |S∩Ci|.
EXPECTED INFORMATION needed to classify a sample given S as TRAINING SET is:
I(s1,...,sm) = -SUM(i=1..m)[ pi*log2(pi) ], pi = si/s
Choosing A (with values a1,...,av) as decision attribute partitions S into
A1,...,Av, where Aj = {t in S : t(A)=aj} and sij = |Aj∩Ci|.
The ENTROPY (expected info still needed after splitting on A) is
E(A) = SUM(j=1..v)[ (s1j+..+smj)/s * I(s1j,...,smj) ]
Gain(A) = I(s1,...,sm) - E(A)
- expected reduction of info required to classify
after splitting via A-values.
The algorithm above computes the information gain of each
attribute and selects the one with the highest information gain
as the test attribute.
Branches are created for each value of that attribute and samples
are partitioned accordingly.
Pruning
=======
When a decision tree is built, many of the branches will reflect
anomalies in the training data due to noise or outliers.
Tree pruning methods address this problem of "overfitting" the
data (classifying situations that are erroneous or accidental).
Such methods typically use statistical measures to remove the
least reliable branches, resulting in faster classification and
an improvement in the ability of the tree to correctly classify
independent test data.
Extracting Classification Rules from Decision Trees
One rule for each path from root to leaf.
Each (attr,value) pair along the path forms a conjunct in the antecedent.
Leaf holds class prediction or consequent.
May be easier for humans to understand rules.
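Rule extraction can be sketched as a walk over every root-to-leaf path. The nested-dict tree layout and the name extract_rules are assumptions of this sketch; the small tree below uses three B2 branches from the worked example that follows in these notes, with the B2=0011 subtree read off Sample_Set_2.

```python
def extract_rules(node, path=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):            # leaf: holds the class prediction
        antecedent = " AND ".join(f"{a}={v}" for a, v in path) or "TRUE"
        return [f"IF {antecedent} THEN class={node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += extract_rules(child, path + ((node["attr"], value),))
    return rules


tree = {
    "attr": "B2",
    "branches": {
        "0010": "0111",
        "0011": {"attr": "B3", "branches": {"0100": "0111", "1000": "0011"}},
        "0111": "0011",
    },
}

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```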
More on Decision Tree Induction (powerpoint Introduction)
Note that the notion of "near" requires a distance or similarity measure to exist.
What are some of them?
Metrics (distance functions on feature space)
The example:
Training Data:
Band B1:      Band B2:      Band B3:   Band B4:
 3  3  7  7 |  7  3  3  2 | 8 8 4 5 | 11 15 11 11
 3  3  7  7 |  7  3  3  2 | 8 8 4 5 | 11 11 11 11
 2  2 10 15 | 11 11 10 10 | 8 8 4 4 | 15 15 11 11
 2 10 15 15 | 11 11 10 10 | 8 8 4 4 | 15 15 11 11
S:
X-Y B1 B2 B3 B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
Suppose that B1 is the class label attribute (e.g., Yield)
Then the class labels are 2, 3, 7, 10, 15 (C1,..,C5).
We need to know the count of the number of pixels (rows in
the table above) that contain each value in each attribute.
We also need to know the count of pixels that contain pairs of
values, one from a descriptive attribute and the other from the
class label attribute.
Moreover we may wish to focus on only a portion of the dataset
(some part of the field) before making those count calculations.
The Ptree structure is perfect for providing those counts.
B11 B12 B13 B14
0000 0011 1111 1111
0000 0011 1111 1111
0011 0001 1111 0001
0111 0011 1111 0011
BASIC_PTREES_band1___________________
P1,1 P1,2 P1,3 P1,4
5 7 16 11
0 0 1 4 0 4 0 3 4 4 0 3
3 0 0 <-where "different" bit is
VALUE_PTREES_band1___________________ (2-bit precision, 3, etc):
P1(00) P1(01) P1(10) P1(11)
7 4 2 3
4 0 3 0 0 4 0 0 0 0 1 1 0 0 0 3
3 3 0 0
P1(000) P1(010) P1(100) P1(110) P1(001) P1(011) P1(101) P1(111)
0 0 0 0 7 4 2 3
4 0 3 0 0 4 0 0 0 0 1 1 0 0 0 3
3 3 0 0
P1(0000 P1(0100 P1(1000 P1(1100 P1(0010 P1(0110 P1(1010 P1(1110
0 0 0 0 3 0 2 0
0 0 3 0 0 0 1 1
3 3 0
P1(0001 P1(0101 P1(1001 P1(1101 P1(0011 P1(0111 P1(1011 P1(1111
0 0 0 0 4 4 0 3
4 0 0 0 0 4 0 0 0 0 0 3
0
B21 B22 B23 B24
0000 1000 1111 1110
0000 1000 1111 1110
1111 0000 1111 1100
1111 0000 1111 1100
BASIC_PTREES_band2___________________
P2,1 P2,2 P2,3 P2,4
8 2 16 10
0 0 4 4 2 0 0 0 4 2 4 0
02 02 <-positions of the
two 1-bits
VALUE_PTREES_band2___________________
P2(00) P2(01) P2(10) P2(11)
6 2 8 0
2 4 0 0 2 0 0 0 0 0 4 4
13 02
P2(000 P2(010 P2(100 P2(110 P2(001 P2(011 P2(101 P2(111
0 0 0 0 6 2 8 0
2 4 0 0 2 0 0 0 0 0 4 4
13 02
P2(0000 P2(0100 P2(1000 P2(1100 P2(0010 P2(0110 P2(1010 P2(1110
0 0 0 0 2 0 4 0
0 2 0 0 0 0 0 4
13
P2(0001 P2(0101 P2(1001 P2(1101 P2(0011 P2(0111 P2(1011 P2(1111
0 0 0 0 4 2 4 0
2 2 0 0 2 0 0 0 0 0 4 0
1302 02
B31 B32 B33 B34
1100 0011 0000 0001
1100 0011 0000 0001
1100 0011 0000 0000
1100 0011 0000 0000
BASIC_PTREES_band3___________________
P3,1 P3,2 P3,3 P3,4
8 8 0 2
4 0 4 0 0 4 0 4 0 2 0 0
13
VALUE_PTREES_band3___________________
P3(00) P3(01) P3(10) P3(11)
0 8 8 0
0 4 0 4 4 0 4 0
P3(000) P3(010) P3(100) P3(110) P3(001) P3(011) P3(101) P3(111)
0 8 8 0 0 0 0 0
0 4 0 4 4 0 4 0
P3(0000 P3(0100 P3(1000 P3(1100 P3(0010 P3(0110 P3(1010 P3(1110
0 6 8 0 0 0 0 0
0 2 0 4 4 0 4 0
02
P3(0001 P3(0101 P3(1001 P3(1101 P3(0011 P3(0111 P3(1011 P3(1111
0 2 0 0 0 0 0 0
0 2 0 0
13
B41 B42 B43 B44
1111 0100 1111 1111
1111 0000 1111 1111
1111 1100 1111 1111
1111 1100 1111 1111
BASIC_PTREES_band4___________________
P4,1 P4,2 P4,3 P4,4
16 5 16 16
1 0 4 0
1
VALUE_PTREES_band4___________________
P4(00 P4(01 P4(10 P4(11
0 0 11 5
3 4 0 4 1 0 4 0
1 1
P4(000 P4(010 P4(100 P4(110 P4(001 P4(011 P4(101 P4(111
0 0 0 0 0 0 11 5
3 4 0 4 1 0 4 0
1 1
P4(0000 P4(0100 P4(1000 P4(1100 P4(0010 P4(0110 P4(1010 P4(1110
0 0 0 0 0 0 0 0
P4(0001 P4(0101 P4(1001 P4(1101 P4(0011 P4(0111 P4(1011 P4(1111
0 0 0 0 0 0 11 5
3 4 0 4 1 0 4 0
1 1
Suppose we take this relation as training set (4-bit values).
Let B1 be the class label attribute.
Then the classes are:
{ C1,C2,C3,C4,C5 } =
{ 2, 3, 7,10,15 } where Ci={ci}.
The ID3 alg for inducing a decision tree from training samples:
S:
X-Y B1 B2 B3 B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
1. Tree starts as one node representing the training samples, S.
2. If all samples are in same class (same B1-value)
then S becomes a leaf with that class label. [Not the case here: S contains 5 classes.]
3. Else, use entropy-based, "information gain" as a heuristic for
selecting the first decision attribute.
Take B2 = {a1,a2,a3,a4,a5} = { 2, 3, 7,10,11 }
as the first candidate attribute.
Aj={t:t(B2)=aj}, where a1=0010, a2=0011, a3=0111,
a4=1010, a5=1011.
sij is number of samples of class, Ci, in a subset, Aj.
so sij = rc( P1(ci)^P2(aj) ), where ci is in {2,3,7,10,15}
and aj is in {2,3,7,10,11}.
++---------+----------+----------+----------+--------
|| P2(2) | P2(3) | P2(7) |P2(10) |P2(11)
|| 2 | 4 | 2 | 4 | 4
--.----------|| 0 2 0 0 | 2 2 0 0 | 2 0 0 0 | 0 0 0 4 | 0 0 4 0
ci| P1(ci) || 13 | 1302 | 02 | |
==+==========++=========+==========+==========+==========+========
2| 3 || 0 | 0 | 0 | 0 | 3
| 0 0 3 0 || | | | | 0 0 3 0
| 3 || | | | | 3
--+----------++---------+----------+----------+----------+--------
3| 4 || 0 | 2 | 2 | 0 | 0
| 4 0 0 0 || | 2 0 0 0 | 2 0 0 0 | |
| || | 13 | 02 | |
--+----------++---------+----------+----------+----------+--------
7| 4 || 2 | 2 | 0 | 0 | 0
| 0 4 0 0 || 0 2 0 0 | 0 2 0 0 | | |
| || 13 | 02 | | |
--+----------++---------+----------+----------+----------+--------
10| 2 || 0 | 0 | 0 | 1 | 1
| 0 0 1 1 || | | | 0 0 0 1 | 0 0 1 0
| 3 0 || | | | 0 | 3
--+----------++---------+----------+----------+----------+--------
15| 3 || 0 | 0 | 0 | 3 | 0
| 0 0 0 3 || | | | 0 0 0 3 |
| 0 || | | | 0 |
--+----------++---------+----------+----------+----------+--------
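The sij counts in the table above can be reproduced with plain bitmaps. This sketch uses one flat Python integer per value (bit k set when pixel k holds that value) in place of the quadrant-structured P-trees, which is a simplifying assumption; the names value_bitmap and rc are illustrative.

```python
# Training set S in X-Y order 0,0 .. 3,3; each row is (B1, B2, B3, B4).
S = [
    (3, 7, 8, 11), (3, 3, 8, 15), (7, 3, 4, 11), (7, 2, 5, 11),
    (3, 7, 8, 11), (3, 3, 8, 11), (7, 3, 4, 11), (7, 2, 5, 11),
    (2, 11, 8, 15), (2, 11, 8, 15), (10, 10, 4, 11), (15, 10, 4, 11),
    (2, 11, 8, 15), (10, 11, 8, 15), (15, 10, 4, 11), (15, 10, 4, 11),
]


def value_bitmap(band, value):
    """Bitmap with bit k set when pixel k has the given value in the band."""
    bits = 0
    for k, row in enumerate(S):
        if row[band] == value:
            bits |= 1 << k
    return bits


def rc(bitmap):
    """Root count: the number of 1-bits."""
    return bin(bitmap).count("1")


classes = [2, 3, 7, 10, 15]      # B1 values (class labels)
b2_values = [2, 3, 7, 10, 11]    # B2 values

# sij = rc( P1(ci) ^ P2(aj) ), one row per class
table = [[rc(value_bitmap(0, c) & value_bitmap(1, a)) for a in b2_values]
         for c in classes]
for c, row in zip(classes, table):
    print(c, row)
```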
EXPECTED INFO needed to classify the sample is:
I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],
m = 5
s = 16
si = 3,4,4,2,3 (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16
I = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
+3/16*lg2(3/16))
= -( -.453 -.5 -.5 -.375
-.453 )
= -( -2.281) = 2.281
ENTROPY based on the partition into subsets by B2 is
E(B2)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ] where
Ij = I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj|
Since m=5, the sij's are:
j=1 j=2 j=3 j=4 j=5
--- --- --- --- ---
0 0 0 0 3 <-- s1j
0 2 2 0 0 <-- s2j
2 2 0 0 0 <-- s3j
0 0 0 1 1 <-- s4j
0 0 0 3 0 <-- s5j
--- --- --- --- ---
2 4 2 4 4 <- s1j+..+s5j
j=1 j=2 j=3 j=4 j=5
--- --- --- --- ---
2 4 2 4 4 <- |Aj|
where Aj's are the rootcounts of P2(aj)'s.
Therefore,
j=1 j=2 j=3 j=4 j=5
--- --- --- --- ---
0 0 0 0 .75 <- p1j
0 .5 1 0 0 <- p2j
1 .5 0 0 0 <- p3j
0 0 0 .25 .25 <- p4j
0 0 0 .75 0 <- p5j
and
j=1 j=2 j=3 j=4 j=5
--- --- --- --- ---
0* 0 0 0 -.311 <- p1j*log2(p1j)
0 -.5 0 0 0 <- p2j*log2(p2j)
0 -.5 0 0 0 <- p3j*log2(p3j)
0 0 0 -.5 -.5 <- p4j*log2(p4j)
0 0 0 -.311 0 <- p5j*log2(p5j)
-- --- --- ----- ----
0 1 0 .811 .811 <- I(s1j..s5j) = - sum of above
2 4 2 4 4 <- s1j+..+s5j
so that,
0 .25 0 .203 .203 <- (s1j+..+s5j)*I(s1j..s5j)/16
.656 <- E(B2) (sum of above)
2.281 <- I(s1..sm)
GAIN(B2) -> 1.625 <- I(s1..sm)-E(B2)
NOTE: ONE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
Footnote * (If pij = 0 why is p1j*log2(p1j) = 0?
Hint: L'Hospital's Rule)
Continuing with B3
---------------------------------------------------------------
Take B3 = {a1,a2,a3} = {4,5,8} as the 2nd candidate attribute.
Aj={t:t(B3)=aj}, where a1=0100, a2=0101, a3=1000,
sij is number of samples of class, Ci, in a subset, Aj.
so sij = rc( P1(ci)^P3(aj) ), where ci is in {2,3,7,10,15}
and aj is in {4,5,8}.
++---------+----------+----------+
|| P3(4) | P3(5) | P3(8) |
|| 6 | 2 | 8 |
-------------|| 0 2 0 4 | 0 2 0 0 | 4 0 4 0 |
ci| P1(ci) || 02 | 13 | |
==+==========++=========+==========+==========+
2| 3 || 0 | 0 | 3 |
| 0 0 3 0 || | | 0 0 3 0 |
| 3 || | | 3 |
--+----------++---------+----------+----------+
3| 4 || 0 | 0 | 4 |
| 4 0 0 0 || | | 4 0 0 0 |
| || | | |
--+----------++---------+----------+----------+
7| 4 || 2 | 2 | 0 |
| 0 4 0 0 || 0 2 0 0 | 0 2 0 0 | |
| || 02 | 13 | |
--+----------++---------+----------+----------+
10| 2 || 1 | 0 | 1 |
| 0 0 1 1 || 0 0 0 1 | | 0 0 1 0 |
| 3 0 || 0 | | 3 |
--+----------++---------+----------+----------+
15| 3 || 3 | 0 | 0 |
| 0 0 0 3 || 0 0 0 3 | | |
| 0 || 0 | | |
--+----------++---------+----------+----------+
EXPECTED INFO needed to classify the sample is the same as above:
I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],
m = 5
s = 16
si = 3,4,4,2,3 (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16
I = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
+3/16*lg2(3/16))
= -( -.453 -.5 -.5 -.375
-.453 )
= -( -2.281) = 2.281
ENTROPY based on the partition into subsets by B3 is
E(B3)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ] where
I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj|
The sij's are:
j=1 j=2 j=3
--- --- ---
0 0 3 <-- s1j
0 0 4 <-- s2j
2 2 0 <-- s3j
1 0 1 <-- s4j
3 0 0 <-- s5j
--- --- ---
6 2 8 <- s1j+..+s5j
6 2 8 <- |Aj| (divisors)
0 0 .375 <- p1j
0 0 .5 <- p2j
.333 1 0 <- p3j
.167 0 .125 <- p4j
.5 0 0 <- p5j
0 0 -.531 <- p1j*log2(p1j)
0 0 -.5 <- p2j*log2(p2j)
-.528 0 0 <- p3j*log2(p3j)
-.431 0 -.375 <- p4j*log2(p4j)
-.5 0 0 <- p5j*log2(p5j)
-- --- ---
1.459 0 1.406 <- I(s1j..s5j)=- sum of above
6 2 8 <- s1j+..+s5j
.547 0 .703 <- (s1j+..+s5j)*I(s1j..s5j)/16
1.250 <- E(B3) (sum of above)
2.281 <- I(s1..sm)
GAIN(B3) -> 1.031 <- I(s1..sm) - E(B3)
Continuing with B4=A={a1..av} used to classify S into {A1..Av},
---------------------------------------------------------------
Take B4 = {a1,a2} = {11,15} as the 3rd candidate attribute.
Aj={t:t(B4)=aj}, where a1=1011, a2=1111
sij is number of samples of class, Ci, in a subset, Aj.
so sij = rc( P1(ci)^P4(aj) ), where ci is in {2,3,7,10,15}
and aj is in {11,15}.
++---------+----------+
|| P4(11) | P4(15) |
|| 11 | 5 |
-------------|| 3 4 0 4 | 1 0 4 0 |
ci| P1(ci) || 1 | 1 |
==+==========++=========+==========+
2| 3 || 0 | 3 |
| 0 0 3 0 || | 0 0 3 0 |
| 3 || | 3 |
--+----------++---------+----------+
3| 4 || 3 | 1 |
| 4 0 0 0 || 3 0 0 0 | 1 0 0 0 |
| || 1 | 1 |
--+----------++---------+----------+
7| 4 || 4 | 0 |
| 0 4 0 0 || 0 4 0 0 | |
| || | |
--+----------++---------+----------+
10| 2 || 1 | 1 |
| 0 0 1 1 || 0 0 0 1 | 0 0 1 0 |
| 3 0 || 0 | 3 |
--+----------++---------+----------+
15| 3 || 3 | 0 |
| 0 0 0 3 || 0 0 0 3 | |
| 0 || 0 | |
--+----------++---------+----------+
EXPECTED INFO needed to classify the sample is the same as above:
I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],
m = 5
s = 16
si = 3,4,4,2,3 (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16
I = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
+3/16*lg2(3/16))
= -( -.453 -.5 -.5 -.375
-.453 )
= -( -2.281) = 2.281
ENTROPY based on the partition into subsets by B4 is
E(B4)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ] where
I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj|
The sij's are:
j=1 j=2
--- ---
0 3 <-- s1j
3 1 <-- s2j
4 0 <-- s3j
1 1 <-- s4j
3 0 <-- s5j
--- ---
11 5 <- s1j+..+s5j
11 5 <- |Aj| (divisors)
0 .6 <- p1j
.273 .2 <- p2j
.364 0 <- p3j
.091 .2 <- p4j
.273 0 <- p5j
0 -.442 <- p1j*log2(p1j)
-.511 -.464 <- p2j*log2(p2j)
-.531 0 <- p3j*log2(p3j)
-.315 -.464 <- p4j*log2(p4j)
-.511 0 <- p5j*log2(p5j)
-- ---
1.868 1.37 <- I(s1j..s5j)= - sum of above
11 5 <- s1j+..+s5j
1.284 .428 <- (s1j+..+s5j)*I(s1j..s5j)/16
1.712 <- E(B4) (sum of above)
2.281 <- I(s1..sm)
GAIN(B4) > .568 <- I(s1..sm) - E(B4)
and
GAIN(B3) -> 1.031 <- I(s1..sm) - E(B3)
GAIN(B2) -> 1.625 <- I(s1..sm) - E(B2)
Thus we select B2 as the first level decision attribute.
NOTE: WE GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
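As a cross-check on the attribute selection, the three gains can be recomputed directly from the 16-pixel table. The sketch below counts rows in a Python list rather than ANDing P-trees, which is an assumption made for brevity; the names info and gain are illustrative.

```python
from collections import Counter
from math import log2

# Training set S in X-Y order 0,0 .. 3,3; each row is (B1, B2, B3, B4).
S = [
    (3, 7, 8, 11), (3, 3, 8, 15), (7, 3, 4, 11), (7, 2, 5, 11),
    (3, 7, 8, 11), (3, 3, 8, 11), (7, 3, 4, 11), (7, 2, 5, 11),
    (2, 11, 8, 15), (2, 11, 8, 15), (10, 10, 4, 11), (15, 10, 4, 11),
    (2, 11, 8, 15), (10, 11, 8, 15), (15, 10, 4, 11), (15, 10, 4, 11),
]


def info(counts):
    """I(s1..sm) = -SUM pi*log2(pi); zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def gain(band, class_band=0):
    """I(s1..sm) - E(band), with B1 (index 0) as the class label."""
    class_counts = Counter(row[class_band] for row in S)
    remainder = 0.0
    for value in set(row[band] for row in S):
        subset = Counter(row[class_band] for row in S if row[band] == value)
        size = sum(subset.values())
        remainder += size / len(S) * info(list(subset.values()))
    return info(list(class_counts.values())) - remainder


gains = {f"B{b + 1}": gain(b) for b in (1, 2, 3)}
best = max(gains, key=gains.get)
for name, g in gains.items():
    print(f"GAIN({name}) = {g:.3f}")
print("first-level decision attribute:", best)
```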
4. Branches are created for each value of B2 and samples are
partitioned accordingly. (If a partition is empty, generate a
leaf and label it with a most common class; C2 and C3 tie with
4 samples each, and C2, labeled with 0011, is used.)
.--- B2=0000 - > C2:0011
|--- B2=0001 - > C2:0011
|--- B2=0010 - > Sample_Set_1
|--- B2=0011 - > Sample_Set_2
|--- B2=0100 - > C2:0011
|--- B2=0101 - > C2:0011
|--- B2=0110 - > C2:0011
B2 --|--- B2=0111 - > Sample_Set_3
|--- B2=1000 - > C2:0011
|--- B2=1001 - > C2:0011
|--- B2=1010 - > Sample_Set_4
|--- B2=1011 - > Sample_Set_5
|--- B2=1100 - > C2:0011
|--- B2=1101 - > C2:0011
|--- B2=1110 - > C2:0011
`--- B2=1111 - > C2:0011
Sample_Set_1
X-Y B1 B3 B4
0,3 0111 0101 1011
1,3 0111 0101 1011
Sample_Set_2
X-Y B1 B3 B4
0,1 0011 1000 1111
0,2 0111 0100 1011
1,1 0011 1000 1011
1,2 0111 0100 1011
Sample_Set_3
X-Y B1 B3 B4
0,0 0011 1000 1011
1,0 0011 1000 1011
Sample_Set_4
X-Y B1 B3 B4
2,2 1010 0100 1011
2,3 1111 0100 1011
3,2 1111 0100 1011
3,3 1111 0100 1011
Sample_Set_5
X-Y B1 B3 B4
2,0 0010 1000 1111
2,1 0010 1000 1111
3,0 0010 1000 1111
3,1 1010 1000 1111
NOTE: WE DON'T NEED TO LIST OUT THE SAMPLE_SETS IN ORDER TO CONTINUE
5. The Algorithm recurses to form decision tree for the samples at
each partition. Once an attribute is the decision attribute at
a node, it is not considered further.
6. Stop when:
a. All samples for a given node belong to the same class or
b. no remaining attributes
(label leaf with majority class among the samples)
We note all samples belong to the same class for nodes:
Sample_Set_1, B2=0010, have class, C3:0111.
Sample_Set_3, B2=0111, have class, C2:0011.
NOTE: ONE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
One can determine that these Sample_Sets contain only one
B1 value (class label) from the Ptrees already computed:
++---------+----------+----------+----------+--------
|| P2(2) | P2(3) | P2(7) |P2(10) |P2(11)
|| 2 | 4 | 2 | 4 | 4
--.----------|| 0 2 0 0 | 2 2 0 0 | 2 0 0 0 | 0 0 0 4 | 0 0 4 0
ci| P1(ci) || 13 | 1302 | 02 | |
==+==========++=========+==========+==========+==========+========
2| 3 || 0 | 0 | 0 | 0 | 3
| 0 0 3 0 || | | | | 0 0 3 0
| 3 || | | | | 3
--+----------++---------+----------+----------+----------+--------
3| 4 || 0 | 2 | 2 | 0 | 0
| 4 0 0 0 || | 2 0 0 0 | 2 0 0 0 | |
| || | 13 | 02 | |
--+----------++---------+----------+----------+----------+--------
7| 4 || 2 | 2 | 0 | 0 | 0
| 0 4 0 0 || 0 2 0 0 | 0 2 0 0 | | |
| || 13 | 02 | | |
--+----------++---------+----------+----------+----------+--------
10| 2 || 0 | 0 | 0 | 1 | 1
| 0 0 1 1 || | | | 0 0 0 1 | 0 0 1 0
| 3 0 || | | | 0 | 3
--+----------++---------+----------+----------+----------+--------
15| 3 || 0 | 0 | 0 | 3 | 0
| 0 0 0 3 || | | | 0 0 0 3 |
| 0 || | | | 0 |
--+----------++---------+----------+----------+----------+--------
Thus the decision tree becomes:
.--- B2=0000 - > C2:0011
|--- B2=0001 - > C2:0011
|--- B2=0010 - > C3:0111
|--- B2=0011 - > Sample_Set_2
|--- B2=0100 - > C2:0011
|--- B2=0101 - > C2:0011
|--- B2=0110 - > C2:0011
B2 --|--- B2=0111 - > C2:0011
|--- B2=1000 - > C2:0011
|--- B2=1001 - > C2:0011
|--- B2=1010 - > Sample_Set_4
|--- B2=1011 - > Sample_Set_5
|--- B2=1100 - > C2:0011
|--- B2=1101 - > C2:0011
|--- B2=1110 - > C2:0011
`--- B2=1111 - > C2:0011
Sample_Set_2 (for B2=0011)
X-Y B1 B3 B4
0,1 0011 1000 1111
0,2 0111 0100 1011
1,1 0011 1000 1011
1,2 0111 0100 1011
Sample_Set_4 (for B2=1010)
X-Y B1 B3 B4
2,2 1010 0100 1011
2,3 1111 0100 1011
3,2 1111 0100 1011
3,3 1111 0100 1011
Sample_Set_5 (for B2=1011)
X-Y B1 B3 B4
2,0 0010 1000 1111
2,1 0010 1000 1111
3,0 0010 1000 1111
3,1 1010 1000 1111
Recursing the algorithm on Sample_Set_2 (B2=0011):
1. Subtree starts as single node, S = Sample_Set_2 (determined
by B2=0011, so that ANDing with P2(3) gives correct counts).
2. Not all samples are in the same class (same B1-value),
3. So, use entropy-based measure, "information gain" as a
heuristic for selecting the attribute that will best separate
the samples into individual classes
NOTE: WE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
We don't have to rescan the training_set to form the leaf
subsample_sets. We can just use the P-tree sets for those samples
That solves the problem (see 1. above).
Revising from 4. onward then (and expressing SubSampleSets as
revised P-trees):
.--- B2=0000 - > C2:0011
|--- B2=0001 - > C2:0011
|--- B2=0010 - > C3:0111
|--- B2=0011 - > Sample_Set_2
|--- B2=0100 - > C2:0011
|--- B2=0101 - > C2:0011
|--- B2=0110 - > C2:0011
B2 --|--- B2=0111 - > C2:0011
|--- B2=1000 - > C2:0011
|--- B2=1001 - > C2:0011
|--- B2=1010 - > Sample_Set_4
|--- B2=1011 - > Sample_Set_5
|--- B2=1100 - > C2:0011
|--- B2=1101 - > C2:0011
|--- B2=1110 - > C2:0011
`--- B2=1111 - > C2:0011
For Sample_Set_2 (for B2=0011=3) (only 2 classes have count>0)
-------------++-------+-------+----------++---------+---------.
\ ||P3(4) |P3(5) |P3(8) ||P4(11) |P4(15) |
\ ||6 |2 | 8 ||11 |5 |
`-------.||0 2 0 4|0 2 0 0| 4 0 4 0 ||3 4 0 4 |1 0 4 0 |
\| 02 | 13 | ||1 |1 |
ci|P1(ci)^P2(3)=======+=======+==========++=========+=========|
3| 2 ||0 |0 | 2 ||1 |1 |
| 2 0 0 0 || | | 2 0 0 0 ||1 0 0 0 |1 0 0 0 |
| 13 || | | 13 ||3 |1 |
--+----------++-------+-------+----------++---------+---------|
7| 2 ||2 |0 | 0 ||2 |0 |
| 0 2 0 0 ||0 2 0 0| | ||0 2 0 0 | |
| 02 || 02 | | || 02 | |
--+----------++-------+-------+----------++---------+---------
EXPECTED INFO needed to classify the sample:
I = I(s1,s2) = -SUM(i=1,2)[ pi * log2(pi) ],
m = 2 s = 4 (Sample_Set_2 holds 4 samples)
si = 2,2 (rootcounts of P1(ci)^P2(3) for ci in {3,7})
pi = si/s = 1/2, 1/2
I = -(2/4*lg2(2/4) + 2/4*lg2(2/4))
= -( -.5 -.5 ) = 1
________________________________________________________
Take B3 = {a1,a2,a3} = {4,5,8} as the 1st candidate attribute.
Aj={t:t(B3)=aj}, where a1=0100, a2=0101, a3=1000.
sij is number of samples of class, Ci, in a subset, Aj.
sij=rc(P1(ci)^P2(3)^P3(aj)) where ci in {3,7} and aj in {4,5,8}
ENTROPY based on the partition into subsets by B3 is
E(B3)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ] where
I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj|
The sij's are:
j=1 j=2 j=3
--- --- ---
0 0 2 <-- s1j
2 0 0 <-- s2j
--- --- ---
2 0 2 <- s1j+s2j
2 0 2 <- |Aj| (divisors)
0 undefined 1 <- p1j
1 undefined 0 <- p2j
(the undefined terms are dropped)
0 0 <- p1j*log2(p1j)
0 0 <- p2j*log2(p2j)
-- --- ---
0 0 <- I(s1j,s2j)=- sum of above
2 0 2 <- s1j+s2j
0 0 0 <- (s1j+s2j)*I(s1j,s2j)/4
0 <- E(B3) (sum of above)
1 <- I(s1,s2)
GAIN(B3) = 1 <- I(s1,s2) - E(B3)
Continuing with B4=A={a1..av} used to classify S into {A1..Av},
---------------------------------------------------------------
Take B4 = {a1,a2} = {11,15} as the 2nd candidate attribute.
Aj={t:t(B4)=aj}, where a1=1011, a2=1111
sij is number of samples of class, Ci, in a subset, Aj.
so sij = rc( P1(ci)^P2(3)^P4(aj) ), where ci is in {3,7}
and aj is in {11,15}.
The sij's are:
j=1 j=2
--- ---
1 1 <-- s1j
2 0 <-- s2j
--- ---
3 1 <- s1j+s2j
3 1 <- |Aj| (divisors)
.333 1 <- p1j
.667 0 <- p2j
-.528 0 <- p1j*log2(p1j)
-.390 0 <- p2j*log2(p2j)
-- ---
.918 0 <- I(s1j,s2j)=- sum of above
3 1 <- s1j+s2j
.689 0 <- (s1j+s2j)*I(s1j,s2j)/4
.689 <- E(B4) (sum of above)
1 <- I(s1,s2)
GAIN(B4) = .311 <- I(s1,s2) - E(B4)
GAIN(B3) = 1 <- I(s1,s2) - E(B3)
So B3 is the decision attribute and so forth.
Note that no database scan has been needed at all!
ID3 DTI
Bayesian Classification
Bayesian classifiers are statistical classifiers
7.4.1 Bayes Theorem
Let X be a data sample whose class label is unknown.
Let H be a hypothesis (ie, X belongs to class, C).
P(H|X) is the posterior probability of H given X.
P(H) is the prior probability of H.
Bayes Theorem:
P(H|X) = P(X|H)P(H)/P(X)
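Bayes Theorem can be checked numerically on the 16-pixel training set S from the ID3 example. In this sketch the hypothesis H is "the pixel is in class C2 (B1=0011)" and the evidence X is "B4=1011"; that choice of H and X, and the use of row counting, are illustrative assumptions.

```python
from fractions import Fraction

# Training set S in X-Y order 0,0 .. 3,3; each row is (B1, B2, B3, B4).
S = [
    (3, 7, 8, 11), (3, 3, 8, 15), (7, 3, 4, 11), (7, 2, 5, 11),
    (3, 7, 8, 11), (3, 3, 8, 11), (7, 3, 4, 11), (7, 2, 5, 11),
    (2, 11, 8, 15), (2, 11, 8, 15), (10, 10, 4, 11), (15, 10, 4, 11),
    (2, 11, 8, 15), (10, 11, 8, 15), (15, 10, 4, 11), (15, 10, 4, 11),
]

in_h = [row for row in S if row[0] == 3]      # H: class C2, B1 = 0011
in_x = [row for row in S if row[3] == 11]     # X: B4 = 1011

p_h = Fraction(len(in_h), len(S))                                   # prior P(H)
p_x = Fraction(len(in_x), len(S))                                   # P(X)
p_x_given_h = Fraction(sum(r[3] == 11 for r in in_h), len(in_h))    # P(X|H)

posterior = p_x_given_h * p_h / p_x                                 # Bayes Theorem
direct = Fraction(sum(r[0] == 3 for r in in_x), len(in_x))          # counted P(H|X)
print(posterior, direct)
```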
7.4.2 Naive Bayesian Classification
1. Each data sample is represented by feature vector, X=(x1..,xn)
depicting the measurements made on the sample from A1,..An, resp.
2. Given classes, C1,...Cm, the naive Bayesian Classifier will
predict unknown data sample, X (with no class label), belongs to
class, Cj (called the maximum posteriori hypothesis), having the
highest posterior probability, conditioned on X
( P(Cj|X) > P(Ci|X), i not j).
P(Cj|X) = P(X|Cj)P(Cj)/P(X)
3. P(X) is constant for all classes so we maximize P(X|Cj)P(Cj).
If we assume equal likelihood of classes, maximize P(X|Cj),
else P(Ci) estimated as si/s.
From the PC-cube we see that s is the overall tuple count
and si is the rootcount of DRollup[Bcube->C]i
(thus, it is rc(VPCtree[Ci]) assuming C=Bn = rc(PCn1* AND
... AND PCnm* where there Ci is m-bit string and there is
a * for each 0 bit in the string)
4. To reduce the computational complexity of calculating all
P(X|Cj)'s the naive assumption of conditional independence of
values is often made (therefore the name "Naive Bayesian"),
thus, P(X|Ci)=P(x1|Ci)*..*P(xn|Ci).
For categorical attributes, P(xk|Ci)=sixk/si where sixk= # of
training samples of class, Ci, having Ak-value xk
(PCn,Ci ^ PCk,xk, which is one AND program).
For continuous attributes, use Gaussian distribution to estimate
P(xk|Ci).
Once the P(xk|Ci)'s are estimated, the model is "trained".
Example:
Consider the training set, S, where B1 is the class label attribute
S:
B1 B2 B3 B4
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011
__C1___ __C2___ __C3___ __C4___ __C5___
P1,0010 P1,0011 P1,0111 P1,1010 P1,1111
3 4 4 2 3
0 0 3 0 4 0 0 0 0 4 0 0 0 0 1 1 0 0 0 3
P2,0010 P2,0011 P2,0111 P2,1010 P2,1011
2 4 2 4 4
0 2 0 0 2 2 0 0 2 0 0 0 0 0 0 4 0 0 4 0
s1x2=0010 s1x2=0011 s1x2=0111 s1x2=1010 s1x2=1011
0 0 0 0 1 <-- s1x2/s1
0 .5 .5 0 0 <-- s2x2/s2
.5 .5 0 0 0 <-- s3x2/s3
0 0 0 .5 .5 <-- s4x2/s4
0 0 0 1 0 <-- s5x2/s5
__C1___ __C2___ __C3___ __C4___ __C5___
P1,0010 P1,0011 P1,0111 P1,1010 P1,1111
3 4 4 2 3
0 0 3 0 4 0 0 0 0 4 0 0 0 0 1 1 0 0 0 3
P3,0100 P3,0101 P3,1000
6 2 8
0 2 0 4 0 2 0 0 4 0 4 0
s1x3=0100 s1x3=0101 s1x3=1000
0 0 1 <-- s1x3/s1
0 0 1 <-- s2x3/s2
.5 .5 0 <-- s3x3/s3
.5 0 .5 <-- s4x3/s4
1 0 0 <-- s5x3/s5
__C1___ __C2___ __C3___ __C4___ __C5___
P1,0010 P1,0011 P1,0111 P1,1010 P1,1111
3 4 4 2 3
0 0 3 0 4 0 0 0 0 4 0 0 0 0 1 1 0 0 0 3
P4,1011 P4,1111
11 5
3 4 0 4 1 0 4 0
s1x4=1011 s1x4=1111
0 1 <-- s1x4/s1
.75 .25 <-- s2x4/s2
1 0 <-- s3x4/s3
.5 .5 <-- s4x4/s4
1 0 <-- s5x4/s5
5. In order to classify an unknown sample, X, P(X|Ci)P(Ci) is
evaluated for each i, then X is assigned to the class for which
it is maximum. ( Evaluate, P(x1|Ci)*..*P(xn|Ci) * P(Ci) )
s1x2=0010 s1x2=0011 s1x2=0111 s1x2=1010 s1x2=1011
0 0 0 0 1 <-- s1x2/s1
0 .5 .5 0 0 <-- s2x2/s2
.5 .5 0 0 0 <-- s3x2/s3
0 0 0 .5 .5 <-- s4x2/s4
0 0 0 1 0 <-- s5x2/s5
s1x3=0100 s1x3=0101 s1x3=1000
0 0 1 <-- s1x3/s1
0 0 1 <-- s2x3/s2
.5 .5 0 <-- s3x3/s3
.5 0 .5 <-- s4x3/s4
1 0 0 <-- s5x3/s5
s1x4=1011 s1x4=1111
0 1 <-- s1x4/s1
.75 .25 <-- s2x4/s2
1 0 <-- s3x4/s3
.5 .5 <-- s4x4/s4
1 0 <-- s5x4/s5
sixk/si's: si/s P(X|Ci)*P(Ci)
x2 x3 x4 ------------ ---- -------------
Take X= 0011 1000 1011 0 1 0 3/16 0
1/2 1 3/4 4/16 3/32
1/2 0 1 4/16 0
0 1/2 1/2 2/16 0
0 0 1 3/16 0
So X is classified as C2.
So we see that, once the conditional probabilities, sixk/si, are
derived from the P-trees, any new sample can be classified
instantly.
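The worked classification of X = (0011, 1000, 1011) can be reproduced with exact fractions. This sketch estimates each sixk/si by counting rows of S instead of ANDing P-trees, which is an assumption made for brevity; the name naive_bayes_scores is illustrative.

```python
from fractions import Fraction

# Training set S; each row is (B1, B2, B3, B4), with B1 the class label.
S = [
    (3, 7, 8, 11), (3, 3, 8, 15), (7, 3, 4, 11), (7, 2, 5, 11),
    (3, 7, 8, 11), (3, 3, 8, 11), (7, 3, 4, 11), (7, 2, 5, 11),
    (2, 11, 8, 15), (2, 11, 8, 15), (10, 10, 4, 11), (15, 10, 4, 11),
    (2, 11, 8, 15), (10, 11, 8, 15), (15, 10, 4, 11), (15, 10, 4, 11),
]


def naive_bayes_scores(x):
    """Return P(x2|Ci)*P(x3|Ci)*P(x4|Ci)*P(Ci) for every class Ci."""
    scores = {}
    for c in sorted({row[0] for row in S}):
        members = [row for row in S if row[0] == c]
        score = Fraction(len(members), len(S))            # P(Ci) = si/s
        for k, xk in enumerate(x, start=1):               # k indexes B2, B3, B4
            matches = sum(1 for row in members if row[k] == xk)
            score *= Fraction(matches, len(members))      # sixk/si
        scores[c] = score
    return scores


scores = naive_bayes_scores((3, 8, 11))    # X = 0011 1000 1011
predicted = max(scores, key=scores.get)
for c, score in scores.items():
    print(f"class {c:04b}: {score}")
print(f"predicted class: {predicted:04b}")
```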
How effective are Naive Bayesian Classifiers?
- In theory they have low error rates in comparison to other
classifiers.
- in practice it is not always true, because the assumptions
may not be valid.
- Various empirical studies have found Naive Bayesian
Classifiers to be comparable to decision tree and neural
network classifiers in many domains.
- They also provide a theoretical justification for other
classifiers that do not explicitly use Bayes Theorem
(e.g., under certain assumptions it can be shown that NN and
curve-fitting algorithms (eg, ID3) output the "maximum
posteriori hypothesis", as does the Naive Bayesian Classifier).
7.4.3 Bayesian Belief Networks (to handle cases where the naive
assumption doesn't hold)
- The Naive Assumption of "class conditional independence"
(given the class label of a sample, the values of the
attributes are conditionally independent of one another)
which allows use of the simplifying formula:
P(X|Ci)=P(x1|Ci)*..*P(xn|Ci), when true, produces the most
accurate classifier of all.
- In practice dependencies can exist between attributes
(variables).
- In spatial datasets, one approach would be to select out
attributes that are independent and then use Naive
Bayesian Classifiers. (e.g., select RIR and leave out
G since there is correlation between them)
- However, with PC-trees we have a way to calculate P(X|Ci)
directly.
It is the AND of the tuple PC-tree for X with the value
PC-tree for Ci (noting that X is a tuple in Rel[X]
not Rel - eliminating Coord and C)
- Bayesian Belief Networks specify the joint conditional
probabilities and allow class conditional independence
to be defined between subsets of attributes (variables)
namely those subsets that are conditionally
independent of one another.
- Note that the notion of functional dependence in
normalizing relations
is a specification of conditional dependence.
A Belief Network (or Bayesian Belief Network or Bayesian
Network or Probabilistic Network) is composed of two
components,
1. an acyclic directed graph (nodes = random variables, either
actual attributes or "hidden variables" such as a medical
syndrome in medical data; edges = probabilistic dependences)
- each variable is conditionally independent of its
non-descendants, given its parents.
2. a Conditional Probability Table (CPT) for each variable,
Z, specifying all P(Z|parentZ).
7.4' Non-Naive Bayesian Classifier (New section, shortcut to
Bayesian Belief Net for spatial data with Ptrees).
We can use Bayesian Classification directly, without the
simplifying Naive assumption that
P(X|Ci)=P(x1|Ci)*..*P(xn|Ci), since we can compute the actual
P(X|Ci) directly (in fact it is a simpler program than the
above) from: TPC(X) ^ VPC(Ci).
We do not need Bayesian belief networks to estimate these numbers!
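A minimal sketch of the difference between the naive product and the direct estimate. It uses plain Python counting in place of PC-tree ANDs, so it is not the Ptree implementation. The direct estimate is read here as (number of class-Ci samples whose tuple equals X) / (number of class-Ci samples). The tiny dataset and the function names are hypothetical.

```python
def direct_likelihood(samples, x, c):
    """Estimate P(X=x | C=c) as (# class-c samples equal to x) / (# class-c samples)."""
    in_class = [attrs for label, attrs in samples if label == c]
    return sum(attrs == x for attrs in in_class) / len(in_class)

def naive_likelihood(samples, x, c):
    """Estimate P(X=x | C=c) as the product of the per-attribute P(x_k | c)."""
    in_class = [attrs for label, attrs in samples if label == c]
    p = 1.0
    for k, xk in enumerate(x):
        p *= sum(attrs[k] == xk for attrs in in_class) / len(in_class)
    return p

# Hypothetical samples: (class label, (a1, a2)); a1 and a2 are fully correlated.
samples = [("c", (1, 1)), ("c", (1, 1)), ("c", (0, 0)), ("c", (0, 0))]
print(direct_likelihood(samples, (1, 1), "c"))  # 0.5
print(naive_likelihood(samples, (1, 1), "c"))   # 0.25
```

Because the two attributes are dependent, the naive product underestimates the likelihood that the direct count gets right.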
Bayesian Classifiers
Classification by Backpropagation
- A Neural network is a set of connected input/output units
- Each connection has a weight
- In the learning phase, adjusts weights to learn to
predict class of input samples.
- Backpropagation is a particular Neural Network learning alg
- It operates on a "multilayer feed-forward network"
Multi-layer Feed-Forward Neural Network
[Figure: three fully connected layers of units, Input -> Hidden ->
Output. Inputs x1, x2, ..., xi enter the Input layer; every Input
unit i connects to every Hidden unit j with weight wij; every
Hidden unit j (output Oj) connects to every Output unit k
(output Ok) with weight wjk.]
- Inputs correspond to attributes from training samples.
- Weighted outputs of the Input units are fed to the Hidden
units (many Hidden layers?).
- Weighted outputs of last Hidden layer's units are fed to the
Output units.
- Output units emit the network's prediction for the given
samples.
- Hidden and Output units are often referred to as "neurodes"
- An n-layer NN has n layers other than the Input layer
(includes Hidden and Output).
- NN is "feed-forward" since none of the weights cycle back to
an input unit or output unit of a previous layer.
Defining a Network Topology
- Specify the number of Input units
- Specify the number of Hidden layers
- Specify the number of Hidden units in each Hidden layer
- Specify the number of Output units
- Normalizing the input values for each attribute in the
training set speeds up training.
Backpropagation
- learns by iteratively,
- processing a set of training samples,
- comparing the network's prediction for each sample with
the actual known class label.
- For each training sample, weights are modified to minimize
mean-square error between the network's prediction and the
actual class.
- These modifications are made in a "backwards" direction, from
the Output layer, through each Hidden layer, to the Input layer
- The weights will (usually) eventually converge, and the
learning process stops.
The Backpropagation Algorithm:
(1) Initialize all weights and biases in "network";
(2) while terminating condition is not satisfied {
(3) for each training sample X in "samples" {
(4) // Propagate the inputs forward:
(5) for each hidden or output layer unit j {
(6) Ij = SUM(i)[ wij*Oi ] + THETAj;
// compute the net input of unit j wrt previous layer, i
(7) Oj = 1/(1+e^(-Ij)); } // compute the output of each unit, j
(8) // Backpropagate the errors:
(9) for each unit j in the output layer
(10) Errj = Oj(1-Oj)(Tj-Oj); // compute the error
(11) for each unit j in the hidden layers, from last to
1st hidden layer
(12) Errj = Oj(1-Oj)*SUM(k)[ Errk*wjk ];
// compute the error with respect to the next higher layer, k
(13) for each weight wij in "network" {
(14) DELTAwij = (l)*Errj*Oi; // weight increment
(15) wij = wij + DELTAwij; } // weight update
(16) for each bias THETAj in "network" {
(17) DELTA(THETAj) = (l)*Errj; // bias increment
(18) THETAj = THETAj + DELTA(THETAj); } // bias update
(19) }}
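The steps above can be sketched in Python as follows. This is a minimal illustration with case updating for a network with one hidden layer, not a tuned implementation. The 2-2-1 network, its starting weights, the learning rate 0.5 and the names forward and train_sample are assumptions made for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_h, b_h, w_o, b_o):
    """Steps (4)-(7): propagate one sample through the hidden and output layers."""
    hidden = [sigmoid(sum(w_h[i][j] * x[i] for i in range(len(x))) + b_h[j])
              for j in range(len(b_h))]
    out = [sigmoid(sum(w_o[j][k] * hidden[j] for j in range(len(hidden))) + b_o[k])
           for k in range(len(b_o))]
    return hidden, out

def train_sample(x, t, w_h, b_h, w_o, b_o, lr):
    """Steps (8)-(18): backpropagate the error of one sample and update in place."""
    hidden, out = forward(x, w_h, b_h, w_o, b_o)
    # Output-layer errors: Errk = Ok(1-Ok)(Tk-Ok)
    err_o = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Hidden-layer errors use the weights as they were before this update.
    err_h = [h * (1 - h) * sum(err_o[k] * w_o[j][k] for k in range(len(out)))
             for j, h in enumerate(hidden)]
    for j in range(len(hidden)):
        for k in range(len(out)):
            w_o[j][k] += lr * err_o[k] * hidden[j]
    for k in range(len(out)):
        b_o[k] += lr * err_o[k]
    for i in range(len(x)):
        for j in range(len(hidden)):
            w_h[i][j] += lr * err_h[j] * x[i]
    for j in range(len(hidden)):
        b_h[j] += lr * err_h[j]
    return out  # prediction made before the update

# Hypothetical 2-2-1 network with fixed small starting weights.
w_h = [[0.1, -0.2], [0.3, 0.4]]   # w_h[i][j]: input i -> hidden j
b_h = [0.0, 0.0]
w_o = [[0.2], [-0.1]]             # w_o[j][k]: hidden j -> output k
b_o = [0.0]

first = train_sample([1.0, 0.0], [1.0], w_h, b_h, w_o, b_o, lr=0.5)[0]
for _ in range(200):
    last = train_sample([1.0, 0.0], [1.0], w_h, b_h, w_o, b_o, lr=0.5)[0]
print(first, last)  # the prediction moves from about 0.5 toward the target 1
```

Repeating the update on a single sample only shows that the weights move the prediction toward the target; a real run would loop over the whole training set until one of the stopping conditions holds.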
(1) The weights in the network are initialized to small
random numbers (eg, -1 to 1 or?).
Each unit has a "bias" also initialized to a small random num.
For the jth layer (as inputs to the jth layer, where the Input
layer is 0) j=1..m (m+1 layers total, including input):
| w11 w12... w1n1 |
| w21 w22... w2n1 |
| w31 w32... w3n1 |
| . | = W1
| . |
| . |
|wn01 wn02.. wn0n1|
| z(1)1 |
| z(1)2 |
| . | = Z1
| . |
| z(1)n1|
etc.
(4)-(7) Net input to Hidden/Output unit,
j: Ij=SUM(i)[wij*Oi] + zj where wij is the weight of the
connection from unit, i, in previous layer to unit, j;
Oi is the output of unit i; zj is the bias of the unit
(the bias acts as a threshold, varying the activity of the unit)
Each unit takes its net input and applies an "activation function"
- symbolizes activation of the neuron
- the logistic or sigmoid fctn (or "squashing" fctn, since it maps
a large input domain into (0,1)) is used: Given net
input, Ij, to unit j, the output of unit j
is: Oj = 1/(1+e^-Ij)
- the logistic fctn is nonlinear and differentiable,
allowing backprop algorithm to model classification
problems that are linearly inseparable
So the output, Oj, of unit-j, given
- output, Oi, from unit-i of the previous layer,
- connection weight, wij,
- bias zj, is computed as follows:
With n(j-1) units in the previous layer and nj units in layer j:
(O1 O2 .. On(j-1)) * | w11       w12  ...  w1,nj      |
                     | w21       w22  ...  w2,nj      |
                     | w31       w32  ...  w3,nj      |
                     |  .                             |
                     | wn(j-1),1      ...  wn(j-1),nj |
 + ( z1 z2 ... znj ) = ( I1 ... Inj )
and at each layer,
Oj = f(Ij) = 1 / ( 1 + e^-(SUM(i)[wij*Oi] + zj) )
We will write it using matrix notation as follows:
At layer j,
outputs from the previous layer are O(j-1)
weights are Wj
biases are Zj
net inputs are Ij
outputs are Oj (after applying the activation fctn)
O(j-1)*Wj+Zj => Ij => Oj=f(Ij)
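The matrix form of one layer can be sketched directly with nested lists. The function name layer_forward and the small 3-unit to 2-unit example are hypothetical; W has one row per previous-layer unit and one column per unit of the current layer.

```python
import math

def layer_forward(o_prev, W, Z):
    """O(j-1)*Wj + Zj => Ij => Oj = f(Ij), with f the logistic function."""
    n_out = len(Z)
    net = [sum(o_prev[i] * W[i][j] for i in range(len(o_prev))) + Z[j]
           for j in range(n_out)]
    return [1.0 / (1.0 + math.exp(-v)) for v in net]

# Hypothetical layer with 3 inputs and 2 units.
W = [[0.0, 1.0],
     [0.0, -1.0],
     [0.0, 0.5]]
Z = [0.0, 0.0]
print(layer_forward([1.0, 1.0, 0.0], W, Z))  # [0.5, 0.5]
```

Both net inputs are 0 here, so both outputs are f(0) = 0.5. Chaining layer_forward once per layer gives the whole forward pass.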
(8)-(18) The error is propagated backwards once the output of
the Output layer is computed, to update weights and biases.
For Output unit, Om, Errm=Om(1-Om)(Tm-Om); Om is "actual"
output and Tm is the "true" output based on the known class
label of the training sample
Noting that for f(x)= 1/(1+e^-x), f'(x)= e^-x / (1+e^-x)^2
and f(x)*(1-f(x))= 1/(1+e^-x) * (1 - 1/(1+e^-x)) =
e^-x / (1+e^-x)^2
we see that we are just using a straight line assumption as to
the input DELTA value since DELTA(x) = y' * DELTA(y),
where y'=Om(1-Om) and DELTA(y) = (Tm-Om)
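The identity f'(x) = f(x)*(1-f(x)) for the logistic function can be checked numerically. This is a small sketch; the central-difference helper f_prime_numeric and the step size are choices made for the illustration.

```python
import math

def f(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def f_prime_numeric(x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, f_prime_numeric(x), f(x) * (1 - f(x)))
```

The two printed columns agree to many decimal places, which is why Om(1-Om) can stand in for the slope of the activation at the output unit.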
For a unit j in a Hidden layer, the error is computed from the
weighted sum of the errors of the units connected to j in the
next layer:
Errj = Oj(1-Oj)*SUM(k)[Errk*wjk]
where wjk=weight of connection from unit-j to a unit-k in the
next higher layer and Errk is the error of unit-k.
Weights are updated: DELTAwij = (l) * Errj * Oi
and wij = wij + DELTAwij
where l=learning rate, a constant, typically in (0,1).
- Backpropagation learns using a method of gradient descent
to search for a set of weights that can model the given
classification problem so as to minimize the mean squared
distance between the network's class prediction and the
actual class label of samples.
The learning rate helps to avoid getting stuck at a local
minimum in decision space; if too low, learning is very slow.
If too high, thrashing between suboptimals can occur.
A rule of thumb is to set the learning rate to 1/t
where t=number of iterations through the training set so far.
Biases are updated: DELTA(zj) = (l) * Errj
Here we are updating the weights and biases after presentation of
each sample (case updating). Alternatively, weight and bias
updates (DELTAs) can be accumulated in variables so that
updating can be applied after the entire training set has been
presented (epoch updating).
(one iteration through the training set is an epoch)
In theory (mathematical) epoch updating is better, yet in
practice, case updating is more common since it tends to yield
more accurate results.
(2)-(3) Training stops when either
- all DELTAwij in the previous epoch were so small as to be
below some threshold or
- the % of samples misclassified in the previous epoch is
below some threshold or
- a pre-specified number of epochs has expired.
(in practice several hundred thousand epochs may be required.)
[Figure: the same three-layer network with its units labelled:
Input units x1..xI, Hidden units X1..XJ (weights wij, biases zj)
and Output units y1..yK (weights Wjk, biases Zk).]
In matrix form:
(x1..xI)*[wij] + (z1..zJ) => f => (X1..XJ)
(X1..XJ)*[Wjk] + (Z1..ZK) => f => (y1..yK)
where [wij] is the I-by-J matrix of Input-to-Hidden weights and
[Wjk] is the J-by-K matrix of Hidden-to-Output weights.
**************************************************************
Other Classification Methods
k-nearest Neighbor Classifiers (based on learning by analogy)
- an unknown sample is assigned to the most common class among
its k nearest neighbors in n-space.
- instance based.
- lazy or "as you go" learner (by contrast to decision trees where
the classifier is constructed before new samples are considered)
- With respect to spatial data in REL organization, if B1 is the
class label attribute, what should be meant by the k-nearest
ngbrs?
- Let's assume there is one REL dataset for learning and the new
samples are separate from it (e.g., for RGBY data, take the
point of view that we use last year's dataset with RGB and Y
to train and are interested in classifying this year's RGB data
to predict the Y).
A Spatial k-nearest ngbr algorithm
Assume we have basic Ptrees for the training set.
We find the k-nearest ngbrs to a new sample, x, and then
predict the class of x to be the majority class among those
k ngbrs.
So we will find the closest k (or more) training tuples, based on
a weighted Manhattan distance on the non-class attribute values
(e.g., if B1 is the Class label attribute,
wm_dis(x,y) = SUM(i=2..n)[wi*|yi-xi|], where the wi are the
attribute weights [...]).
1. If rc(Px) >= k, done.
(class label is the one that gives the max rootcount when its
Ptree is ANDed with Px - i.e., we compute rc(Px^Pci) for each
class label, ci and assign the one that gives max rootcount.)
2. If rc(Px) < k, remove the lowest-order bit from the
highest-weight band value of x,
(we will call the resulting tuple, x also - since it is just
the original tuple x, with its Bi-value generalized one
level up the value concept hierarchy to a 7-bit value instead
of an 8-bit value).
Repeat 1 and 2 until rc(Px) >= k
(note, when we have removed the low order or 8th bit from all
of the non-class-label attributes of x, we proceed to removing
the 7th bit one attribute at a time, then the 6th bit and so
forth.)
(note, we can decide to remove several bits at a time so as to
reduce the complexity. We may get a ngbr set that has many more
than k ngbrs in it but that shouldn't be a problem. If for some
reason it seems important to get the smallest ngbr set that
qualifies (closest to k) rather than ordering the attributes by
"importance" we could calculate the ngbr set size for each
attribute during each "bit removal pass" and pick the one that
gives us the best ngbrset... Lots of variations are possible.)
(note, while calculating the rc's above it would make sense to
have an accumulator for the rc's for each attribute values
for several of the passes (8-bit, 7-bit, ... values).
This can be done with a single scan parallel program (lots of
accumulators however). This gives us maximum flexibility in
deciding the best ngbrset. We could also be computing the
Px^Pci rootcounts during this one single scan pass).
(Note, in the event that we get through 1-bit values without the
ngbrset reaching size, k, (could that happen?
How? and if so, what could be done about it?) we could
resort to the traditional training set scan to classify that
particular new sample.)
Example:
Training Dataset (B1 is the class label attribute and k=5):
X-Y B1 B2 B3 B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
Consider the new sample: x = ---- 1011 1000 1111
The basic PC_trees in PQ-list form:
PQ11: 23 3
PQ12: 1 31 32 33
PQ13: pure
PQ14: 0 1 31 32 33
PQ21: 2 3
PQ22: 00 02
PQ23: 0 1 2 3
PQ24: 0 10 12 2
PQ31: 0 2
PQ32: 1 3
PQ33: null
PQ34: 11 13
PQ41: pure
PQ42: 01 2
PQ43: pure
PQ44: pure
Assume the weights order the bands from high-to-low B2,B3,B4
Consider the new sample: x = ---- 1011 1000 1111
C = {0010 0011 0111 1010 1111} (class labels)
The needed PQ-seq's are:
PQ1,0010: 20 21 22
PQ1,0011: 0
PQ1,0111: 1
PQ1,1010: 23 30
PQ1,1111: 31 32 33
PQ2,1011: 2
PQ3,1000: 0 2
PQ4,1111: 00 02 2
1. If rc(Px) >= k done. (class is s.t. rc(Px^Pci) is max.)
Px: 2 rc(Px)=4 NOT >= k=5
2. If rc(Px) < k, remove the loworder bit from the next band value...
Take off the loworder bit from B2:
PQ2,101 : 2 3 (gives the same result for Px so do same with B3)
PQ3,100 : 0 2 (gives the same result)
PQ4,111 : 00 02 2 (gives the same result)
next loworder bit removal:
PQ2,10 : 2 3 (gives the same result for Px so do same with B3)
PQ3,10 : 0 2 (gives the same result)
PQ4,11 : 00 02 2 (gives the same result)
next loworder bit removal:
PQ2,1 : 2 3 (gives the same result for Px so do same with B3)
PQ3,1 : 0 2 (gives the same result)
PQ4,1 : pure
------------
PQx: 2 (gives the same result)
next loworder bit removal:
PQ2,1 : pure
PQ3,1 : 0 2
PQ4,1 : pure
------------
PQx: 0 2 has rc = 8 >= 5.
Now rc(Px) >= k, so the class is the one s.t. rc(Px^Pci) is max:
PQ1,0010: 20 21 22
PQx: 0 2
--------------
20 21 22 rc= 3
PQ1,0011: 0
PQx: 0 2
--------------
0 rc= 4
PQ1,0111: 1
PQx: 0 2
--------------
null rc= 0
PQ1,1010: 23 30
PQx: 0 2
--------------
23 rc= 1
PQ1,1111: 31 32 33
PQx: 0 2
--------------
null rc= 0
Thus, the class label for x is 0011
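The example above can be replayed with a short sketch that uses plain Python lists of tuples instead of Ptrees, so the rootcounts become simple counts. It assumes one low-order bit is removed per step, cycling through the bands in weight order B2, B3, B4 and rechecking the ngbr-set size after each removal. The names TRAIN, matches and classify are hypothetical.

```python
from collections import Counter

# Training rows from the example: (B1 class label, B2, B3, B4), 4-bit values,
# listed in X-Y order 0,0 0,1 ... 3,3.
TRAIN = [
    (0b0011, 0b0111, 0b1000, 0b1011), (0b0011, 0b0011, 0b1000, 0b1111),
    (0b0111, 0b0011, 0b0100, 0b1011), (0b0111, 0b0010, 0b0101, 0b1011),
    (0b0011, 0b0111, 0b1000, 0b1011), (0b0011, 0b0011, 0b1000, 0b1011),
    (0b0111, 0b0011, 0b0100, 0b1011), (0b0111, 0b0010, 0b0101, 0b1011),
    (0b0010, 0b1011, 0b1000, 0b1111), (0b0010, 0b1011, 0b1000, 0b1111),
    (0b1010, 0b1010, 0b0100, 0b1011), (0b1111, 0b1010, 0b0100, 0b1011),
    (0b0010, 0b1011, 0b1000, 0b1111), (0b1010, 0b1011, 0b1000, 0b1111),
    (0b1111, 0b1010, 0b0100, 0b1011), (0b1111, 0b1010, 0b0100, 0b1011),
]

def matches(row, x, nbits, width=4):
    """True if every non-class value agrees with x on its nbits high-order bits."""
    return all((v >> (width - n)) == (xv >> (width - n))
               for v, xv, n in zip(row[1:], x, nbits))

def classify(x, k, train=TRAIN, width=4):
    """Grow the ngbr set by dropping low-order bits until it has >= k tuples."""
    nbits = [width] * len(x)          # bits still compared, per band
    turn = 0
    while True:
        ngbrs = [row for row in train if matches(row, x, nbits, width)]
        if len(ngbrs) >= k or not any(nbits):
            break
        nbits[turn % len(x)] -= 1     # next band in weight order loses a bit
        turn += 1
    votes = Counter(row[0] for row in ngbrs)
    return votes.most_common(1)[0][0], len(ngbrs)

label, size = classify((0b1011, 0b1000, 0b1111), k=5)
print(format(label, "04b"), size)  # 0011 8
```

As in the worked example, the ngbr set stays at 4 tuples until the last bit of B2 is dropped, at which point it has 8 tuples and the majority class is 0011.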
*********Notes *********************************************
Problems?
1. Consider the problem of a ngbr that is positioned in
large numbers right near a quadrant boundary, so that
it has ngbrs which don't appear to be ngbrs in the Ptree.
(This may not be a problem, since we are dealing
with whole values. The real problem is 2.)
2. For a value like, 0111, note that it is at the edge,
not the middle of the intervals,
[0110,0111],
[0100,0111],
which are the ngbrhds used when removing the first 2
low-order bits (note that the same thing happens with 1111
but it is inevitably at the edge of all ngbrhds,
while 0111 is not.)
Better, 1st "nbrd" be [0110,1000] = [6,8]
2nd, [0100,1001] = [4,9].
Or even better, 1st: [0110,1000] = [6,8],
2nd: [0101,1001] = [5,9].
[0111,0111] [7,7]
[0110,1000] [6,8]
[0101,1001] [5,9]
Question:
In removing a loworder bit, can it be accomplished by ORing? e.g.,
To get:
PQ2,101 : 2 3
can we just OR:
PQ2,1011: 2
PQ24': 11 13 3
OR---------------
11 13 2 3
where PQ24' is the comp of PQ24: 0 10 12 2
apparently not!
Note:
P2,101 = P2,1010 v P2,1011 = (P2,101 ^ P24') v P2,1011
= (P2,101 v P2,1011) ^ (P24' v P2,1011)
= P2,101 ^ (P24' v P2,1011)
It's clear there is no way to construct, e.g., P21 from P2,11
and the basic, P22 or its comp, since P2,11 is 1 where both
P21 and P22 are 1. Knowing where P22 is 1 doesn't tell me
which of the pixels for which P22 is 0 have a 1 in P21.
That is to say, a 0 in P2,11 where P22 is also 0 tells me
nothing about P21 at those pixels (it could be 0 or 1).
Therefore we need to retain all info on a subcube as we go to
avoid further ANDing:
So, to answer the classification question (using our "nearest
ngbr" like approach) we need to have filled in a cube:
Consider, again, the new sample: x = ---- 1011 1000 1111 and
C = {0010 0011 0111 1010 1111} (class labels).
We need the cube bounded by all of 5 B1-values
(the entire B1 dimension) and
P2,
1011 [11,11]
101 1100 [10,12]
1001 1101 [ 9,13]
1 111 [ 8,14]
0111 1111 [ 7,15]
0110 [ 6,15]
0101 [ 5,15]
01 [ 4,15]
0011 [ 3,15]
0010 [ 2,15]
0001 [ 1,15]
Of these, the ones we see in the basic algorithm
(removal of loworder bits) are:
P2,
1011 [11,11]
101 [10,11] not seen above
10 [ 8,11] not seen above
1 [ 8,15] not seen above
If we also include those needed to balance the intervals:
P2,
1011 [11,11]
101 1100 [10,12] not seen above
10 111 [ 8,14] not seen above
1 [ 8,15] not seen above
How far out should the intervals go before we stop
(and consider the new sample an outlier - at which point
we take the majority class of the ngbr-set, if there is
one, else take the majority class of the sample space)?
- One thought would stop after Radius =
ROOF{SQRT(|S|) / ROOF[SQRT(|S|/k)]}
Rationale: If the samples are uniformly distributed with
duplicity=k, each duplicity group would be at an
intersection of grid lines with the above spacing.
- SQRT(|S|/k) = SQRT(16/5) = 1.79, roof is 2.
so R = SQRT(16) / 2 = 4 / 2 = 2
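The radius formula is easy to check with a few lines of Python (ROOF is the ceiling function; the name stop_radius is hypothetical):

```python
import math

def stop_radius(n_samples, k):
    """Radius = ROOF{ SQRT(|S|) / ROOF[ SQRT(|S|/k) ] }."""
    return math.ceil(math.sqrt(n_samples) / math.ceil(math.sqrt(n_samples / k)))

print(stop_radius(16, 5))  # 2
```

For the 16-pixel example with k=5 this reproduces R = 2.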
P2,
1011 [11,11]
101 1100 [10,12]
10 111 [ 8,14]
P3,
1000 [ 8, 8]
0111 1001 [ 7, 9]
011 1010 [ 6,10]
P4,
1111 [15,15]
111 1111 [14,15]
11 1111 [13,15]
then once the ngbrset is found, AND with the following
to classify
P1,0010
0011
0111
1010
1111
Misc Classification
Cluster Analysis
What is Cluster Analysis?
- The process of grouping a set of physical or abstract objects
into classes of similar objects.
A Powerpoint presentation on Clustering
What are some typical applications of clustering?
- Business: Help marketers discover distinct groups in their
customer bases and characterize customer groups
- Biology: Derive plant and animal taxonomies;
Categorize genes with similar functionality;
Gain insight into structures inherent in populations;
- Land use: Identify areas of similar land use in an earth
observation database;
- Insurance: Identify groups of houses in a city according
to house type, value, geographic location;
Identify groups of policy holders with a high average claim cost
- WWW: Classify documents on the web for information discovery.
- Data Mining: Stand-alone tool to gain insight into the
distribution of data,
Observe characteristics of each cluster;
- Data clustering includes contributions from
data mining,
statistics,
machine learning,
spatial databases,
biology,
marketing.
- As a branch of statistics, cluster analysis has been
studied extensively
- focused mainly on distance-based cluster analysis,
- tools are built into S-Plus, SPSS, SAS
- In machine learning, cluster analysis is an example of
unsupervised learning.
- does not rely on predefined classes or class-labeled training
examples
- form of learning by observation, rather than learning by
example,
- In conceptual clustering, a group of objects
forms a class only if it is describable by a concept.
- differs from conventional clustering, which measures
similarity, based on distance.
- Conceptual Clustering consists of two components:
(1) it discovers the appropriate classes
(2) it forms descriptions for each class, as in classification
- The guideline of striving for high intraclass similarity
and low interclass similarity still applies.
- In data mining, cluster analysis has focused on:
- finding methods for efficient and effective cluster
analysis in large databases,
- scalability of clustering methods,
- effectiveness of methods for clustering complex shapes
and types of data,
- high-dimensional clustering techniques,
- methods for clustering mixed numerical and categorical data
in large DBs
- The following are typical requirements of clustering
in data mining:
- Scalability (cluster larger datasets in reasonable time?)
- Deal with different types of attributes
(binary, categorical (nominal), ordinal, mixtures)
- Discovery of clusters with arbitrary shape
(Euclidean or Manhattan distance produces
spherical clusters with similar size and density)
What about arbitrary shapes?
("spatial clustering" deals with shaped clusters)
- Minimal requirements for domain knowledge to determine
input parameters: (parameters such as # of clusters may
be hard to determine a priori) Cluster algorithm should
be robust and insensitive wrt the inputs
- Ability to deal with noisy data
(a "noise" point is also called an "outlier")
(insensitivity to outliers, missing data,
unknowns, erroneous data..)
- Insensitivity to order of input records
- High dimensionality (human eye not good at judging
cluster quality for more than 3 dimensions)
- Constraint-based clustering (there may be side
constraints as well as a "distance")
("spatial clustering" deals with side conditions also)
- Interpretability and usability (users expect interpretable,
comprehensible and usable results)
- Study of clustering methods proceeds as follows:
present general categorization of clustering methods,
study each method in detail, including methods based on
partitioning,
hierarchical,
density-based,
grid-based,
model-based
examine high-dimensionality and do outlier analysis
Types of Data That Occur in Cluster Analysis
(and how to preprocess them)
- Suppose dataset to be clustered contains n objects,
which may represent persons, houses, documents,
countries, pixels, genes...
- Clustering algorithms typically operate on either a
"data matrix" or a
"dissimilarity matrix"
- Data Matrix (or object-by-variable structure or "two mode"):
- represents n objects (persons?) by p variables
(measurements or attributes)
(such as height, weight, gender, race...).
The structure is in the form of a relational table or
n-by-p matrix (n objects, p variables)
- in our spatial DM the objects are pixels and the
attributes are bands.
x11 x12 ... x1p
x21 x22 ... x2p
. . .
: : :
xn1 xn2 ... xnp
- This is the relational or table view of the data,
R(K,A1,...,Ap) where K is a key id attribute to identify
objects uniquely and each Ai is a column in the Matrix.
- in spatial DM this is just the REL organization in which
each row is a tuple corresponding to a particular pixel.
- Dissimilarity Matrix
(or object-by-object structure or "one mode"):
- Stores a collection of proximities available for all
pairs of n objects. Often represented by an n-by-n table:
0
d(2,1) 0
. .
: :
d(n,1) d(n,2)... 0
- Where d(i,j) is measured difference or dissimilarity
between objects i & j.
- d(i,j) is a non-neg number close to 0 when objects
are similar or "near".
- in our precision ag example, a distance measure might be:
the distance between two tuples, t and t', is
|2*t.Y + t.SM + t.N - (2*t'.Y + t'.SM + t'.N)|
This was used, essentially, by Kaushik Das in his thesis
work. We will look at his clustering software later
(based on SOMs and NNs).
- Many clustering algorithms operate on a dissimilarity
matrix, but a data matrix can be transformed into a
dissimilarity matrix.
- in the spatial setting, this can be a prohibitively
large matrix:
- For a TM scene, it is ~(40,000,000)^2 /2 or
800,000,000,000,000 cells (800 trillion!)
Interval-scaled Variables (continuous, on a linear scale:
weight,height,lat,lon,..)
- units can affect clustering results (inches versus meters)
- smaller units lead to larger ranges for that variable and
therefore larger clustering effect for that variable.
- To avoid units effects, data should be standardized
- convert to "unitless" measurements by:
(1) Calculate the mean absolute deviation for a variable
(attribute), f,
sf=(|x1f-mf|+|x2f-mf|+...+|xnf-mf|)/n
mf=mean of f = (x1f+..+xnf)/n
(2) Calculate the standardized measurement, or z-score:
zif=(xif-mf)/sf sf=mean absolute deviation
(dis from mean is not squared)
(3) median absolute deviation...
- Once standardized, similarity/dissimilarity calculated based
on "distance":
(1) Euclidean: d(i,j)=SQRT(|xi1-xj1|^2 +..+ |xip-xjp|^2)
(2) Manhattan: d(i,j)= (|xi1-xj1| +..+ |xip-xjp| )
Both are reflexive ( d(i,i)=0 ),
symmetric ( d(i,j)=d(j,i) ) and
subtransitive ( satisfy triangle inequality
d(i,j) <= d(i,h)+d(h,j) )
(3) Minkowski (generalization of both: q=1 gives Manhattan,
q=2 gives Euclidean):
d(i,j)=(|xi1-xj1|^q +..+ |xip-xjp|^q)^(1/q)
(4) weighted Minkowski:
d(i,j)=(w1*|xi1-xj1|^q +..+ wp*|xip-xjp|^q)^(1/q)
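A minimal sketch of the standardization step and the distance family above. The function names standardize and minkowski are hypothetical, and the sketch assumes the variable is not constant (otherwise sf would be 0).

```python
def standardize(values):
    """z-scores using the mean absolute deviation sf (deviations are not squared)."""
    n = len(values)
    mf = sum(values) / n
    sf = sum(abs(v - mf) for v in values) / n   # assumes sf > 0
    return [(v - mf) / sf for v in values]

def minkowski(a, b, q, weights=None):
    """Weighted Minkowski distance; q=1 is Manhattan, q=2 is Euclidean."""
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w * abs(x - y) ** q for w, x, y in zip(weights, a, b)) ** (1.0 / q)

print(standardize([1.0, 2.0, 3.0]))
print(minkowski((0, 0), (3, 4), q=1))  # 7.0
print(minkowski((0, 0), (3, 4), q=2))  # 5.0
```

For the values 1, 2, 3 the mean is 2 and the mean absolute deviation is 2/3, so the z-scores are about -1.5, 0 and 1.5.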
A Categorization of Major Clustering Methods
Partitioning Methods
- Given a DB of n objects (tuples),
a partitioning method constructs k
partitions of the data, where each
partition represents a cluster and k<=n
- ie, classify into k groups, that together satisfy:
(1) each group must contain >=1 object,
(2) each object must belong to exactly 1 group
(partitions are mutually exclusive and
collectively exhaustive)
(can be relaxed to a fuzzy partition)
- Given k (# partitions to construct) create initial partition,
then use iterative relocation technique that attempts to
improve partitioning by moving objects.
- General criteria for good partitioning is that same-cluster
objects are "close" and different-cluster objects are
"far apart"
- To achieve global optimality would require exhaustive
enumeration of all possibilities.
- Heuristics:
(1) k-means algorithm, where each cluster is represented by
the mean value of its objects
(2) k-medoids algorithm, where each cluster is represented by
an object near the center (center = 1st moment: minimizes
the sum of the distances from it to its cluster mates;
center = 2nd moment, etc.)
- Works well finding spherical clusters in small-medium
sized datasets.
Hierarchical Methods
- Agglomerative (bottom-up)
(starts with each object in its own cluster)
Divisive (top-down)
(starts with all objects in one cluster)
Agglomerative  step0      step1      step2      step3      step4
(AGNES)   -----+----------+----------+----------+----------+-->
               a ---.
                    +-- ab -------------------------.
               b ---'                               +-- abcde
               c ------------------------.          |
               d ---------.              +-- cde ---'
                          +-- de --------'
               e ---------'
Divisive       step4      step3      step2      step1      step0
(DIANA)   <----+----------+----------+----------+----------+---
AGNES (AGglomerative NESting) places each object in its own
cluster initially. 2 clusters are merged iteratively
according to some criterion, usually minimum cluster distance
(see options for distance between 2 clusters below).
DIANA (DIvisive ANAlysis): all objects form one cluster initially.
Clusters are split according to some principle, usually maximum
pairwise cluster distance.
In either, the user can specify the desired number of clusters as a
termination condition.
Four widely used cluster distances are:
(where |p-q| is distance between objects)
1. Minimum distance: Dmin(Ci,Cj) = min(p in Ci, q in Cj)|p-q|
2. Maximum distance: Dmax(Ci,Cj) = max(p in Ci, q in Cj)|p-q|
3. Mean distance: Dmean(Ci,Cj) = |mi - mj|
4. Average distance: Davg(Ci,Cj) = 1/(ni*nj)
SUM(p in Ci)SUM(q in Cj)|p-q|
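The four cluster distances can be written out as a short sketch, assuming objects are numeric tuples and |p-q| is the Euclidean distance. The function names are hypothetical.

```python
from itertools import product

def dist(p, q):
    """|p-q| for points given as tuples (Euclidean distance)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def mean(c):
    """Componentwise mean of the objects in cluster c."""
    return tuple(sum(col) / len(c) for col in zip(*c))

def d_min(ci, cj):
    return min(dist(p, q) for p, q in product(ci, cj))

def d_max(ci, cj):
    return max(dist(p, q) for p, q in product(ci, cj))

def d_mean(ci, cj):
    return dist(mean(ci), mean(cj))

def d_avg(ci, cj):
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

ci = [(0.0,), (1.0,)]
cj = [(3.0,), (5.0,)]
print(d_min(ci, cj), d_max(ci, cj), d_mean(ci, cj), d_avg(ci, cj))  # 2.0 5.0 3.5 3.5
```

In one dimension the mean and average distances coincide when the clusters do not overlap; in general they differ.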
- Suffer from the fact that once a split or merge is done,
it cannot be undone (result in error?).
- Improvements:
(1) perform careful analysis of object linkages at each
hierarchical clustering (CURE and Chameleon)
(2) integrate hierarchical agglomerative and iterative
relocation by
1st using a hierarchical agglomerative algorithm & then
refining the result using iterative relocation (BIRCH)
Density-based methods (non-distance based).
- continue growing the given cluster as long as the density
(# of objects or data points in the nghd)
exceeds some threshold (DBSCAN, OPTICS)
Grid-based Methods (all clustering operations are performed
on a grid structure)
- fast processing independent of # of data objects,
and dependent only on # cells in each dimension of
the quantized space. (STING, CLIQUE, WaveCluster)
Model-based Methods (hypothesize a model for each of the
clusters, and find the best fit of the data to the given model).
Should Ptrees lend themselves to grid-based clustering methods
(since the recursive quadrantization is a gridding of the space)?
Or is the gridding usually on other than the key attribute?
However, if we grid (quadrantize) on the other attributes the
resulting structure should serve the grid approach to
clustering well.
- Construct a Ptree of the Pcube?
[Figure: a 4x4x4 Pcube with axes Band1, Band2 and Band3, each
axis labelled 0=00, 1=01, 2=10, 3=11. Arrows trace the Peano
ordering of the cells: 0 -> 1 -> 2 -> 3 across Band3 and Band2
in the Band1=0 layer, continuing 4 -> 5 -> etc. in the Band1=1
layer.]
Gives a Ptree with fanout=8
(focus on the rootcounts of each tree only).
Root
  children (fanout 8):
    P(0,0,0) P(0,0,1) P(0,1,0) P(0,1,1)
    P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
  each child again has 8 children; e.g., the children of
  P(1,0,0) are:
    P(10,00,00) P(10,00,01) P(10,01,00) P(10,01,01)
    P(11,00,00) P(11,00,01) P(11,01,00) P(11,01,01)
We certainly can look for grid based clusters
in this tree, but it is LARGE in general.
If we are interested only in "dense clusters" we could place a 1 in
a node only if the octant has more than, e.g., twice its share
(i.e., at depth-1: more than 1/4 of total count) etc.
- then we have a Boolean tree which should identify clusters.
- how about a 1-bit iff the octant has more than its share??
(compression not as good?)
Partitioning Methods (more detail)
- Given a database with n objects (tuples) and k=# of clusters
to form, a partition algorithm organizes objects into k
partitions, where each partition represents a cluster.
- The clusters are formed to optimize an objective
partitioning criterion, often called a "similarity function",
(e.g., distance) so that objects within a cluster are similar
and objects of different clusters are dissimilar in terms of
the database attributes.
Classical Partitioning Methods: k-means and k-medoids
The most well-known and commonly used partitioning methods are
these.
Algorithm: (k-means: based on the mean value of the objects in
the cluster)
Input: The number of clusters, k, and a database containing
n objects.
Output: A set of k clusters that minimize the squared-error
criterion.
Method:
(1) arbitrarily choose k objects as the initial cluster centers.
(2) repeat
(3) (re)assign each object to the cluster to which the object
is most similar
based on the mean value of the objects in the cluster;
( using E=SUM(i=1..k)[ SUM(p in Ci)[|p-mi|^2]]
where mi=mean of Ci )
(4) update the cluster means, i.e., calculate the mean value
of the objects for each cluster;
(5) until no change;
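The method above can be sketched in Python as follows. This is a minimal illustration, not a P-tree version: it takes the first k objects as the "arbitrarily chosen" initial centers, uses squared Euclidean distance, and leaves the center of an empty cluster unchanged. The name k_means and the max_iter safeguard are assumptions made for the example.

```python
def k_means(objects, k, max_iter=100):
    """Plain k-means on tuples; the first k objects are the initial centers."""
    centers = [tuple(o) for o in objects[:k]]
    assignment = None
    for _ in range(max_iter):
        # Step (3): (re)assign each object to the nearest cluster mean.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(o, centers[c])))
            for o in objects
        ]
        if new_assignment == assignment:   # Step (5): until no change
            break
        assignment = new_assignment
        # Step (4): update the cluster means (an empty cluster keeps its center).
        for c in range(k):
            members = [o for o, a in zip(objects, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    clusters = [[o for o, a in zip(objects, assignment) if a == c] for c in range(k)]
    return clusters, centers

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters, centers = k_means(points, 2)
print(clusters)  # [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
print(centers)   # [(0.0, 0.5), (10.0, 10.5)]
```

Even with both initial centers taken from the same natural group, the relocation steps separate the two groups after two reassignments. Different initial choices can lead to different local optima of the squared-error criterion E.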
P-trees might be used in applying k-means-CPM as follows.
Data:
X-Y B1 B2 B3 B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
If we consider only 4-bit values and the corresponding P-trees:
P1,0000 P1,0100 P1,1000 P1,1100 P1,0010 P1,0110 P1,1010 P1,1110
0 0 0 0 3 0 2 0
0 0 3 0 0 0 1 1
3 3 0
P1,0001 P1,0101 P1,1001 P1,1101 P1,0011 P1,0111 P1,1011 P1,1111
0 0 0 0 4 4 0 3
4 0 0 0 0 4 0 0 0 0 0 3
--------------