DATA MINING 

 is unstructured querying.




The whole point of a database is to residualize relationships among
data items for enterprise use.

Relationships, as studied by ER diagrams and other tools for modeling data
(Data Engineering) are described using relations (or tables).

The relation, R(A1,..An) defined on domains D1,...,Dn is a degree=n
    relationship among values in the n domains

Any relationship can be diagrammed (or pictured) using a graph.

The graph of the relation(ship), R(A1,...,An) is an n-partite undirected graph

	in which the n-way hyper-edges interconnect values from D1,...,Dn.

	This hypergraph is difficult to draw usefully on a 2-D plane
	(sheet of paper), and therefore is seldom attempted.

	However, for a degree=2 relationship, R(A1,A2), drawing the bipartite
	graph is very helpful in understanding and studying the relationship.

		The bipartite graph is often called an x-y scatter plot, where
		x is a variable on D1 and y is a variable on D2 and
  		each plot point represents a related pair (edge in the graph)




The relational model is a horizontal model in which the focus is on the edges

        (horizontal data structure listing the nodes involved in that edge plus,
         possibly, node labels and/or edge labels).




A Vertical Model focuses on the nodes.

        E.g., for each D1-entity-node, the D1-centric vertical model
              associates (or bit maps) the set of D2-entity-nodes
              that are related to that D1-node.

        And,  for each D2-entity-node, the D2-centric vertical model
              associates (or bit maps) the set of D1-entity-nodes
              that are related to that D2-node.
       






Market basket data (MBR)




Data is organized vertically as a

TRANSACTION TABLE with 2 attributes:       T(Tid, Itemset)


   - A transaction is a customer transaction at a cash register.

      - Each is given an identifier, Tid.

      - Itemset is the set of items in the customer's "basket".
     

Note that tuples in T are not "flat" (each associated itemset is a "set")
 That can be problematic for analysis, so typically
 a transformation is made to the dual, the Boolean or Bitmap model:



Market Basket Data, the Boolean or Bitmapped Model:


    Boolean Transaction Table: BT(Tid, Item-1, Item-2,... Item-n)

        Tid is a transaction identifier

        Each Item-i column is a Boolean column, so each transaction's row is
        a bit vector which indicates which items that transaction relates to,
        by turning on (to 1) only those bit positions corresponding to
        related items.

 	Clearly, we don't want to have to specify the correspondence (map)
        between bit positions and items anew for every column.  Therefore we
        do this mapping once and for all using a Domain Vector Table or DVT.
 
	The DVT need not map the entire domain but only the
        Extant Domain (of all currently existing values)
       

        So a 1-bit means that item is in the market basket and a
             0-bit means that item is not in the market basket.



Again, in any bipartite relationship between two entities, T and I
    (eg, Customer Transactions and purchasable items in Market Basket Research),

there are always two vertical models,

        one in which we focus on D1-entity nodes
        (and list or map, for each, the set of related D2-entity-nodes)

        the other in which we focus on D2-entity nodes
        (and list or map, for each, the set of related D1-entity-nodes)


Thus in MBR, one has the dual vertical model, I(Iid, TransSet)
        which, for each item, lists or bitmaps the transactions involving it.
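
A minimal Python sketch of the three models just described: the set-valued
table T, the Boolean table BT (with its Domain Vector Table), and the dual
I(Iid, TransSet).  The baskets and item names are made up for illustration.

```python
# Sketch: T(Tid, Itemset) -> BT (Boolean model) and the dual I(Iid, TransSet).
# The sample baskets are hypothetical.
T = {
    1: {"milk", "bread"},
    2: {"bread", "eggs"},
    3: {"milk", "bread", "eggs"},
}

# Domain Vector Table: fix the bit-position <-> item mapping once,
# over the extant domain only (items that actually occur).
dvt = sorted(set().union(*T.values()))

# BT: one bit vector per transaction; bit i is on iff item dvt[i] is in the basket.
BT = {tid: [1 if item in basket else 0 for item in dvt]
      for tid, basket in T.items()}

# Dual vertical model: for each item, the set of transactions involving it.
I = {item: {tid for tid, basket in T.items() if item in basket}
     for item in dvt}
```

Here dvt plays the role of the DVT: every bit vector in BT is read against
the same item order.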









Note that in MBR  T, BT, I and BI  usually only record existence/non-existence
of each item in market baskets, not the number of particular items


     We can think of it this way:  An item id is a UPC (Universal Product Code)
     or barcode only (identifying the "type", not the instances, of that item).

     With nano-RFID tags, ePC (electronic Product Code) will be used, which not
     only distinguishes the type but also the instance of an item (like VINs
     for autos).  When RFID item identification becomes ubiquitous, we will
     need to analyze by ePC, not just UPC.





Much research still needs to be done on analyzing data
where the number of each item (the counts) is important (e.g., ePC tagged items)




We can treat that situation by identifying instances of items and using T or BT
above, or by using

   COUNT TRANSACTION TABLE: CT(Tid, Item-1, Item-2, ..., Item-n),

         where values are the counts of each (UPC id-ed) item in that trans.
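
As a small illustration, the count table can be built with a counter per
basket.  This is only a sketch: the scanned baskets below are hypothetical,
and a repeated name stands for several units of the same UPC.

```python
from collections import Counter

# Hypothetical scanned baskets: repeats mean several units of one UPC.
scans = {
    1: ["milk", "milk", "bread"],
    2: ["eggs"],
}
items = sorted({upc for basket in scans.values() for upc in basket})

# CT(Tid, Item-1..Item-n): counts instead of bits.
CT = {tid: [Counter(basket)[upc] for upc in items]
      for tid, basket in scans.items()}

# Thresholding the counts at >= 1 recovers the Boolean table BT.
BT = {tid: [int(c > 0) for c in row] for tid, row in CT.items()}
```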





Note: T(Tid,ItemSet) and BT(Tid,I1..In)
           don't take account of hierarchical structures of items

             - e.g., in T, milk is an item at one level,

             - milk breaks down into skim, whole, 2%..  at a finer level...

             - Work on hierarchical MBR is ongoing.  New ideas are welcome!





In the Market Basket Research (MBR) models, these Vertical Boolean Tables are

    - extremely wide         (many many columns)
    - sometimes very shallow (not many trans e.g., Cancer data - few samples)
    - extremely sparse       (mostly 0's - i.e., no customer buys most of the
                              items in the store in one shopping trip!)






Bioinformatics/genetics data  is remarkably similar.


Microarray Data Analysis (MDA)  is the analysis of the gene expression levels
 of genes spotted on glass slides (Microarrays or Gene Chips) and subjected to
 "treatments" or experiments that record the gene expression level
     before (using red dye) and
     after (using green dye).

  For each experiment and each spotted gene, the logarithm of the ratio of
  red/green is recorded in a Microarray Data Analysis Table.



MDA is usually stored as an Excel spreadsheet:

         GeneTable:    GT(Gid,E1...En)

         row    = gene
         column = experiment (plus other columns)

         value = log ratio of r/g (a Real Number)





BinaryGeneTable:  BGT(Gid,E1...En)

        is the table you get by setting a threshold
        expression ratio and recording 1 iff it is exceeded:




     Note: sometimes 3-value logic is used, in which there is an expression
           threshold and a repression threshold.  We will call that the

     TernaryGeneTable:  TGT(Gid, E1,...,En)
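
A sketch of the two thresholding steps in Python.  The log-ratio values and
both threshold values are assumptions chosen for illustration; the notes do
not fix them.

```python
# Hypothetical log-ratio values, one row per gene, one column per experiment.
GT = {
    "g1": [1.8, -0.2, -2.1],
    "g2": [0.1, 1.2, 0.4],
}
EXPRESS = 1.0    # assumed expression threshold
REPRESS = -1.0   # assumed repression threshold

# BGT: record 1 iff the expression threshold is exceeded.
BGT = {g: [1 if v > EXPRESS else 0 for v in row] for g, row in GT.items()}

def ternary(v):
    """+1 = expressed, -1 = repressed, 0 = neither."""
    if v > EXPRESS:
        return 1
    if v < REPRESS:
        return -1
    return 0

# TGT: three-valued version using both thresholds.
TGT = {g: [ternary(v) for v in row] for g, row in GT.items()}
```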




 

BinaryExperimentTable: BET(Eid, Gene1,...,GeneN) is similar to BT in MBR







****************************************************************

Formally in MBR, the BT (Boolean Transaction Table) is defined as follows:


I={i1..im} is the set of items.
                         =====

     - eg, an item for purchase in a store

     - Each item in a store is an attribute, Ai,

       - with Boolean values (1 = in a customer's "market basket
                             or shopping cart" and 0=not in it).



An itemset is a subset of I (eg, a set of items in a store)
   =======



A k-itemset is an itemset of cardinality (size) k
  =========


D={t1..tn} are transactions (eg, customer transaction at checkout)
               ============

  Each ti has an identifier and an itemset, ti = (t-id, t-itemset)



A transaction,t,

 SUPPORTS an itemset,A, if A IS CONTAINED IN t-itemset.
 ========


An Association rule is an implication A => C,
   ================       where A and C are disjoint itemsets.
                          ( A = antecedent and C = consequent)

        - rules have quality or interestingness measures,
            two of which are support and confidence:



SUPPORT OF ITEMSET B is the fraction, s, of transactions in D containing B
==================


SUPPORT OF RULE A=>C is the support of A u C
===============


CONFIDENCE OF A=>C = fraction of those trans supporting A that
==========           also support C.

    - conf(A=>C) =  supp(AuC) / supp(A) 

    - The confidence measures the strength of the implication.
 
    - Both support and confidence can be measured as %'s.

    - As a %, confidence is the conditional probability,  P(C|A)
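
The two measures can be sketched directly from their definitions.  The
function names are illustrative; the four transactions are the ones used in
EXAMPLE 1 of these notes.

```python
# Transaction database: Tid -> itemset (same data as EXAMPLE 1).
D = {
    100: {"a", "c", "d"},
    200: {"b", "c", "e"},
    300: {"a", "b", "c", "e"},
    400: {"b", "e"},
}

def supp(itemset):
    """Fraction of transactions whose itemset contains `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in D.values()) / len(D)

def conf(A, C):
    """conf(A=>C) = supp(A u C) / supp(A)."""
    return supp(set(A) | set(C)) / supp(A)
```

For instance, supp({"b","e"}) is 3/4 and conf({"b"}, {"e"}) is 1.0, since
every transaction containing b also contains e.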




FREQUENT ITEMSETs are those with support >=  a threshold, minsupp.
================      
                   - The set of frequent k-itemsets is denoted Lk.



CONFIDENT RULEs are those with confidence >= a threshold, minconf.
==============         



STRONG RULEs are confident rules with frequent support sets.
=========== 


Given a user specified minsupp and minconf, our first task is to
find ALL strong rules, called Association Rule Mining, ARM, using:
                              =======================


1. Find all frequent itemsets, Lk.  (for each k >= 1)

2. For each frequent itemset, B,
   find all strong rules supported by that frequent itemset
   (find all antecedent subsets, A  s.t.  A==>B-A  is strong)


     - the performance of ARM is largely determined by 1.




APRIORI ALGORITHM
=================

Based on the following pruning principle:
      Any subset of a frequent itemset must also be frequent.



FINDING FREQUENT ITEMSETS:

Start by finding all frequent 1-itemsets, L1.
Then candidates for L2 consist of joins of sets from L1,
 where 2 itemsets "join" if they're identical except for 1 member.


Let Ck = set of Candidate k-itemsets ( Lk-1 JOIN Lk-1 )

         1st Iteration:   Scan D for L1.
         Kth Iteration:   Create Ck as Lk-1 JOIN Lk-1
                          Scan D to count each itemset in Ck; the frequent ones form Lk
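
The iteration can be sketched as follows.  This is a compact illustration,
not a tuned implementation: the function name and the use of support counts
(rather than percentages) are choices made here, and the sample database is
the one from EXAMPLE 1.

```python
from itertools import combinations

def apriori(D, minsupp_cnt):
    """Return {frozenset: support count} for all frequent itemsets in D."""
    # 1st iteration: scan D for L1.
    counts = {}
    for t in D:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= minsupp_cnt}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Join: union two (k-1)-itemsets that differ in exactly one member.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Scan D once to count the surviving candidates.
        counts = {c: sum(c <= t for t in D) for c in Ck}
        Lk = {s: c for s, c in counts.items() if c >= minsupp_cnt}
        frequent.update(Lk)
        k += 1
    return frequent

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
freq = apriori(D, minsupp_cnt=2)
```

With a support count of 2 this yields four frequent 1-itemsets, four
frequent 2-itemsets and the single frequent 3-itemset {b,c,e}.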



GENERATING STRONG RULES:
For each B in Lk, find all strong rules, A => B-A.

If A' is a subset of A, then supp(A') >= supp(A), so
   conf(A'=>B-A') = supp(B)/supp(A') <= supp(B)/supp(A) = conf(A=>B-A)


If A is not a strong-rule-antecedent in B, then no subset A' of A is either.

    So, for each B in Lk,

1. start with largest antecedent sets (k-1 item subsets)

2. next consider only (k-2)-item antecedents for which
       every (k-1)-item SuperAntecedent produced a strong rule

   Said another way (better way?),
   Consider only those 2-item consequents for which both
                       1-item subsets were strong rule consequents
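
A sketch of this level-wise rule generation for one frequent itemset B.  The
function name and the `supp` lookup table are assumptions made for the
illustration; the support counts are those of {b,c,e} and its subsets in
EXAMPLE 1.

```python
from itertools import combinations

def strong_rules(B, supp, minconf):
    """Rules A => B-A from frequent itemset B, growing consequents level-wise.

    `supp` maps frozenset -> support (count or fraction) for B and its subsets.
    """
    B = frozenset(B)
    rules = []
    # Level 1: single-item consequents (i.e., the largest antecedents).
    consequents = [frozenset([x]) for x in B]
    size = 1
    while consequents and size < len(B):
        passed = []
        for C in consequents:
            A = B - C
            c = supp[B] / supp[A]
            if c >= minconf:
                rules.append((A, C, c))
                passed.append(C)
        # Next level: keep only (size+1)-item consequents all of whose
        # size-item subsets were consequents of strong rules.
        size += 1
        ok = set(passed)
        items = sorted(set().union(*passed)) if passed else []
        consequents = [frozenset(c) for c in combinations(items, size)
                       if all(frozenset(s) in ok
                              for s in combinations(c, size - 1))]
    return rules

supp = {frozenset("bce"): 2, frozenset("bc"): 2, frozenset("be"): 3,
        frozenset("ce"): 2, frozenset("b"): 3, frozenset("c"): 3,
        frozenset("e"): 3}
rules_80 = strong_rules("bce", supp, 0.8)
rules_60 = strong_rules("bce", supp, 0.6)
```

At minconf = 80% only the consequents {b} and {e} survive level 1, so the
only 2-item consequent tried is {b,e}, and it fails; at 60% all six rules
from {b,c,e} pass.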




SUMMARY:
   1. supp(B)    = |{t: B is a subset of t-itemset}| / |D|
   2. supp(A=>C) = supp(AuC)
   3. conf(A=>C) = supp(AuC)/supp(A)

APRIORI
   4. Scan D to find L1
   5. Form candidate 2-itemsets, C2, as L1 JOIN L1
   6. Scan D, counting the itemsets in C2, to get L2;
      ...
   7. For each Lk,
          find strong minimal consequents;
          find strong minimal superset consequents of those, etc.







EXAMPLE 1:

I = {a,b,c,d,e};   D = {100,200,300,400}

Sample transaction database:

 TID  ItemLists
 ---  ------------------------------
 100   a             c      d 
 200          b      c             e 
 300   a      b      c             e 
 400          b                    e

minsupp=50%  (itemset is frequent if >= 2 transactions support it)
minconf=60%  (rule is confident if conditional prob >= .6)

The process of finding frequent itemsets:

C1             C2             C3               C4
Iset Sup Freq  Iset Sup Freq  Iset   Sup Freq  Freq Iset gen ends. 
{a}   2  y     {a,b}  1       {b,c,e}  2  y
{b}   3  y     {a,c}  2  y  
{c}   3  y     {a,e}  1    
{d}   1        {b,c}  2  y  
{e}   3  y     {b,e}  3  y
               {c,e}  2  y



Derive association rules.  

For frequent 3-itemsets, start with 1-item consequents:      conf?
Rule1:  b^c ==> e, confidence = 100%.  =Sup{b,c,e}/Sup{b,c}    y
Rule2:  b^e ==> c, confidence = 66.7%. =Sup{b,c,e}/Sup{b,e}    y
Rule3:  c^e ==> b, confidence = 100%.  =Sup{b,c,e}/Sup{c,e}    y

Form all 2-item consequents from high-conf 1-item consequents:
Rule4:  b ==> c^e, confidence = 66.7%. =Sup{b,c,e}/Sup{b}      y
Rule5:  c ==> b^e, confidence = 66.7%. =Sup{b,c,e}/Sup{c}      y
Rule6:  e ==> b^c, confidence = 66.7%. =Sup{b,c,e}/Sup{e}      y

For each frequent 2-Isets, start with 1-item consequents:
For {a,c}
Rule7:  a ==> c, confidence = 100%  = Sup{a,c}/Sup{a}          y
Rule8:  c ==> a, confidence = 66.7% = Sup{a,c}/Sup{c}          y

For {b,c}
Rule9:  b ==> c, confidence = 66.7% = Sup{b,c}/Sup{b}          y
Rule10: c ==> b, confidence = 66.7% = Sup{b,c}/Sup{c}          y

For {b,e}
Rule11: b ==> e, confidence = 100%  = Sup{b,e}/Sup{b}          y
Rule12: e ==> b, confidence = 100%  = Sup{b,e}/Sup{e}          y

For {c,e}
Rule13: c ==> e, confidence = 66.7% = Sup{c,e}/Sup{c}          y
Rule14: e ==> c, confidence = 66.7% = Sup{c,e}/Sup{e}          y

All 14 rules are high-confidence.




EXAMPLE 2:

minconf=80%, minsupp=50%:
We get the same frequent itemsets (since we have the same minsupp)

C1             C2             C3               C4
Iset Sup Freq  Iset Sup Freq  Iset   Sup Freq  Iset gen ends. 
{a}  2  y      {a,b} 1        {b,c,e} 2  y     
{b}  3  y      {a,c} 2  y  
{c}  3  y      {a,e} 1    
{d}  1         {b,c} 2  y  
{e}  3  y      {b,e} 3  y
               {c,e} 2  y



Derive association rules.
For frequent 3-itemsets, start with 1-item consequents:     conf?
Rule1: b^c ==> e, confidence = 100%.  =Sup{b,c,e}/Sup{b,c}    y
Rule2: b^e ==> c, confidence = 66.7%. =Sup{b,c,e}/Sup{b,e}     
Rule3: c^e ==> b, confidence = 100%.  =Sup{b,c,e}/Sup{c,e}    y

then all 2-item consequents from high-conf 1-item consequents:
Rule5:  c ==> b^e,  confidence = 66.7%. =Sup{b,c,e}/Sup{c}       




For each frequent 2-Isets, start with 1-item consequents:
For {a,c}
Rule7:  a ==> c, confidence = 100%  = Sup{a,c}/Sup{a}         y
Rule8:  c ==> a, confidence = 66.7% = Sup{a,c}/Sup{c}                

For {b,c}
Rule9:  b ==> c, confidence = 66.7% = Sup{b,c}/Sup{b}                
Rule10: c ==> b, confidence = 66.7% = Sup{b,c}/Sup{c}                

For {b,e}
Rule11: b ==> e, confidence = 100%  = Sup{b,e}/Sup{b}         y
Rule12: e ==> b, confidence = 100%  = Sup{b,e}/Sup{e}         y

For {c,e}
Rule13: c ==> e, confidence = 66.7% = Sup{c,e}/Sup{c}                
Rule14: e ==> c, confidence = 66.7% = Sup{c,e}/Sup{e}                

Only Rules 1,3,7,11,12 are high-confidence.



EXAMPLE 3:    mconf=80%, msup=70%

 TId  Items 
 ---   -----------------------------
 100   a             c      d 
 200          b      c             e 
 300   a      b      c             e 
 400          b                    e

We get new frequent itemsets.

Cand_1-Isets        Cand_2-Isets       Cand_3-Isets is empty.
Iset Sup  Freq      Iset  Sup  Freq    Freq Iset generation
{a}    2            {b,c}   2          ends.
{b}    3  y         {b,e}   3  y  
{c}    3  y         {c,e}   2    
{d}    1                             
{e}    3  y                           

Derive association rules.

For frequent 2-itemsets, start with 1-item consequents:      conf?
Rule1:  b  => e,  confidence = 100%.  =Sup{b,e}/Sup{b}        y
Rule2:  e  => b,  confidence = 100%.  =Sup{b,e}/Sup{e}        y

Rules 1,2 are high-confidence.



EXAMPLE 4:    mconf=80%, msup=80%

We get new frequent itemsets.
 TId  Items 
 ---   -----------------------------
 100   a             c      d 
 200          b      c             e 
 300   a      b      c             e 
 400          b                    e

Cand_1-Isets        Cand_2-Isets is empty.
Iset Sup  Freq      Freq Iset generation
{a}    2            ends.
{b}    3            
{c}    3           
{d}    1                             
{e}    3                              

Derive association rules:  there are no frequent itemsets, so there are no rules.  Done.

These examples should demonstrate how much the pruning rules
simplify the cases with higher support and confidence.
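
The four examples can be cross-checked by brute force.  The sketch below does
not use Apriori pruning at all; it simply tests every itemset and every
antecedent/consequent split, which is only feasible because the database is
tiny.  The function name is illustrative.

```python
from itertools import combinations

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
ITEMS = sorted(set().union(*D))

def supp(s):
    """Fraction of transactions containing every item of s."""
    return sum(set(s) <= t for t in D) / len(D)

def count_strong_rules(minsupp, minconf):
    """Brute force: test every itemset and every antecedent/consequent split."""
    n = 0
    for k in range(2, len(ITEMS) + 1):
        for B in combinations(ITEMS, k):
            if supp(B) < minsupp:
                continue
            for j in range(1, k):
                for A in combinations(B, j):
                    if supp(B) / supp(A) >= minconf:
                        n += 1
    return n

# (minsupp, minconf) settings of EXAMPLES 1 through 4.
results = {(s, c): count_strong_rules(s, c)
           for s, c in [(0.5, 0.6), (0.5, 0.8), (0.7, 0.8), (0.8, 0.8)]}
```

The counts come out as 14, 5, 2 and 0 strong rules, in agreement with the
four examples.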




HASH-BASED techniques (hashing itemset counts)
=====================

  - To reduce the size of Ck for k > 1 (especially k=2): while
      scanning D to determine which itemsets in Ck are to be in Lk,
      also create a hash table of counts of (k+1)-itemsets.


Example:

Take the item universe, I = {I1,I2,I3,I4,I5}, and a transaction database, D, as follows:

Tid   T-itemset
----  --------------
T100 | I1  I2  I5 
T200 | I2  I4 
T300 | I2  I3 
T400 | I1  I2  I4
T500 | I1  I3        
T600 | I2  I3
T700 | I1  I3
T800 | I1  I2  I3  I5
T900 | I1  I2  I3 
T1000| I1  I4

minsupp_cnt = 6     


While finding frequent 1-itemsets, create a count histogram
of the form:

Itemset Support
------- -------
{I1}          
{I2}          
{I3}          
{I4}        
{I5}          

Also create hash table by hashing (Ix,Iy) using
        H2(x,y)= ( x*5 + y )MOD7

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |     |     |     |     |     |     |     |
`-----------+-----+-----+-----+-----+-----+-----+-----'

buck_content|     |     |     |     |     |     |     |

 (Note that the bucket_content is just included to aid the reader)



Starting scan at:   T100 | I1  I2  I5 

Itemset Support
------- -------
{I1}     1    
{I2}     1    
{I3}          
{I4}        
{I5}     1    

H2(1,2)= (1*5+2=7)MOD7 = 0
H2(1,5)= (1*5+5=10)MOD7= 3
H2(2,5)= (2*5+5=15)MOD7= 1

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   1 |   1 |     |   1 |     |     |     |
`-----------+-----+-----+-----+-----+-----+-----+-----'

buck_content|I1 I2|I2 I5|     |I1 I5|     |     |     |
            |     |     |     |     |     |     |     |



Continuing scan at:      T200 | I2  I4 

Itemset Support
------- -------
{I1}     1    
{I2}     2    
{I3}          
{I4}     1  
{I5}     1    

H2(2,4)= (2*5+4=14)MOD7= 0

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   2 |   1 |     |   1 |     |     |     |
`-----------+-----+-----+-----+-----+-----+-----+-----'

buck_content|I1 I2|I2 I5|     |I1 I5|     |     |     |
            |I2 I4|     |     |     |     |     |     |



Continuing scan at:    T300 | I2  I3 

H2(2,3)= (2*5+3=13)MOD7= 6

Itemset Support
------- -------
{I1}     1    
{I2}     3    
{I3}     1    
{I4}     1  
{I5}     1    

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   2 |   1 |     |   1 |     |     |   1 |
`-----------+-----+-----+-----+-----+-----+-----+-----'

buck_content|I1 I2|I2 I5|     |I1 I5|     |     |I2 I3|
            |I2 I4|     |     |     |     |     |     |




Continuing scan at:    T400 | I1  I2  I4

H2(1,2)= (1*5+2=7)MOD7= 0
H2(1,4)= (1*5+4=9)MOD7= 2
H2(2,4)= (2*5+4=14)MOD7=0

Itemset Support
------- -------
{I1}     2    
{I2}     4    
{I3}     1    
{I4}     2  
{I5}     1    

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   4 |   1 |   1 |   1 |     |     |   1 |
`-----------+-----+-----+-----+-----+-----+-----+-----'

buck_content|I1 I2|I2 I5|I1 I4|I1 I5|     |     |I2 I3|
            |I2 I4|     |     |     |     |     |     |
            |I1 I2|     |     |     |     |     |     |
            |I2 I4|     |     |     |     |     |     |


Continuing scan at:    T500 | I1  I3        

H2(1,3)= (1*5+3=8)MOD7= 1

Itemset Support
------- -------
{I1}     3    
{I2}     4    
{I3}     2    
{I4}     2  
{I5}     1    

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   4 |   2 |   1 |   1 |     |     |   1 |
`-----------+-----+-----+-----+-----+-----+-----+-----'

buck_content|I1 I2|I2 I5|I1 I4|I1 I5|     |     |I2 I3|
            |I2 I4|I1 I3|     |     |     |     |     |
            |I1 I2|     |     |     |     |     |     |
            |I2 I4|     |     |     |     |     |     |



Continuing scan at:    T600 | I2  I3

H2(2,3)= (2*5+3=13)MOD7= 6

Itemset Support
------- -------
{I1}     3    
{I2}     5    
{I3}     3    
{I4}     2  
{I5}     1    

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   4 |   2 |   1 |   1 |     |     |   2 |
`-----------+-----+-----+-----+-----+-----+-----+-----'

buck_content|I1 I2|I2 I5|I1 I4|I1 I5|     |     |I2 I3|
            |I2 I4|I1 I3|     |     |     |     |I2 I3|
            |I1 I2|     |     |     |     |     |     |
            |I2 I4|     |     |     |     |     |     |


Continuing scan at:    T700 | I1  I3

H2(1,3)= (1*5+3=8)MOD7= 1

Itemset Support
------- -------
{I1}     4    
{I2}     5    
{I3}     4    
{I4}     2  
{I5}     1    

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   4 |   3 |   1 |   1 |     |     |   2 |
`-----------+-----+-----+-----+-----+-----+-----+-----'

buck_content|I1 I2|I2 I5|I1 I4|I1 I5|     |     |I2 I3|
(discontinue|I2 I4|I1 I3|     |     |     |     |I2 I3|
showing     |I1 I2|I1 I3|     |     |     |     |     |
buck_content)I2 I4



Continuing scan at:   T800 | I1  I2  I3  I5

H2(1,2)= (1*5+2=7)MOD7= 0
H2(1,3)= (1*5+3=8)MOD7= 1
H2(1,5)= (1*5+5=10)MOD7=3
H2(2,3)= (2*5+3=13)MOD7=6
H2(2,5)= (2*5+5=15)MOD7=1
H2(3,5)= (3*5+5=20)MOD7=6

Itemset Support
------- -------
{I1}     5    
{I2}     6    
{I3}     5    
{I4}     2  
{I5}     2    

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   5 |   5 |   1 |   2 |     |     |   4 |
`-----------+-----+-----+-----+-----+-----+-----+-----'




Continuing scan at:    T900 | I1  I2  I3 

H2(1,2)= (1*5+2=7)MOD7= 0
H2(1,3)= (1*5+3=8)MOD7= 1
H2(2,3)= (2*5+3=13)MOD7=6

Itemset Support
------- -------
{I1}     6    
{I2}     7    
{I3}     6    
{I4}     2  
{I5}     2    

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   6 |   6 |   1 |   2 |     |     |   5 |
`-----------+-----+-----+-----+-----+-----+-----+-----'



Continuing scan at:    T1000| I1  I4

H2(1,4)= (1*5+4=9)MOD7= 2

Itemset Support
------- -------
{I1}     7    
{I2}     7    
{I3}     6    
{I4}     3  
{I5}     2    

.-----------------------------------------------------.
|bucket_addr|  0  |  1  |  2  |  3  |  4  |  5  |  6  |  
|-----------|-----|-----|-----|-----|-----|-----|-----|
|bucket_cnt |   6 |   6 |   2 |   2 |     |     |   5 |
`-----------+-----+-----+-----+-----+-----+-----+-----'


Since Minsup_cnt = 6,   L1 = {I1,I2,I3}

The usual C2 would be { {I1,I2}, {I1,I3}, {I2,I3} }
 but by first applying H2 we see that C2 can be pruned
 to { {I1,I2} {I1,I3} } since H2({I2,I3})=6 and that bucket
 count is only 5.
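
The whole scan can be replayed in a few lines.  This is a sketch of the
bucket-count pruning only (variable names are illustrative); items are
represented by their numbers so that H2 can be applied directly.

```python
from itertools import combinations

# T100 .. T1000, with item Ix written as the integer x.
D = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}, {1, 4},
]
MINSUPP_CNT = 6

def h2(x, y):
    """Hash function from the text, applied to item numbers with x < y."""
    return (x * 5 + y) % 7

item_cnt = {}
bucket_cnt = [0] * 7
for t in D:                       # one scan does both jobs
    for i in t:
        item_cnt[i] = item_cnt.get(i, 0) + 1
    for x, y in combinations(sorted(t), 2):
        bucket_cnt[h2(x, y)] += 1

L1 = sorted(i for i, c in item_cnt.items() if c >= MINSUPP_CNT)
C2 = list(combinations(L1, 2))
# A pair can only be frequent if its whole bucket reaches the threshold.
C2_pruned = [p for p in C2 if bucket_cnt[h2(*p)] >= MINSUPP_CNT]
```

A bucket count is an upper bound on the support of every pair hashed into
it, so pruning on it never discards a truly frequent pair; it may, however,
keep pairs that turn out not to be frequent once they are counted exactly.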




DHP:
===

The above is an introduction to the DHP
 (Direct Hashing and Pruning) methods.





DIC:
===

Dynamic Itemset Counting method begins to count
 cand 3-itemsets before completing the count of cand 2-itemsets,
 cand 4-itemsets before completing the count of cand 3-itemsets,
 etc.  This reduces the number of database scans required.




TRANSACTION REDUCTION 
=====================

   (2nd method for improving the efficiency of Apriori)

   - A transaction that does not contain any frequent k-itemsets
     cannot contain any frequent (k+1)-itemsets.
     Therefore it can be removed from all later scans.
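
A sketch of the reduction step.  The function name and the sample data are
assumptions for illustration; L2 below is what a support count of 2 gives on
this five-transaction database.

```python
def reduce_transactions(D, Lk):
    """Keep only transactions that contain at least one frequent k-itemset."""
    return [t for t in D if any(s <= t for s in Lk)]

# Hypothetical database; the last basket holds no frequent 2-itemset.
D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"},
     {"b", "e"}, {"d", "f"}]
L2 = [frozenset("ac"), frozenset("bc"), frozenset("be"), frozenset("ce")]

# Only D_next needs to be scanned when counting candidate 3-itemsets.
D_next = reduce_transactions(D, L2)
```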




PARTITIONING (partitioning the data to find candidate itemsets)
============

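 - partition D into n partitions (each with a local count threshold,
    minsup', of about minsup/n, i.e., the same relative support)

 - throw out those itemsets that don't achieve minsup' in any
    partition (an itemset frequent in D must be locally frequent
    in at least one partition); count the survivors in one more
    scan of D.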
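
A two-phase sketch of the partitioning idea, under the assumptions that each
partition fits in memory and that minsupp is given as a fraction.  The
function names are illustrative, and the local miner is a brute-force
stand-in for whatever in-memory algorithm would really be used.

```python
from itertools import combinations
from math import ceil

def local_frequent(part, min_cnt):
    """Brute-force frequent itemsets of one (small, in-memory) partition."""
    items = sorted(set().union(*part))
    found = set()
    for k in range(1, len(items) + 1):
        level = {frozenset(c) for c in combinations(items, k)
                 if sum(set(c) <= t for t in part) >= min_cnt}
        if not level:          # no frequent k-itemsets => none larger either
            break
        found |= level
    return found

def partition_mine(D, minsupp, n):
    """Globally frequent itemsets of D using n partitions (minsupp = fraction)."""
    size = ceil(len(D) / n)
    parts = [D[i:i + size] for i in range(0, len(D), size)]
    # Phase 1: collect itemsets locally frequent in at least one partition.
    candidates = set()
    for p in parts:
        candidates |= local_frequent(p, ceil(minsupp * len(p)))
    # Phase 2: one scan of all of D to count the candidates globally.
    need = ceil(minsupp * len(D))
    return {c for c in candidates if sum(c <= t for t in D) >= need}

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = partition_mine(D, minsupp=0.5, n=2)
```

Phase 1 can only over-generate: an itemset that misses the local threshold
in every partition cannot reach the global one, so phase 2 loses nothing.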




SAMPLING (mining on a subset of the given data)
========

 - Pick a sample, S;
   Look for frequent itemsets in S (may miss some)





FP-growth:
=========

Another method for improving the efficiency of finding
 frequent itemsets is the FP-growth method in which a
 complex data structure is constructed from which all
 frequent itemsets can be determined without doing
 additional database scans.


In this method we try to reduce time required to find frequent
 item sets by going right to the isolation of frequent itemsets
 without first generating candidate frequent item sets.  This
 will reduce the size and number of database scans required.

Assume a minimum support count of 2.

TID   Items 
T100 | I1  I2  I5
T200 | I2  I4 
T300 | I2  I3 
T400 | I1  I2  I4
T500 | I1  I3        
T600 | I2  I3
T700 | I1  I3
T800 | I1  I2  I3  I5
T900 | I1  I2  I3

First scan the database for frequent 1-itemsets and
 sort them in order of descending support count:
  L-order:  I2:7, I1:6, I3:6, I4:2, I5:2

Then create the root of the FP-tree and label it null:
                               (_) null

Scan the database processing items in L-order.
Create a branch in L-order for each trans (label nodes item:count)

T100 | I1  I2  I5 

Item_Header_Table          ....(_) null
Item  Cnt Link       I2:1_/                                            
----- --- ----  .---- > (_)                                              
{I2}   1-------'   I1:1_/                                    
{I1}   1----------- > (_)                      
{I3}   0             /                                        
{I4}   0       I5:1_/                                             
{I5}   1------- > (_)                                                   

To facilitate tree traversal, an Item_Header_Table (IHT) is built
 so each item is linked to its occurrences in the tree.

Continuing:   T200 | I2  I4 

Since we already have an I2:1 node linked from the root, we share it
and increment its count (always share prefixes with existing paths).

Item_Header_Table          ....(_) null
Item  Cnt Link       I2:2_/                                            
----- --- ----  .---- > (_)                                              
{I2}   2-------'   I1:1_/  \ _I4:1                           
{I1}   1----------- > (_)   (_)                
{I3}   0             /      ^                                 
{I4}   1-.     I5:1_/       :                                     
{I5}   1------- > (_)       :                                           
         :                  :                   
         `------------------'                   
                                                
Continuing:   T300 | I2  I3 
                                ____
Item_Header_Table           ...(null)    
Item  Cnt                 _/__                                          
----- ---       .--------(I2:3)                                              
{I2}   3-------'   ____/  _|__ \                          
{I1}   1----------(I1:1) (I4:1) (I3:1)      
{I3}   1-------------/------:----'                          
{I4}   1-.      ____/       :                                     
{I5}   1-------(I5:1)       :                                           
         :                  :                   
         `------------------'                   
                                                

Continuing:   T400 | I2  I1  I4
                                ____
Item_Header_Table           ...(null)    
Item  Cnt                 _/__                                          
----- ---       .--------(I2:4)                                              
{I2}   4-------'   ____/  _|__ \                          
{I1}   2----------(I1:2) (I4:1) (I3:1)      
{I3}   1-------------/-|--:--:---'                          
{I4}   2-.      ____/  |  .  .                                    
{I5}   1-.-----(I5:1) (I4:1) .                                          
         .                   .                  
         `-------------------'                   
                                                

Continuing:    T500 | I1  I3        

                   .........................
Item_Header_Table  :        ...(null)....  :
Item  Cnt          :      _/__           \_:__                          
----- -- ..........:.....(I2:4)          (I1:1)                              
{I2}   4.:         :___/  _\__ \          _\__            
{I1}   3..........(I1:2) (I4:1) (I3:1)---(I3:1)
{I3}   2............/.\...:..:...'                          
{I4}   2..      ___/   \__:  :                                    
{I5}   1.:.....(I5:1) (I4:1) :                                          
         :...................:                  
                                                

Continuing:     T600 | I2  I3
                   .........................
Item_Header_Table  :        ...(null)....  :
Item  Cnt          :      _/__           \_:__                          
----- -- ..........:.....(I2:5)          (I1:1)                              
{I2}   5.:         :___/  _\__ \          _\__            
{I1}   3..........(I1:2) (I4:1) (I3:2)---(I3:1)
{I3}   3............/.\...:..:...'                          
{I4}   2..      ___/   \__:  :                                    
{I5}   1.:.....(I5:1) (I4:1) :                                          
         :...................:                  
                                                
                                                

Continuing:   T700 | I1  I3
                   .........................
Item_Header_Table  :        ...(null)....  :
Item  Cnt          :      _/__           \_:__                          
----- -- ..........:.....(I2:5)          (I1:2)                              
{I2}   5.:         :___/  _\__ \          _\__            
{I1}   4..........(I1:2) (I4:1) (I3:2)---(I3:2)
{I3}   4............/.\...:..:...'                          
{I4}   2..      ___/   \__:  :                                    
{I5}   1.:.....(I5:1) (I4:1) :                                          
         :...................:                  
                                                
                                                
                                                

Continuing:   T800 | I2  I1  I3  I5
                   .........................
Item_Header_Table  :        ...(null)....  :
Item  Cnt          :     __/____         \_:__                          
----- -- ..........:....(__I2:6_)        (I1:2)                              
{I2}   6.:         :___/  _\____ \          _\__            
{I1}   5..........(I1:3) (_I4:1_) (I3:2)   (I3:2)
{I3}   5............/.\.\:......:...' `.....'  :           
{I4}   2..      ___/ __\_:\     :              :                  
{I5}   2.:....(I5:1)(I4:1)(I3:1):              :                           
         :.....:............\..::              :   
               :             \ :...............:
               :            __\_                
               :           (I5:1)
               :.............:
                                                

Continuing:    T900 | I2  I1  I3 
                   .........................
Item_Header_Table  :        ...(null)....  :
Item  Cnt          :     __/____         \_:__                          
----- -- ..........:....(__I2:7_)        (I1:2)                              
{I2}   7.:         :___/  _\____ \          _\__            
{I1}   6..........(I1:4) (_I4:1_) (I3:2)   (I3:2)
{I3}   6............/.\.\:......:...' `.....'  :           
{I4}   2..      ___/ __\_:\     :              :                  
{I5}   2.:....(I5:1)(I4:1)(I3:2):              :                           
         :.....:............\..::              :   
               :             \ :...............:
               :            __\_                
               :           (I5:1)
               :.............:
                                                


The completed FP-tree holds everything needed to find the frequent
patterns of any length, so no further database scans are necessary.

   - tremendous time savings!

   - only two database scans necessary and then extensive
     processing of the FP-tree.
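
The two scans can be sketched as below.  This covers tree construction only,
not the mining of the tree; the class and function names are illustrative,
ties in support are broken by item name (an assumption that happens to
reproduce the L-order above), and the data are the nine transactions
T100-T900 that the walkthrough inserts.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(D, min_cnt):
    # Scan 1: support counts, then the L-order (descending support).
    cnt = Counter(i for t in D for i in t)
    order = [i for i, c in sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0]))
             if c >= min_cnt]
    rank = {i: r for r, i in enumerate(order)}
    root = Node(None, None)
    header = {i: [] for i in order}      # item -> its nodes in the tree
    # Scan 2: insert each transaction in L-order, sharing existing prefixes.
    for t in D:
        node = root
        for i in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(i)
            if child is None:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)
            child.count += 1
            node = child
    return root, header, order

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
root, header, order = build_fp_tree(D, min_cnt=2)
```

On these nine transactions the root gets two children, I2:7 and I1:2,
matching the last tree drawn above; the header lists play the role of the
Item_Header_Table links.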




Mining Distance-based Association Rules
=======================================

- The previous section described quantitative ARs (QARs), where
  quantitative attributes are discretized initially by binning
  and the intervals are then combined.

- Such an approach may not capture the semantics of interval data,
  since it ignores the distances between data values.

- A distance-based AR method captures the semantics of interval
  data while allowing for approximation in data values.

- A 2-phase algorithm can be used to mine distance-based ARs.

- The FIRST PHASE employs clustering to find intervals or clusters.

- A density threshold and a frequency threshold are required of
  a cluster (its members must be close together and numerous).
       
- The SECOND PHASE obtains distance-based ARs by searching for
  groups of clusters that occur frequently together.

- To conclude that Ac => Cc, we want the antecedent-cluster, Ac,
  when projected onto the consequent-space to be within the
  consequent cluster, Cc.



ARM and Correlation Analysis
============================

- most ARM methods employ a support-confidence framework to
  weed out uninteresting or misleading rules.

- even strong ARs can be uninteresting or misleading.

- methods of statistical independence and correlation analysis
  can help weed them out.

   - Example:



MISLEADING RULES and REDUNDANT RULES
====================================

In the MBR (market basket) case, consider T={tea} and C={coffee}, with |D|=100 transactions.


MISLEADING:
        coffee NOTcoffee|total
        .---------------|----   Conf(T=>C)=supp(TuC)/supp(T)=20/25= .8
   tea  |   20     5    |  25   Conf(D=>C)=supp(C)=
NOTtea  |   70     5    |  75                 |C|/|D|=90/100=  .9
  ------|---------------|----   So the rule T=>C is misleading:
  total |   90    10    | 100   tea buyers buy coffee LESS often than average.


REDUNDANT:
         coffee  NOTcoffee| tot
        .-----------------|----    Conf(T=>C)=20/22=   .9091
   tea  |   20       2    | 22     Conf(D=>C)=90/100=  .9000
NOTtea  |   70       8    | 78
  ------|-----------------|----    Within .0091 of each other,
  total |   90      10    | 100    so T=>C is redundant given D=>C.
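
One common way to quantify this comparison is the ratio Conf(T=>C)/supp(C), usually called lift: a value below 1 signals a misleading rule and a value close to 1 a redundant one. The sketch below recomputes both tables; the function name rule_stats is hypothetical, and lift is a standard measure named here for illustration rather than one the notes define.

```python
def rule_stats(n_both, n_antecedent, n_consequent, n_total):
    """Support, confidence, baseline supp(C) and lift for a rule T => C."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    baseline = n_consequent / n_total  # Conf(D => C) = supp(C)
    lift = confidence / baseline       # < 1: negative correlation, ~1: independence
    return support, confidence, baseline, lift


# (tea-and-coffee count, tea count) for the two contingency tables above;
# coffee count is 90 and |D| is 100 in both.
TABLES = {"MISLEADING": (20, 25), "REDUNDANT": (20, 22)}

for name, (both, tea) in TABLES.items():
    supp, conf, base, lift = rule_stats(both, tea, 90, 100)
    print(f"{name}: supp={supp:.2f} conf={conf:.4f} supp(C)={base:.2f} lift={lift:.4f}")
```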




Text Association Rule Mining:
----------------------------

Given an alphabet, A, the

nWordUniverse, Wn = A u AA u AAA u ... u A^n,   where A^n = AA..A (n factors)

and, e.g., AA = {ab | a and b are distinct and belong to A}.


Let W be a subset of U(i=1..n)Wi for some n  (the DICTIONARY)


Let S be a subset of U(i=1..m)Wi some m      (the SENTENCES)
      - m is usually bigger than n

We can do ARM on the Universe of "Items",        W,  and
                                 "Transactions", S,  as above.


  - A HIGH CONFIDENCE RULE, A => F (A,F are disjoint WordSets)
    tells us: if all the A-words occur in a sentence then, with
    high confidence, all the F-words will also.

  - If the rule, A => F, has HIGH SUPPORT, that means all the
    A-words and all the F-words occur together in a large fraction of the sentences.
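
As a small illustration of treating sentences as transactions, the sketch below computes support and confidence of a word rule A => F. The function name support_and_confidence and the four sentences are made up for the example.

```python
def support_and_confidence(sentences, antecedent, consequent):
    """sentences: list of word sets. Returns (support, confidence) of A => F."""
    a, f = set(antecedent), set(consequent)
    n_a = sum(a <= s for s in sentences)          # sentences with all A-words
    n_af = sum((a | f) <= s for s in sentences)   # sentences with all A- and F-words
    support = n_af / len(sentences)
    confidence = n_af / n_a if n_a else 0.0
    return support, confidence


# Hypothetical SENTENCES; each sentence is reduced to its set of words.
SENTENCES = [set(s.split()) for s in [
    "data mining finds patterns",
    "data mining uses association rules",
    "association rules need support",
    "data mining is fun",
]]

supp, conf = support_and_confidence(SENTENCES, {"data"}, {"mining"})
print(f"data => mining: support={supp:.2f} confidence={conf:.2f}")
```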




There are alternate ways to deal with such text situations.







Machine Learning (ML)

is an older term for Data Mining, which included two of the three Data Mining areas

(Association Rule Mining, Classification and Clustering): CLASSIFICATION and CLUSTERING.

A still older term, Artificial Intelligence (AI), included all of these and much more.







CLASSIFICATION is the central area of the three!







Given a (large) TRAINING SET T(A1, ..., An, C) with  CLASS    C
                                               and  FEATURES  A ≡ (A1,...,An),
C-CLASSIFICATION of an unclassified
sample, (a1,...,an) is just:

           SELECT    T.C
           FROM      T

           WHERE     T.A1 = a1
           AND       T.A2 = a2
           ...
           AND       T.An = an

           GROUP BY  T.C
           ORDER BY  COUNT(*) DESC
           LIMIT     1;

(LIMIT 1 keeps only the most frequent class; some SQL dialects spell it FETCH FIRST 1 ROW ONLY.)

i.e., just a SELECTION, since C-Classification is assigning to (a1..an)
                             the most frequent C-value in the selection T[A=(a1..an)].




But, if the EQUALITY SELECTION is empty,
     then we need a FUZZY QUERY to find NEAR NEIGHBORs (NNs)
                                instead of exact matches. 

That's Nearest Neighbor Classification (NNC).




If SQL had a good Nearest Neighbor Set operator, we would be done.
But it doesn't, so NNC is essentially building a good NEAR NEIGHBOR operator.
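
A minimal sketch of that idea: try the equality selection first and fall back to a near-neighbor vote only when it is empty. Everything named here is an assumption made for illustration: the function classify, the parameter k, the use of Hamming distance as the notion of "near", and the toy TRAINING table.

```python
from collections import Counter


def hamming(a, b):
    """Number of feature positions where two tuples differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))


def classify(training, sample, k=3):
    """training: list of (feature_tuple, class_label) pairs."""
    # Equality selection, as in the SQL query: exact feature matches only.
    votes = [c for feats, c in training if feats == sample]
    if not votes:
        # Empty selection: fall back to the k nearest neighbors.
        nearest = sorted(training, key=lambda row: hamming(row[0], sample))[:k]
        votes = [c for _, c in nearest]
    # Let the selected class values vote.
    return Counter(votes).most_common(1)[0][0]


# Hypothetical training set: (color, shape, size) -> class.
TRAINING = [
    (("red", "round", "small"), "cherry"),
    (("red", "round", "small"), "cherry"),
    (("red", "round", "small"), "tomato"),
    (("green", "round", "small"), "grape"),
    (("red", "long", "large"), "pepper"),
]

print(classify(TRAINING, ("red", "round", "small")))        # exact matches vote
print(classify(TRAINING, ("green", "long", "small"), k=1))  # no exact match
```

Note that the answer for a sample with no exact match depends on k and on the distance used, which is exactly why building a good near-neighbor operator is the hard part.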







E.g.,

Medical Expert System (Ask a Nurse): Symptoms plus past diagnoses
                                     are collected into a table called CASES

For each undiagnosed new_symptoms,
CASES is searched for matches:             SELECT DIAGNOSIS
                                           FROM   CASES
                                           WHERE  CASES.SYMPTOMS = new_symptoms;
If     there is a predominant DIAGNOSIS,
Then   report it,

ElseIf there's no predominant DIAGNOSIS,
Then   Classify instead of Query, i.e.,
       find fuzzy matches (near nbrs)      SELECT DIAGNOSIS
                                           FROM   CASES
                                           WHERE  CASES.SYMPTOMS ≅ new_symptoms
Else   call your doctor in the morning

       That's exactly (Nearest Neighbor) Classification!!






CAR TALK radio show: Click and Clack the Tappet brothers have a vast
      TRAINING SET of car problems and solutions built from experience.

      They search that TRAINING SET for close matches to predict solutions
           based on previous successful cases.

      That's exactly (Nearest Neighbor) Classification!!






We all perform Nearest Neighbor Classification every day of our lives.

E.g., We learn when to apply specific programming/debugging techniques so that
      we can apply them to similar situations thereafter.






COMPUTERIZED NNC ≡ MACHINE LEARNING

                 (most clustering (which is just partitioning) is done as
                                   a simplifying prelude to classification).









Again, given a TRAINING SET, R(A1,..,An,C), with C=CLASSES and (A1..An)=FEATURES

Nearest Neighbor Classification (NNC) ≡

    selecting a set of R-tuples with similar features (to the unclassified sample)

            and then letting the corresponding class values vote.





Nearest Neighbor Classification won't work very well if
                 the vote is inconclusive (close to a tie)
                 or if similar (near) is not well defined. Then we

                 build a MODEL of the TRAINING SET
                                (at, possibly, great 1-time expense?)






When a MODEL is built first the technique is called Eager classification,

whereas

model-less methods like Nearest Neighbor are called Lazy or Sample-based.









Eager classifier models can be:

                              decision trees,
                              probabilistic models (Bayesian Classifier), 
                              Neural Networks,
                              Support Vector Machines, ...








How do you decide when an EAGER model is good enough to use?
How do you decide if a Nearest Neighbor Classifier is working well enough?




We have a TEST PHASE.

    Typically, we set aside some training tuples as a Test Set.
    (Then, of course, those Test tuples cannot be used in model building
                                    and cannot be used as nearest neighbors.)

If the classifier passes the test
(a high enough % of Test tuples are correctly
 classified by the classifier), it is accepted.











EXAMPLE 1:

Computer Ownership TRAINING SET for predicting who owns a computer:

 Customer  Age   Salary   Job           Owns Computer
       1 |  24 | 55,000 | Programmer  | yes
       2 |  58 | 94,000 | Doctor      | no 
       3 |  48 | 14,000 | Laborer     | no 
       4 |  58 | 19,000 | Domestic    | no 
       5 |  28 | 18,000 | Construction| no 

A classifier might be built from this TRAINING SET (e.g., a decision tree) as follows:

                Age < 30
                /       \
              T           F
             /             \
      Salary > 50K          No
        /        \
      T            F
     /              \
 Yes                 No              

It is easy to determine a pattern in this small dataset; for large
 datasets, however, it is impractical to construct a decision tree model by "sight".

Therefore we need a Model Building Algorithm, or training algorithm.








EXAMPLE 2:

PRECISION AG YIELD CLASSIFIER predicts YIELD of a field grid cell
 based on mid-year Blue, Green, Red, NearInfraRed reflectances from that cell.
 The TRAINING SET is R(CELL, Blue, Green, Red, NIR, YIELD) from previous year.

1st Separate out a Test Set.

2nd Build a CLASSIFIER MODEL (decision tree) from remaining TRAINING SET

3rd Test MODEL accuracy using the Test Set.  If it passes the test,

       then when an aerial photo is taken during the growing season,
            predict where low yield can be expected using the MODEL
            (then apply additional nutrients to those cells?)





TRAINING SET

 X  Y   Blue_____   Green____    Red_____    NIR_____   YIELD_
 0  0 | 0000 1001 | 1010 1111 | 0000 0110 | 1111 0101 | medium  
 0  1 | 0000 1011 | 1011 0100 | 0000 0101 | 1111 0111 | medium  
 0  2 | 0000 1011 | 1011 0101 | 0000 0100 | 1111 0111 | high   
 0  3 | 0000 0111 | 1011 0111 | 0000 0011 | 1111 1000 | high  
 0  4 | 0000 0111 | 1011 1011 | 0000 0001 | 1111 1001 | high 
 0  6 | 0000 1000 | 1011 1111 | 0000 0000 | 1111 1011 | high
 1  0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium  
 2  1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium 
 3  2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium
 4  3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high 
 5  4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high
 6  6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high    





Separate out as TEST SET

 X  Y   Blue_____   Green____    Red_____    NIR_____   YIELD_
 1  0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium  
 2  1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium 
 3  2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium
 4  3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high 
 5  4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high
 6  6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high
 
                           
                           


TRAIN a Classifier with the remainder (a decision tree)

REMAINDER of the TRAINING SET

 X  Y   Blue_____   Green____    Red_____    NIR_____   YIELD_
 0  0 | 0000 1001 | 1010 1111 | 0000 0110 | 1111 0101 | medium 
 0  1 | 0000 1011 | 1011 0100 | 0000 0101 | 1111 0111 | medium  
 0  2 | 0000 1011 | 1011 0101 | 0000 0100 | 1111 0111 | high 
 0  3 | 0000 0111 | 1011 0111 | 0000 0011 | 1111 1000 | high  
 0  4 | 0000 0111 | 1011 1011 | 0000 0001 | 1111 1001 | high   
 0  6 | 0000 1000 | 1011 1111 | 0000 0000 | 1111 1011 | high    



                    ____________________________________
                   /                   |                \
                 /                     |                  \
               /                       |                    \
             /                         |                      \
   NIR ≤ 01000000          01000000 < NIR ≤ 11110111        11110111 < NIR
 ^ Red ≥ 00100000        ^ 00100000 > Red ≥ 00000101      ^ 00000101 > Red
       /                               |                           \
     /                                 |                             \
YIELD = low                      YIELD = medium                   YIELD = high

                                             




TEST Classifier            

TEST SET

 X  Y   Blue_____   Green____   Red_____    NIR_____    YIELD_   PREDICTED YIELD 
 1  0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium  | medium
 2  1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium  | medium
 3  2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium  | medium
 4  3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high    | high
 5  4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high    | high
 6  6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high    | high

Tests out to be 100% correct (gets an A grade!).
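
The test phase can be reproduced with a few lines of Python. The sketch encodes the three branches of the tree above as written and scores them on the TEST SET. The function name predict_yield is hypothetical, and returning None for a tuple that satisfies no branch is an assumption, since the three conjunctive conditions do not cover every (Red, NIR) combination.

```python
def predict_yield(red, nir):
    """Apply the three branches of the tree above; None if no branch matches."""
    if nir <= 0b01000000 and red >= 0b00100000:
        return "low"
    if 0b01000000 < nir <= 0b11110111 and 0b00000101 <= red < 0b00100000:
        return "medium"
    if nir > 0b11110111 and red < 0b00000101:
        return "high"
    return None  # assumption: the branches are not exhaustive


# TEST SET rows as (X, Y, Red, NIR, YIELD); Blue and Green are not tested by the tree.
TEST_SET = [
    (1, 0, 0b00000111, 0b11110100, "medium"),
    (2, 1, 0b00000110, 0b11110110, "medium"),
    (3, 2, 0b00000101, 0b11110110, "medium"),
    (4, 3, 0b00000010, 0b11111000, "high"),
    (5, 4, 0b00000010, 0b11111000, "high"),
    (6, 6, 0b00000001, 0b11111010, "high"),
]

correct = sum(predict_yield(red, nir) == label for _, _, red, nir, label in TEST_SET)
accuracy = correct / len(TEST_SET)
print(f"accuracy = {accuracy:.0%}")
```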





USE Classifier Model (decision tree) to classify:

New Data: R,G,B,NIR from an aerial image taken on ~4th of July:


 X  Y   Blue_____   Green____    Red_____    NIR_____
 0  6 | 0001 1100 | 1011 1110 | 0000 0001 | 1111 1110
                              
                    ___________________ ===================  
                   /                   |                    \\ 
                 /                     |                     \\ 
               /                       |                      \\ 
             /                         |                       \\ 
   NIR ≤ 01000000          01000000 < NIR ≤ 11110111        1111 0111 < NIR
 ^ Red ≥ 00100000        ^ 00100000 > Red ≥ 00000101      ^ 0000 0101 > Red
       /                               |                          \\
     /                                 |                           \\ 
YIELD = low                      YIELD = medium                   YIELD = high


                                             
















Preparing Data for Classification


  Data Cleaning (of noise and missing values)

     Remove Noise (or reduce noise) by "smoothing"

     Fill in missing values (with most common or some statistical value)

                 NOTE: Even Noise and Missing Value management can be done by a
                       Nearest Neighbor Vote!  (called interpolation)

     Feature Selection to eliminate irrelevant attributes (e.g., in the PA example,
                 eliminate Blue, Green since they're irrelevant to the decision).









Ways of Comparing Different Classification Methods

     Predictive Accuracy (predicting the class label of new data)

     Speed (computation costs for generating and using the model)

     Robustness (does it give almost the same predictions when
                 the Training Sets are almost the same?)

     Scalability (Model construction efficiency - massive datasets)










More Detail on Some Classification Methods:






K-Nearest-Neighbor Classification   







Decision Tree Models for EAGER CLASSIFICATION:

                     each internal node is a test on a feature attribute (composite?),

                     each test outcome is assigned a link to the next level
                                   (outcome=a value or range of values or...)
                     each leaf represents a class (or distribution of classes)



Unknown samples are classified by testing their feature attributes against the tree.


The leaf arrived at holds the class prediction for that sample.

     Some branches may represent noise or outliers (and should be pruned?)


The ID3 algorithm for inducing a decision tree from training tuples is:


   1. The tree starts as a single node containing the entire TRAINING SET.

   2. If all TRAINING TUPLES have the same class, this node is a leaf. DONE.

   3. Otherwise, use a measure, information gain, as a heuristic for
      selecting the best decision attribute for that node

   4. A branch is created for each value [interval of values] of that test attribute
      and the TRAINING SET is partitioned accordingly.

   5. Recurse on steps 2, 3, 4 for each partition until a Stopping Condition is true.






Possible Stopping Conditions:

All samples for a given node belong to the same class (label with that class)

∃ no remaining candidate decision attributes (label with plurality class).

Some other stopping rule.
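
The five steps and the first two stopping conditions can be sketched as a short recursive function. This is an illustrative sketch under stated assumptions, not a full ID3: attributes are taken to be already discretized, the function names entropy, info_gain and id3 are hypothetical, and the toy data is the Computer Ownership table from EXAMPLE 1 with Age and Salary binned at 30 and 50K.

```python
import math
from collections import Counter


def entropy(labels):
    """Expected information (in bits) needed to classify a sample from labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Entropy of labels minus the weighted entropy after splitting on attr."""
    remainder = 0.0
    for value in sorted(set(r[attr] for r in rows)):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    # Step 2 / stopping condition: all samples in one class -> leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition: no candidate attributes left -> plurality class.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 3: pick the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    # Step 4: one branch per value; Step 5: recurse on each partition.
    tree = {"attr": best, "branches": {}}
    for value in sorted(set(r[best] for r in rows)):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree["branches"][value] = id3(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attrs if a != best],
        )
    return tree


# EXAMPLE 1 data, discretized (the bin boundaries are an assumption).
ROWS = [
    {"age": "<30", "salary": ">50K"},
    {"age": ">=30", "salary": ">50K"},
    {"age": ">=30", "salary": "<=50K"},
    {"age": ">=30", "salary": "<=50K"},
    {"age": "<30", "salary": "<=50K"},
]
LABELS = ["yes", "no", "no", "no", "no"]

tree = id3(ROWS, LABELS, ["age", "salary"])
print(tree)
```

On this toy data the two attributes tie on information gain, so the first one listed is chosen; a real implementation would need an explicit tie-breaking rule.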










Information Gain as an Attribute Selection Measure

           Minimizes expected number of tests needed to classify an object
           and guarantees simple tree (not necessarily the simplest)


At any stage, let

S    = the current TRAINING SUBSET, with s = |S| samples.

S[C] = {C1,...,Cm} be the distinct classes in S, and let si = |S∩Ci|.




EXPECTED INFORMATION needed to classify a sample given S as TRAINING SET is:

I(s1,...,sm) = -∑(i=1..m) pi*log2(pi),     pi = si/s = |S∩Ci|/|S|


Choosing A (with values a1,...,av) as decision attribute partitions S into
A1,...,Av, where Aj = {t in S : t(A)=aj}, and sij = |Aj∩Ci|.

The expected information still needed after that split (the ENTROPY of A) is


E(A) = ∑(j=1..v) [ (s1j+..+smj)/s * I(s1j,...,smj) ]




Gain(A) = I(s1,...,sm) - E(A)

   - expected reduction of info required to classify
     after splitting via A-values.

The algorithm above computes the information gain of each
 attribute and selects the one with the highest information gain
 as the test attribute.

Branches are created for each value of that attribute and samples
 are partitioned accordingly.



Pruning
=======

When a decision tree is built, many of the branches will reflect
 anomalies in the training data due to noise or outliers.

Tree pruning methods address this problem of "overfitting" the
 data (classifying situations that are erroneous or accidental).

Such methods typically use statistical measures to remove the
 least reliable branches, resulting in faster classification and
 an improvement in the ability of the tree to correctly classify
 independent test data.




Extracting Classification Rules from Decision Trees

One rule for each path from root to leaf.

Each (attr,value) pair along the path forms a conjunct in the antecedent.
   
Leaf holds class prediction or consequent.

May be easier for humans to understand rules.
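
A short sketch of the path-to-rule walk, using the computer-ownership tree from EXAMPLE 1. The nested-dict encoding of the tree and the function name extract_rules are assumptions made for the illustration.

```python
def extract_rules(tree, path=()):
    """Yield one (antecedent, consequent) pair per root-to-leaf path."""
    if not isinstance(tree, dict):  # leaf: holds the class prediction
        yield path, tree
        return
    for value, subtree in tree["branches"].items():
        # Each (attr, value) pair along the path adds one conjunct.
        yield from extract_rules(subtree, path + ((tree["attr"], value),))


# Hypothetical encoding of the decision tree drawn in EXAMPLE 1.
TREE = {
    "attr": "Age < 30",
    "branches": {
        "T": {"attr": "Salary > 50K", "branches": {"T": "Yes", "F": "No"}},
        "F": "No",
    },
}

for antecedent, consequent in extract_rules(TREE):
    conditions = " AND ".join(f"({attr}) = {value}" for attr, value in antecedent)
    print(f"IF {conditions} THEN Owns Computer = {consequent}")
```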

    

More on Decision Tree Induction (powerpoint Introduction)   







Note that the notion of "near" requires a distance or similarity measure to exist.
What are some of them?

Metrics (distance functions on feature space)   
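
Three commonly used metrics on a numeric feature space are the Manhattan (L1), Euclidean (L2) and Max (L-infinity) distances. The sketch below is illustrative only; the function names are hypothetical and the two points are made up.

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length numeric feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def manhattan(x, y):
    return minkowski(x, y, 1)


def euclidean(x, y):
    return minkowski(x, y, 2)


def max_distance(x, y):
    """L_infinity distance: the largest single-coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


p, q = (0, 0), (3, 4)  # hypothetical feature vectors
print(manhattan(p, q), euclidean(p, q), max_distance(p, q))
```

Which of these defines "near" changes which tuples get to vote, so the choice of metric is part of the classifier.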







The example:

Training Data:


Band B1:      Band B2:      Band B3:      Band B4:
 3  3  7  7    7  3  3  2    8  8  4  5   11 15 11 11
 3  3  7  7    7  3  3  2    8  8  4  5   11 11 11 11
 2  2 10 15   11 11 10 10    8  8  4  4   15 15 11 11
 2 10 15 15   11 11 10 10    8  8  4  4   15 15 11 11

S:  
X-Y  B1   B2   B3   B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011


Suppose that B1 is the class label attribute (e.g., Yield)
Then the class labels are 2, 3, 7, 10, 15 (C1,..,C5).

We need to know the count of the number of pixels (rows in
 the table above) that contain each value in each attribute.

We also need to know the count of pixels that contain pairs of
 values, one from a descriptive attribute and the other from the
 class label attribute.

Moreover we may wish to focus on only a portion of the dataset
 (some part of the field) before making those count calculations.

The Ptree structure is perfect for providing those counts.


B11  B12  B13  B14
0000 0011 1111 1111
0000 0011 1111 1111
0011 0001 1111 0001
0111 0011 1111 0011

BASIC_PTREES_band1___________________
P1,1      P1,2      P1,3      P1,4
5         7         16        11
0 0 1 4   0 4 0 3             4 4 0 3
    3           0                   0  <-where "different" bit is

VALUE_PTREES_band1___________________ (2-bit precision, 3, etc):
P1(00)    P1(01)    P1(10)    P1(11)
7         4         2         3    
4 0 3 0   0 4 0 0   0 0 1 1   0 0 0 3
    3                   3 0         0

P1(000) P1(010) P1(100) P1(110) P1(001)  P1(011)  P1(101)  P1(111)
0       0       0       0       7        4        2        3      
                                4 0 3 0  0 4 0 0  0 0 1 1  0 0 0 3 
                                    3                 3 0        0

P1(0000 P1(0100 P1(1000 P1(1100 P1(0010  P1(0110  P1(1010  P1(1110
0       0       0       0       3        0        2        0      
                                0 0 3 0           0 0 1 1        
                                    3                 3 0      
P1(0001 P1(0101 P1(1001 P1(1101 P1(0011  P1(0111  P1(1011  P1(1111
0       0       0       0       4        4        0        3      
                                4 0 0 0  0 4 0 0           0 0 0 3 
                                                                 0 


B21  B22  B23  B24
0000 1000 1111 1110 
0000 1000 1111 1110    
1111 0000 1111 1100  
1111 0000 1111 1100 
                   
BASIC_PTREES_band2___________________
P2,1      P2,2      P2,3        P2,4
8         2         16          10
0 0 4 4   2 0 0 0               4 2 4 0
          02                      02        <-positions of the
                                              two 1-bits
VALUE_PTREES_band2___________________
P2(00)    P2(01)    P2(10)      P2(11)
6         2         8           0      
2 4 0 0   2 0 0 0   0 0 4 4               
13        02

P2(000  P2(010  P2(100  P2(110 P2(001   P2(011   P2(101   P2(111
0       0       0       0      6        2        8        0      
                               2 4 0 0  2 0 0 0  0 0 4 4          
                               13       02

P2(0000 P2(0100 P2(1000 P2(1100 P2(0010  P2(0110  P2(1010  P2(1110
0       0       0       0       2        0        4        0      
                                0 2 0 0           0 0 0 4         
                                  13            
P2(0001 P2(0101 P2(1001 P2(1101 P2(0011  P2(0111  P2(1011  P2(1111
0       0       0       0       4        2        4        0      
                                2 2 0 0  2 0 0 0  0 0 4 0         
                                1302     02

B31  B32  B33  B34
1100 0011 0000 0001                           
1100 0011 0000 0001                         
1100 0011 0000 0000                        
1100 0011 0000 0000                       

BASIC_PTREES_band3___________________
P3,1      P3,2      P3,3      P3,4
8         8         0         2 
4 0 4 0   0 4 0 4             0 2 0 0
                                13

VALUE_PTREES_band3___________________
P3(00)    P3(01)    P3(10)    P3(11)
0         8         8         0      
          0 4 0 4   4 0 4 0               

P3(000) P3(010)   P3(100)  P3(110) P3(001) P3(011) P3(101) P3(111)
0       8         8        0       0       0       0       0      
        0 4 0 4   4 0 4 0                                                 

P3(0000 P3(0100   P3(1000  P3(1100 P3(0010 P3(0110 P3(1010 P3(1110
0       6         8        0       0       0       0       0      
        0 2 0 4   4 0 4 0                                                  
          02                                                               
P3(0001 P3(0101  P3(1001 P3(1101 P3(0011 P3(0111 P3(1011 P3(1111
0       2        0       0       0       0       0       0      
        0 2 0 0                                                            
          13


B41  B42  B43  B44
1111 0100 1111 1111         
1111 0000 1111 1111          
1111 1100 1111 1111            
1111 1100 1111 1111             

BASIC_PTREES_band4___________________
P4,1      P4,2      P4,3      P4,4
16        5         16        16
          1 0 4 0                    
          1                         

VALUE_PTREES_band4___________________
P4(00     P4(01     P4(10     P4(11
0         0         11        5      
                    3 4 0 4   1 0 4 0     
                    1         1

P4(000  P4(010  P4(100  P4(110  P4(001  P4(011  P4(101   P4(111
0       0       0       0       0       0       11       5      
                                                3 4 0 4  1 0 4 0
                                                1        1

P4(0000 P4(0100 P4(1000 P4(1100 P4(0010 P4(0110 P4(1010 P4(1110
0       0       0       0       0       0       0       0      
                                                                             
P4(0001 P4(0101 P4(1001 P4(1101 P4(0011 P4(0111 P4(1011   P4(1111
0       0       0       0       0       0       11        5      
                                                3 4 0 4   1 0 4 0
                                                1         1



Suppose we take this relation as training set (4-bit values).
Let B1 be the class label attribute.
Then the classes are:
                      { C1,C2,C3,C4,C5 } =
                      {  2, 3, 7,10,15 } where Ci={ci}.

The ID3 alg for inducing a decision tree from training samples:
S:  
X-Y  B1   B2   B3   B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011

1. Tree starts as one node representing the training samples, S.

2. If all samples are in same class (same B1-value)
   then S becomes a leaf with that class label.     [Not true!]

3. Else, use entropy-based, "information gain" as a heuristic for
   selecting the first decision attribute.


Take B2 = {a1,a2,a3,a4,a5} = { 2, 3, 7,10,11 }
           as the first candidate attribute.

        Aj={t:t(B2)=aj}, where a1=0010, a2=0011, a3=0111,
                               a4=1010, a5=1011.

        sij is number of samples of class, Ci, in a subset, Aj.

     so sij = rc( P1(ci)^P2(aj) ), where ci is in {2,3,7,10,15}
                                   and   aj is in {2,3,7,10,11}.

             ++---------+----------+----------+----------+--------
             || P2(2)   | P2(3)    | P2(7)    |P2(10)    |P2(11)      
             || 2       |  4       |  2       |  4       | 4
--.----------|| 0 2 0 0 |  2 2 0 0 |  2 0 0 0 |  0 0 0 4 | 0 0 4 0
ci|  P1(ci)  ||   13    |  1302    |  02      |          |    
==+==========++=========+==========+==========+==========+========
 2|  3       || 0       |  0       |  0       |  0       | 3      
  |  0 0 3 0 ||         |          |          |          | 0 0 3 0
  |      3   ||         |          |          |          |     3  
--+----------++---------+----------+----------+----------+--------
 3|  4       || 0       |  2       |  2       |  0       | 0      
  |  4 0 0 0 ||         |  2 0 0 0 |  2 0 0 0 |          |        
  |          ||         |  13      |  02      |          |        
--+----------++---------+----------+----------+----------+--------
 7|  4       || 2       |  2       |  0       |  0       | 0      
  |  0 4 0 0 || 0 2 0 0 |  0 2 0 0 |          |          |        
  |          ||   13    |    02    |          |          |        
--+----------++---------+----------+----------+----------+--------
10|  2       || 0       |  0       |  0       |  1       | 1      
  |  0 0 1 1 ||         |          |          |  0 0 0 1 | 0 0 1 0
  |      3 0 ||         |          |          |        0 |     3  
--+----------++---------+----------+----------+----------+--------
15|  3       || 0       |  0       |  0       |  3       | 0      
  |  0 0 0 3 ||         |          |          |  0 0 0 3 |        
  |        0 ||         |          |          |        0 |        
--+----------++---------+----------+----------+----------+--------
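
The counts sij = rc( P1(ci)^P2(aj) ) in the table above can be reproduced with plain bitmaps. In this sketch each value P-tree is replaced by an uncompressed bit vector (one bit per pixel, in the row order of S), so it shows only the AND-then-count idea and not the quadrant compression of real P-trees. The names value_bitmap, root_count and s_ij are hypothetical.

```python
# Rows of S in raster order (0,0), (0,1), ..., (3,3), as (B1, B2, B3, B4).
S = [
    (3, 7, 8, 11), (3, 3, 8, 15), (7, 3, 4, 11), (7, 2, 5, 11),
    (3, 7, 8, 11), (3, 3, 8, 11), (7, 3, 4, 11), (7, 2, 5, 11),
    (2, 11, 8, 15), (2, 11, 8, 15), (10, 10, 4, 11), (15, 10, 4, 11),
    (2, 11, 8, 15), (10, 11, 8, 15), (15, 10, 4, 11), (15, 10, 4, 11),
]


def value_bitmap(band, value):
    """Flat stand-in for the value P-tree P_band(value): bit i is set if pixel i has that value."""
    bits = 0
    for pos, row in enumerate(S):
        if row[band - 1] == value:
            bits |= 1 << pos
    return bits


def root_count(bitmap):
    """Number of 1-bits, i.e. the root count of the corresponding P-tree."""
    return bin(bitmap).count("1")


def s_ij(class_value, band, attr_value):
    """Pixels with B1 = class_value AND B<band> = attr_value."""
    return root_count(value_bitmap(1, class_value) & value_bitmap(band, attr_value))


CLASSES = [2, 3, 7, 10, 15]
B2_VALUES = [2, 3, 7, 10, 11]
for c in CLASSES:
    print(c, root_count(value_bitmap(1, c)), [s_ij(c, 2, a) for a in B2_VALUES])
```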


EXPECTED INFO needed to classify the sample is:

I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],

m  = 5
s  = 16
si = 3,4,4,2,3   (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16

I  = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
     +3/16*lg2(3/16))

   = -(  -.453          -.5             -.5             -.375
         -.453       )

   = -( -2.281)      =       2.281    



ENTROPY based on the partition into subsets by B2 is

E(B2)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ]   where

Ij = I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj| 
           
Since m=5, the sij's are:
j=1        j=2        j=3        j=4        j=5  
---        ---        ---        ---        ---
0          0          0          0          3       <-- s1j
0          2          2          0          0       <-- s2j
2          2          0          0          0       <-- s3j
0          0          0          1          1       <-- s4j
0          0          0          3          0       <-- s5j
---        ---        ---        ---        ---
2          4          2          4          4       <- s1j+..+s5j



j=1        j=2        j=3        j=4        j=5  
---        ---        ---        ---        ---
2          4          2          4          4       <- |Aj|

where the |Aj|'s are the rootcounts of the P2(aj)'s.



Therefore,
j=1        j=2        j=3        j=4        j=5  
---        ---        ---        ---        ---
0          0          0          0          .75    <-  p1j
0          .5         1          0          0      <-  p2j
1          .5         0          0          0      <-  p3j
0          0          0          .25        .25    <-  p4j
0          0          0          .75        0      <-  p5j

and

j=1        j=2        j=3        j=4        j=5  
---        ---        ---        ---        ---
0*         0          0          0          -.311 <- p1j*log2(p1j)
0          -.5        0          0          0     <- p2j*log2(p2j)
0          -.5        0          0          0     <- p3j*log2(p3j)
0          0          0          -.5        -.5   <- p4j*log2(p4j)
0          0          0          -.311      0     <- p5j*log2(p5j)
--         ---        ---        -----      ----
0          1          0          .811       .811  <- I(s1j..s5j) = -(column sum)
2          4          2          4          4     <- s1j+..+s5j

so that,

0          .25        0          .203       .203  (s1j+..+s5j)*
                                                   I(s1j..s5j)/16

                                           .656  E(B2)
                                          2.281  I(s1..sm)
                            GAIN(B2) =    1.625  I(s1..sm)-E(B2)


NOTE: ONE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)

Footnote * (If pij = 0, why is pij*log2(pij) = 0?
            Hint: L'Hospital's Rule)



Continuing with B3
---------------------------------------------------------------

Take B3 = {a1,a2,a3} = {4,5,8} as the 2nd candidate attribute.

        Aj={t:t(B3)=aj}, where a1=0100, a2=0101, a3=1000.

        sij is number of samples of class, Ci, in a subset, Aj.

     so sij = rc( P1(ci)^P3(aj) ), where ci is in {2,3,7,10,15}
                                   and   aj is in {4,5,8}.

             ++---------+----------+----------+
             || P3(4)   | P3(5)    | P3(8)    |
             || 6       |  2       |  8       |
-------------|| 0 2 0 4 |  0 2 0 0 |  4 0 4 0 |
ci|  P1(ci)  ||   02    |    13    |          |
==+==========++=========+==========+==========+
 2|  3       || 0       |  0       |  3       |
  |  0 0 3 0 ||         |          |  0 0 3 0 |
  |      3   ||         |          |      3   |
--+----------++---------+----------+----------+
 3|  4       || 0       |  0       |  4       |
  |  4 0 0 0 ||         |          |  4 0 0 0 |
  |          ||         |          |          |
--+----------++---------+----------+----------+
 7|  4       || 2       |  2       |  0       |
  |  0 4 0 0 || 0 2 0 0 |  0 2 0 0 |          |
  |          ||   02    |    13    |          |
--+----------++---------+----------+----------+
10|  2       || 1       |  0       |  1       |
  |  0 0 1 1 || 0 0 0 1 |          |  0 0 1 0 |
  |      3 0 ||       0 |          |      3   |
--+----------++---------+----------+----------+
15|  3       || 3       |  0       |  0       |
  |  0 0 0 3 || 0 0 0 3 |          |          |
  |        0 ||       0 |          |          |
--+----------++---------+----------+----------+


EXPECTED INFO needed to classify the sample is the same as above:

I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],

m  = 5
s  = 16
si = 3,4,4,2,3   (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16

I  = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
     +3/16*lg2(3/16))

   = -(  -.453          -.5             -.5             -.375
         -.453       )

   = -( -2.281)      =       2.281    



ENTROPY based on the partition into subsets by B3 is





E(B3)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ]   where

    I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj| 
           
The sij's are:
j=1       j=2        j=3
---       ---        ---
0          0          3      <-- s1j
0          0          4      <-- s2j
2          2          0      <-- s3j
1          0          1      <-- s4j
3          0          0      <-- s5j
---        ---        ---        
6          2          8      <- s1j+..+s5j

6          2          8      <- |Aj| (divisors)

0          0          .375   <-  p1j
0          0          .5     <-  p2j
.333       1          0      <-  p3j
.167       0          .125   <-  p4j
.5         0          0      <-  p5j

0          0          -.531  <- p1j*log2(p1j)
0          0          -.5    <- p2j*log2(p2j)
-.528      0          0      <- p3j*log2(p3j)
-.431      0          -.375  <- p4j*log2(p4j)
-.5        0          0      <- p5j*log2(p5j)
---        ---        ---    
1.459      0          1.406  <- I(s1j..s5j)= - sum of above
6          2          8      <- s1j+..+s5j

.547       0          .703   <- (s1j+..+s5j)*I(s1j..s5j)/16

                     1.250  <-  E(B3) (sum of above)
                     2.281  <-  I(s1..sm)
          GAIN(B3) = 1.031  <-  I(s1..sm) - E(B3)
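
As a cross-check, the entropy and gain for B3 can be recomputed directly
from the sij counts. The sketch below is illustrative only; the helper
names info and partition_entropy are assumptions, not part of the notes.

```python
from math import log2

def info(counts):
    """I(c1..cm) for one list of class counts; zero counts contribute 0."""
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

def partition_entropy(sij):
    """E(A) = SUM_j (|Aj|/s) * I(s1j..smj); sij[i][j] = count of class i in Aj."""
    s = sum(map(sum, sij))
    columns = list(zip(*sij))          # one tuple of class counts per subset Aj
    return sum(sum(col) / s * info(col) for col in columns if sum(col) > 0)

# sij for B3 (rows: classes 2,3,7,10,15; columns: B3 = 4, 5, 8)
s_b3 = [[0, 0, 3],
        [0, 0, 4],
        [2, 2, 0],
        [1, 0, 1],
        [3, 0, 0]]
class_totals = [sum(row) for row in s_b3]      # 3, 4, 4, 2, 3
E = partition_entropy(s_b3)
gain = info(class_totals) - E
print(round(E, 3), round(gain, 3))  # 1.25 1.031
```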



Continuing with B4=A={a1..av} used to classify S into {A1..Av},
---------------------------------------------------------------

Take B4 = {a1,a2} = {11,15} as the 3rd candidate attribute.

        Aj={t:t(B4)=aj}, where a1=1011, a2=1111

        sij is number of samples of class, Ci, in a subset, Aj.

     so sij = rc( P1(ci)^P4(aj) ), where ci is in {2,3,7,10,15}
                                   and   aj is in {11,15}.

             ++---------+----------+
             || P4(11)  | P4(15)   |
             || 11      |  5       |
-------------|| 3 4 0 4 |  1 0 4 0 |
ci|  P1(ci)  || 1       |  1       | 
==+==========++=========+==========+
 2|  3       || 0       |  3       |
  |  0 0 3 0 ||         |  0 0 3 0 |
  |      3   ||         |      3   |
--+----------++---------+----------+
 3|  4       || 3       |  1       |
  |  4 0 0 0 || 3 0 0 0 |  1 0 0 0 |
  |          || 1       |  1       |
--+----------++---------+----------+
 7|  4       || 4       |  0       |
  |  0 4 0 0 || 0 4 0 0 |          |
  |          ||         |          |
--+----------++---------+----------+
10|  2       || 1       |  1       |
  |  0 0 1 1 || 0 0 0 1 |  0 0 1 0 |
  |      3 0 ||       0 |      3   |
--+----------++---------+----------+
15|  3       || 3       |  0       |
  |  0 0 0 3 || 0 0 0 3 |          |
  |        0 ||       0 |          |
--+----------++---------+----------+


EXPECTED INFO needed to classify the sample is the same as above:

I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],

m  = 5
s  = 16
si = 3,4,4,2,3   (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16

I  = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
     +3/16*lg2(3/16))

   = -(  -.453          -.5             -.5             -.375
         -.453       )

   = -( -2.281)      =       2.281    



ENTROPY based on the partition into subsets by B4 is





E(B4)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ]   where

    I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj| 
           
The sij's are:
j=1       j=2        
---       ---       
0          3          <-- s1j
3          1          <-- s2j
4          0          <-- s3j
1          1          <-- s4j
3          0          <-- s5j
---        ---        
11         5          <- s1j+..+s5j

11         5          <- |Aj| (divisors)

0          .6         <-  p1j
.273       .2         <-  p2j
.364       0          <-  p3j
.091       .2         <-  p4j
.273       0          <-  p5j

0          -.442      <- p1j*log2(p1j)
-.511      -.464      <- p2j*log2(p2j)
-.531      0          <- p3j*log2(p3j)
-.315      -.464      <- p4j*log2(p4j)
-.511      0          <- p5j*log2(p5j)
--         ---        
1.868      1.37       <- I(s1j..s5j)= - sum of above
11         5          <- s1j+..+s5j

1.284      .428       <- (s1j+..+s5j)*I(s1j..s5j)/16

               1.712  <-  E(B4) (sum of above)
               2.281  <-  I(s1..sm)
    GAIN(B4) >  .568  <-  I(s1..sm) - E(B4)
and
    GAIN(B3) = 1.031  <-  I(s1..sm) - E(B3)
    GAIN(B2) > 1.750  <-  I(s1..sm) - E(B2)


Thus we select B2 as the first level decision attribute.


NOTE: WE GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
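
To make the counting step sij = rc( P1(ci)^P4(aj) ) concrete, here is a
small sketch that uses flat Python integers as uncompressed bit vectors
in place of real P-trees (an assumption made for brevity; actual P-trees
are compressed quadrant trees). The B1 and B4 columns are those of the
16-tuple training set S used throughout this example; the names
value_bitmap and rc are illustrative.

```python
# B1 (class label) and B4 columns of the 16 training tuples.
B1 = "0011 0011 0111 0111 0011 0011 0111 0111 0010 0010 1010 1111 0010 1010 1111 1111".split()
B4 = "1011 1111 1011 1011 1011 1011 1011 1011 1111 1111 1011 1011 1111 1111 1011 1011".split()

def value_bitmap(column, value):
    """Uncompressed stand-in for a value P-tree: bit t is 1 iff tuple t has this value."""
    bits = 0
    for t, v in enumerate(column):
        if v == value:
            bits |= 1 << t
    return bits

def rc(bitmap):
    """Root count = number of 1-bits."""
    return bin(bitmap).count("1")

classes = ["0010", "0011", "0111", "1010", "1111"]
values = ["1011", "1111"]
# One AND plus one root count per (class, value) pair; no scan of the tuples.
sij = [[rc(value_bitmap(B1, c) & value_bitmap(B4, a)) for a in values] for c in classes]
print(sij)  # [[0, 3], [3, 1], [4, 0], [1, 1], [3, 0]]
```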


4. Branches are created for each value of B2 and samples are
      partitioned accordingly.  (If a partition is empty, generate a
      leaf and label it with the most common class; here C2,
      labeled with 0011, is used.  C2 and C3 tie at 4 samples each.)
                                                             
       .--- B2=0000 - > C2:0011                                       
       |--- B2=0001 - > C2:0011                              
       |--- B2=0010 - > Sample_Set_1                         
       |--- B2=0011 - > Sample_Set_2                         
       |--- B2=0100 - > C2:0011                              
       |--- B2=0101 - > C2:0011                              
       |--- B2=0110 - > C2:0011                              
  B2 --|--- B2=0111 - > Sample_Set_3                         
       |--- B2=1000 - > C2:0011                              
       |--- B2=1001 - > C2:0011                              
       |--- B2=1010 - > Sample_Set_4                         
       |--- B2=1011 - > Sample_Set_5                         
       |--- B2=1100 - > C2:0011                              
       |--- B2=1101 - > C2:0011                              
       |--- B2=1110 - > C2:0011                              
       `--- B2=1111 - > C2:0011                              

Sample_Set_1
X-Y  B1   B3   B4
0,3 0111 0101 1011
1,3 0111 0101 1011

Sample_Set_2
X-Y  B1   B3   B4
0,1 0011 1000 1111
0,2 0111 0100 1011
1,1 0011 1000 1011
1,2 0111 0100 1011

Sample_Set_3
X-Y  B1   B3   B4
0,0 0011 1000 1011
1,0 0011 1000 1011

Sample_Set_4
X-Y  B1   B3   B4
2,2 1010 0100 1011
2,3 1111 0100 1011
3,2 1111 0100 1011
3,3 1111 0100 1011

Sample_Set_5
X-Y  B1   B3   B4
2,0 0010 1000 1111
2,1 0010 1000 1111
3,0 0010 1000 1111
3,1 1010 1000 1111


NOTE WE DONT NEED TO LIST OUT THE SAMPLE_SETS IN ORDER TO CONTINUE


5. The Algorithm recurses to form decision tree for the samples at
   each partition.  Once an attribute is the decision attribute at
   a node, it is not considered further.

6. Stop when:
   a. All samples for a given node belong to the same class or
   b. no remaining attributes
         (label leaf with majority class among the samples)
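
Steps 1 through 6 can be sketched as a compact recursive procedure. The
version below works on the plain tuples of the training set rather than
on P-trees, and it creates branches only for attribute values that
actually occur (so the empty-partition leaves are not generated). The
function names and the nested-tuple tree format are assumptions made for
illustration.

```python
from collections import Counter
from math import log2

# Training set S; column 0 is the class label B1, columns 1..3 are B2..B4.
ROWS = [line.split() for line in """
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011
""".strip().splitlines()]

def info(rows):
    counts = Counter(r[0] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def split(rows, attr):
    parts = {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r)
    return parts

def gain(rows, attr):
    parts = split(rows, attr).values()
    return info(rows) - sum(len(p) / len(rows) * info(p) for p in parts)

def id3(rows, attrs):
    classes = Counter(r[0] for r in rows)
    if len(classes) == 1 or not attrs:          # stop conditions 6a / 6b
        return classes.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]      # step 5: attribute is not reused
    return (best, {v: id3(p, rest) for v, p in split(rows, best).items()})

tree = id3(ROWS, [1, 2, 3])                     # attribute indices: 1=B2, 2=B3, 3=B4
print(tree[0])           # 1, i.e. B2 is the first-level decision attribute
print(tree[1]["0011"])   # subtree for Sample_Set_2
```

The subtree printed for B2=0011 splits on attribute index 2 (B3), which
is the choice made when the algorithm recurses on Sample_Set_2.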

We note all samples belong to the same class for nodes:
   Sample_Set_1, B2=0010, have class, C3:0111.
   Sample_Set_3, B2=0111, have class, C2:0011.


NOTE: ONE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
  One can determine that these Sample_Sets contain only one
  B1 value (class label) from the Ptrees already computed:

             ++---------+----------+----------+----------+--------
             || P2(2)   | P2(3)    | P2(7)    |P2(10)    |P2(11)      
             || 2       |  4       |  2       |  4       | 4
--.----------|| 0 2 0 0 |  2 2 0 0 |  2 0 0 0 |  0 0 0 4 | 0 0 4 0
ci|  P1(ci)  ||   13    |  1302    |  02      |          |    
==+==========++=========+==========+==========+==========+========
 2|  3       || 0       |  0       |  0       |  0       | 3      
  |  0 0 3 0 ||         |          |          |          | 0 0 3 0
  |      3   ||         |          |          |          |     3  
--+----------++---------+----------+----------+----------+--------
 3|  4       || 0       |  2       |  2       |  0       | 0      
  |  4 0 0 0 ||         |  2 0 0 0 |  2 0 0 0 |          |        
  |          ||         |  13      |  02      |          |        
--+----------++---------+----------+----------+----------+--------
 7|  4       || 2       |  2       |  0       |  0       | 0      
  |  0 4 0 0 || 0 2 0 0 |  0 2 0 0 |          |          |        
  |          ||   13    |    02    |          |          |        
--+----------++---------+----------+----------+----------+--------
10|  2       || 0       |  0       |  0       |  1       | 1      
  |  0 0 1 1 ||         |          |          |  0 0 0 1 | 0 0 1 0
  |      3 0 ||         |          |          |        0 |     3  
--+----------++---------+----------+----------+----------+--------
15|  3       || 0       |  0       |  0       |  3       | 0      
  |  0 0 0 3 ||         |          |          |  0 0 0 3 |        
  |        0 ||         |          |          |        0 |        
--+----------++---------+----------+----------+----------+--------



   Thus the decision tree becomes:

       .--- B2=0000 - > C2:0011                                       
       |--- B2=0001 - > C2:0011                              
       |--- B2=0010 - > C3:0111                              
       |--- B2=0011 - > Sample_Set_2                         
       |--- B2=0100 - > C2:0011                              
       |--- B2=0101 - > C2:0011                              
       |--- B2=0110 - > C2:0011                              
  B2 --|--- B2=0111 - > C2:0011                              
       |--- B2=1000 - > C2:0011                              
       |--- B2=1001 - > C2:0011                              
       |--- B2=1010 - > Sample_Set_4                         
       |--- B2=1011 - > Sample_Set_5                         
       |--- B2=1100 - > C2:0011                              
       |--- B2=1101 - > C2:0011                              
       |--- B2=1110 - > C2:0011                              
       `--- B2=1111 - > C2:0011                              

Sample_Set_2 (for B2=0011)
X-Y  B1   B3   B4
0,1 0011 1000 1111
0,2 0111 0100 1011
1,1 0011 1000 1011
1,2 0111 0100 1011

Sample_Set_4 (for B2=1010)
X-Y  B1   B3   B4
2,2 1010 0100 1011
2,3 1111 0100 1011
3,2 1111 0100 1011
3,3 1111 0100 1011

Sample_Set_5 (for B2=1011)
X-Y  B1   B3   B4
2,0 0010 1000 1111
2,1 0010 1000 1111
3,0 0010 1000 1111
3,1 1010 1000 1111





Recursing the algorithm on Sample_Set_2 (B2=0011):


1. Subtree starts as single node, S = Sample_Set_2 (determined
   by B2=0011, so that ANDing with P2(3) gives correct counts).

2. Not all samples are in the same class (same B1-value),

3. So, use entropy-based measure, "information gain" as a
   heuristic for selecting the attribute that will best separate
   the samples into individual classes

NOTE: WE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
We don't have to rescan the training_set to form the leaf
subsample_sets.  We can just use the P-tree sets for those samples.
That solves the problem (see 1. above).


Revising from 4. onward then (and expressing SubSampleSets as
revised P-trees):

       .--- B2=0000 - > C2:0011                                       
       |--- B2=0001 - > C2:0011                              
       |--- B2=0010 - > C3:0111                              
       |--- B2=0011 - > Sample_Set_2                         
       |--- B2=0100 - > C2:0011                              
       |--- B2=0101 - > C2:0011                              
       |--- B2=0110 - > C2:0011                              
  B2 --|--- B2=0111 - > C2:0011                              
       |--- B2=1000 - > C2:0011                              
       |--- B2=1001 - > C2:0011                              
       |--- B2=1010 - > Sample_Set_4                         
       |--- B2=1011 - > Sample_Set_5                         
       |--- B2=1100 - > C2:0011                              
       |--- B2=1101 - > C2:0011                              
       |--- B2=1110 - > C2:0011                              
       `--- B2=1111 - > C2:0011                              


For Sample_Set_2 (for B2=0011=3) (only 2 classes have count>0)
-------------++-------+-------+----------++---------+---------.
 \           ||P3(4)  |P3(5)  |P3(8)     ||P4(11)   |P4(15)   |
   \         ||6      |2      |  8       ||11       |5        |
    `-------.||0 2 0 4|0 2 0 0|  4 0 4 0 ||3 4 0 4  |1 0 4 0  |
             \|  02   |  13   |          ||1        |1        |
ci|P1(ci)^P2(3)=======+=======+==========++=========+=========|
 3|  2       ||0      |0      |  2       ||1        |1        |
  |  2 0 0 0 ||       |       |  2 0 0 0 ||1 0 0 0  |1 0 0 0  |
  |  13      ||       |       |  13      ||3        |1        |
--+----------++-------+-------+----------++---------+---------|
 7|  2       ||2      |0      |  0       ||2        |0        |
  |  0 2 0 0 ||0 2 0 0|       |          ||0 2 0 0  |         |
  |    02    ||  02   |       |          ||  02     |         |
--+----------++-------+-------+----------++---------+---------

EXPECTED INFO needed to classify the sample (S is now Sample_Set_2,
so s is the number of samples in Sample_Set_2, not 16):
I = I(s1,s2) = -SUM(i=1,2)[ pi * log2(pi) ],
m  = 2      s  = 4
si = 2,2   (rootcounts for the class labels, rc(P1(ci)^P2(3))'s)
pi = si/s = 1/2, 1/2
I  = -(2/4*log2(2/4) + 2/4*log2(2/4))
   = -(  -.5           -.5          )    =    1.0

________________________________________________________

Take B3 = {a1,a2,a3} = {4,5,8} as the 1st candidate attribute.

   Aj={t:t(B3)=aj}, where a1=0100, a2=0101, a3=1000,
   sij is number of samples of class, Ci, in a subset, Aj.
   sij=rc(P1(ci)^P2(3)^P3(aj))  where   ci in {3,7}  and   aj in {4,5,8}

ENTROPY based on the partition into subsets by B3 is

E(B3)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ]   where

    I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj| 
           
The sij's are:
j=1       j=2        j=3
---       ---        ---
0          0          2      <-- s1j
2          0          0      <-- s2j
---        ---        ---        
2          0          2      <- s1j+s2j

2          0          2      <- |Aj| (divisors)

0          undefined  1      <-  p1j
1          undefined  0      <-  p2j

(the undefined terms are dropped)

0                     0      <- p1j*log2(p1j)
0                     0      <- p2j*log2(p2j)
---        ---        ---    
0                     0      <- I(s1j,s2j)= - sum of above
2          0          2      <- s1j+s2j

0          0          0      <- (s1j+s2j)*I(s1j,s2j)/4

                      0     <-  E(B3) (sum of above)
                      1.0   <-  I(s1,s2)
          GAIN(B3) =  1.0   <-  I(s1,s2) - E(B3)



Continuing with B4=A={a1..av} used to classify S into {A1..Av},
---------------------------------------------------------------



Take B4 = {a1,a2} = {11,15} as the 2nd candidate attribute.
        Aj={t:t(B4)=aj}, where a1=1011, a2=1111
        sij is number of samples of class, Ci, in a subset, Aj.
     so sij = rc( P1(ci)^P2(3)^P4(aj) ), where ci is in {3,7}
                                         and   aj is in {11,15}.

The sij's are:
j=1       j=2        
---       ---        
1          1              <-- s1j
2          0              <-- s2j
---        ---                
3          1              <- s1j+s2j

3          1              <- |Aj| (divisors)

.333       1              <-  p1j
.667       0              <-  p2j

-.528      0              <- p1j*log2(p1j)
-.390      0              <- p2j*log2(p2j)
---        ---
.918       0              <- I(s1j,s2j)= - sum of above
3          1              <- s1j+s2j

.689       0              <- (s1j+s2j)*I(s1j,s2j)/4

                      .689  <-  E(B4) (sum of above)
                      1.0   <-  I(s1,s2)
          GAIN(B4) =  .311  <-  I(s1,s2) - E(B4)

          GAIN(B3) =  1.0   <-  I(s1,s2) - E(B3)


So B3 is the decision attribute and so forth.

Note that no database scan has been needed at all!





ID3 DTI   





Bayesian Classification

Bayesian classifiers are statistical classifiers

7.4.1 Bayes Theorem

Let X be a data sample whose class label is unknown.
Let H be a hypothesis (ie, X belongs to class, C).
P(H|X) is the posterior probability of H given X.
P(H) is the prior probability of H.

Bayes Theorem:
P(H|X) = P(X|H)P(H)/P(X)
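
A small numeric illustration, using counts read off the 16-tuple training
set S of the decision tree example: let H be "the tuple has class
B1=0011" and X be "the tuple has B2=0011". The function name posterior is
an illustrative choice.

```python
from fractions import Fraction

def posterior(p_x_given_h, p_h, p_x):
    """Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# 4 of 16 tuples have class 0011; 4 of 16 tuples have B2=0011;
# 2 of the 4 class-0011 tuples have B2=0011.
p = posterior(Fraction(2, 4), Fraction(4, 16), Fraction(4, 16))
print(p)  # 1/2
```

The result agrees with counting directly: of the 4 tuples with B2=0011,
2 have class 0011.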

7.4.2 Naive Bayesian Classification


1. Each data sample is represented by feature vector, X=(x1..,xn)
depicting the measurements made on the sample from A1,..An, resp.


2. Given classes, C1,...Cm, the naive Bayesian Classifier will
 predict that an unknown data sample, X (with no class label),
 belongs to the class, Cj, having the highest posterior
 probability conditioned on X (Cj is called the maximum
 posteriori hypothesis):
 P(Cj|X) > P(Ci|X) for all i not equal to j.


P(Cj|X) = P(X|Cj)P(Cj)/P(X)



3. P(X) is constant for all classes so we maximize P(X|Cj)P(Cj).
  If we assume equal likelihood of classes, maximize P(X|Cj);
  else estimate P(Ci) as si/s.

      From the PC-cube we see that s is the overall tuple count
      and si is the rootcount of DRollup[Bcube->C]i

      (thus, assuming C=Bn, it is rc(VPCtree[Ci]) = rc(PCn1* AND
       ... AND PCnm*), where Ci is an m-bit string and there is
        a * for each 0 bit in the string)



4. To reduce the computational complexity of calculating all
 P(X|Cj)'s the naive assumption of conditional independence of
 values is often made (therefore the name "Naive Bayesian"),
 thus, P(X|Ci)=P(x1|Ci)*..*P(xn|Ci).

For categorical attributes, P(xk|Ci)=sixk/si  where sixk= # of
 training samples of class, Ci, having Ak-value xk
 (PCn,Ci ^ PCk,xk, which is one AND program).

For continuous attributes, use Gaussian distribution to estimate
  P(xk|Ci).

  Once the P(xk|Ci)'s are estimated, the model is "trained".



Example:
Consider the training set, S, where B1 is the class label attribute
S:  
 B1   B2   B3   B4
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011

__C1___   __C2___   __C3___   __C4___   __C5___
P1,0010   P1,0011   P1,0111   P1,1010   P1,1111
3         4         4         2         3      
0 0 3 0   4 0 0 0   0 4 0 0   0 0 1 1   0 0 0 3 

                       
P2,0010   P2,0011   P2,0111   P2,1010   P2,1011   
2         4         2         4         4     
0 2 0 0   2 2 0 0   2 0 0 0   0 0 0 4   0 0 4 0          

s1x2=0010 s1x2=0011 s1x2=0111 s1x2=1010 s1x2=1011
    0         0         0         0         1     <-- s1x2/s1
    0         .5        .5        0         0     <-- s2x2/s2
    .5        .5        0         0         0     <-- s3x2/s3
    0         0         0         .5        .5    <-- s4x2/s4
    0         0         0         1         0     <-- s5x2/s5


__C1___   __C2___   __C3___   __C4___   __C5___
P1,0010   P1,0011   P1,0111   P1,1010   P1,1111
3         4         4         2         3      
0 0 3 0   4 0 0 0   0 4 0 0   0 0 1 1   0 0 0 3 

P3,0100   P3,0101   P3,1000                                                  
6         2         8      
0 2 0 4   0 2 0 0   4 0 4 0                                                  

s1x3=0100 s1x3=0101 s1x3=1000                       
    0         0         1                         <-- s1x3/s1
    0         0         1                         <-- s2x3/s2
    .5        .5        0                         <-- s3x3/s3
    .5        0         .5                        <-- s4x3/s4
    1         0         0                         <-- s5x3/s5
                                                                             

__C1___   __C2___   __C3___   __C4___   __C5___
P1,0010   P1,0011   P1,0111   P1,1010   P1,1111
3         4         4         2         3      
0 0 3 0   4 0 0 0   0 4 0 0   0 0 1 1   0 0 0 3 

P4,1011   P4,1111
11        5      
3 4 0 4   1 0 4 0

s1x4=1011 s1x4=1111
    0         1                                   <-- s1x4/s1
    .75       .25                                 <-- s2x4/s2
    1         0                                   <-- s3x4/s3
    .5        .5                                  <-- s4x4/s4
    1         0                                   <-- s5x4/s5



5. In order to classify an unknown sample, X, P(X|Ci)P(Ci) is
 evaluated for each i, then X is assigned to the class for which
 it is maximum.  (  Evaluate, P(x1|Ci)*..*P(xn|Ci) * P(Ci)  )

s1x2=0010 s1x2=0011 s1x2=0111 s1x2=1010 s1x2=1011
    0         0         0         0         1     <-- s1x2/s1
    0         .5        .5        0         0     <-- s2x2/s2
    .5        .5        0         0         0     <-- s3x2/s3
    0         0         0         .5        .5    <-- s4x2/s4
    0         0         0         1         0     <-- s5x2/s5
s1x3=0100 s1x3=0101 s1x3=1000                       
    0         0         1                         <-- s1x3/s1
    0         0         1                         <-- s2x3/s2
    .5        .5        0                         <-- s3x3/s3
    .5        0         .5                        <-- s4x3/s4
    1         0         0                         <-- s5x3/s5
s1x4=1011 s1x4=1111
    0         1                                   <-- s1x4/s1
    .75       .25                                 <-- s2x4/s2
    1         0                                   <-- s3x4/s3
    .5        .5                                  <-- s4x4/s4
    1         0                                   <-- s5x4/s5

                           sixk/si's:    si/s  P(X|Ci)*P(Ci)
         x2   x3   x4     ------------   ----  -------------
Take X= 0011 1000 1011    0     1    0   3/16      0
                         1/2    1   3/4  4/16     3/32
                         1/2    0    1   4/16      0
                          0    1/2  1/2  2/16      0
                          0     0    1   3/16      0


So X is classified as C2.

So we see that, once the conditional probabilities, sixk/si, are
 derived from the P-trees, any new sample can be classified
 instantly.
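
The training and classification steps of this example can be sketched as
follows. The sketch counts over the plain tuples of S instead of ANDing
P-trees, and uses exact fractions so the result can be compared with the
table above; the function names train and classify are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Training set S; column 0 is the class label B1, columns 1..3 are B2..B4.
S = [line.split() for line in """
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011
""".strip().splitlines()]

def train(rows):
    si = Counter(r[0] for r in rows)            # class counts si
    sixk = defaultdict(Counter)                 # sixk[(class, k)][value] = count
    for r in rows:
        for k in range(1, len(r)):
            sixk[(r[0], k)][r[k]] += 1
    return si, sixk

def classify(x, si, sixk):
    s = sum(si.values())
    scores = {}
    for c, n in si.items():
        score = Fraction(n, s)                          # P(Ci) = si/s
        for k, xk in enumerate(x, start=1):
            score *= Fraction(sixk[(c, k)][xk], n)      # P(xk|Ci) = sixk/si
        scores[c] = score
    return max(scores, key=scores.get), scores

si, sixk = train(S)
label, scores = classify(["0011", "1000", "1011"], si, sixk)
print(label, scores[label])  # 0011 3/32
```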



How effective are Naive Bayesian Classifiers?

  - In theory they have low error rates in comparison to other
    classifiers.

  - In practice this is not always true, because the assumptions
    (e.g., class conditional independence) may not be valid.

  - Various empirical studies have found Naive Bayesian
    Classifiers to be comparable to decision tree and neural
    network classifiers in many domains.

  - They also provide a theoretical justification for other
    classifiers that do not explicitly use Bayes Theorem
    (e.g., under certain assumptions it can be shown that NN and
    curve-fitting algorithms (eg, ID3) output the "maximum
    posteriori hypothesis", as does the Naive Bayesian Classifier).




7.4.3 Bayesian Belief Networks (to handle cases where the naive
      assumption doesn't hold)

   - The Naive Assumption of "class conditional independence"
     (given the class label of a sample, the values of the
      attributes are conditionally independent of one another)
      allows use of the simplifying formula
      P(X|Ci)=P(x1|Ci)*..*P(xn|Ci).  When the assumption holds,
      it produces the most accurate classifier of all.

   - In practice dependencies can exist between attributes
     (variables).

      - In spatial datasets, one approach would be to select out
        attributes that are independent and then use Naive
        Bayesian Classifiers.  (e.g., select RIR and leave out
        G since there is correlation between them)

   - However, with PC-trees we have a way to calculate P(X|Ci)
     directly.

     It is the AND of the tuple PC-tree for X with the value
     PC-tree for Ci (noting that X is a tuple in Rel[X]
     not Rel - eliminating Coord and C)


   - Bayesian Belief Networks specify the joint conditional
     probabilities and allow class conditional independence
     to be defined between subsets of attributes (variables)
     namely those subsets that are conditionally
     independent of one another.

      - Note that the notion of functional dependence in
        normalizing relations
        is a specification of conditional dependence.


A Belief Network (or Bayesian Belief Network or Bayesian
  Network or Probabilistic Network) is composed of two
  components,


1.  an acyclic directed graph (nodes=random variables, which
    may be actual attributes or "hidden variables" such as a
    medical syndrome in medical data; edges=probabilistic
    dependencies between variables)

      - each variable is conditionally independent of its
        non-descendants, given its parents.

2.  a Conditional Probability Table (CPT) for each variable,
    Z, specifying P(Z|Parents(Z)) for every combination of
    values of Z's parents.


7.4'  Non-Naive Bayesian Classifier (New section, shortcut to
      Bayesian Belief Net for spatial data with Ptrees).

We can use Bayesian Classification directly, without the
   simplifying Naive assumption that
   P(X|Ci)=P(x1|Ci)*..*P(xn|Ci), since we can compute the actual
   P(X|Ci) directly (in fact it is a simpler program than the
   above) as:  rc( TPC(X) ^ VPC(Ci) ) / si.

We do not need Bayesian belief networks to estimate these numbers!
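
A sketch of this direct computation for X = (0011, 1000, 1011), using
16-character 0/1 strings as uncompressed stand-ins for the P-trees (one
bit per tuple of S, in listing order). The bitmaps below were read off
the training set S of the Naive Bayes example; the names AND, rc, VPC and
TPC_X are illustrative, and the tuple bitmap is formed here by ANDing the
three value bitmaps, which is an assumption about how TPC(X) is obtained.

```python
from fractions import Fraction

def AND(a, b):
    """Bitwise AND of two equal-length 0/1 strings."""
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))

def rc(bits):
    """Root count: number of tuples whose bit is 1."""
    return bits.count("1")

# Class bitmaps for B1.
VPC = {
    "0010": "0000000011001000",
    "0011": "1100110000000000",
    "0111": "0011001100000000",
    "1010": "0000000000100100",
    "1111": "0000000000010011",
}
# Value bitmaps for the attribute values of X.
P2_0011 = "0110011000000000"
P3_1000 = "1100110011001100"
P4_1011 = "1011111100110011"

TPC_X = AND(AND(P2_0011, P3_1000), P4_1011)   # tuples equal to X on B2..B4
s = 16
scores = {}
for c, bitmap in VPC.items():
    p_x_given_c = Fraction(rc(AND(TPC_X, bitmap)), rc(bitmap))   # actual P(X|Ci)
    scores[c] = p_x_given_c * Fraction(rc(bitmap), s)            # times P(Ci)
best = max(scores, key=scores.get)
print(best, scores[best])  # 0011 1/16
```

The class chosen is the same as in the naive calculation, but the score
differs (1/16 rather than 3/32) because no independence is assumed.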



Bayesian Classifiers   



Classification by Backpropagation

   - A Neural network is a set of connected input/output units
      - Each connection has a weight
      - In the learning phase, the network adjusts the weights
        to learn to predict the class of input samples.

   - Backpropagation is a particular Neural Network learning
     algorithm
      - It operates on a "multilayer feed-forward network"
Multi-layer Feed-Forward Neural Network


          Input          Hidden          Output
          layer          layer           layer

         .----.          .----.          .----.
 x1      |    |----------|    |----------|    |- >
         `----'-.      .-`----'-.      .-`----'
               \ `-..-' /      \ `-..-' /       
         .----..\-'  `-/..----..\-'  `-/..----.
 x2      |    |--\----/--|    |--\----/--|    |- >
         `----'\  \  /  /`----'\  \  /  /`----'
           .    \  \/  /   .    \  \/  /   .   
                 `./\.'          `./\.'         
           .      /\/\     .      /\/\     .     
                 / /\ \          / /\ \           
           .    /.'  `.\   .    /.'  `.\   .       
         .----./'      `\.----./'      `\.----.
 xi      |    |----------|    |----------|    |- >
         `----'   wij    `----'    wjk   `----'
                           Oj              Ok

   - Inputs correspond to attributes from training samples.

   - Weighted outputs of the Input units are fed to the Hidden
     units (there may be more than one Hidden layer).

   - Weighted outputs of last Hidden layer's units are fed to the
     Output units.

   - Output units emit the network's prediction for the given
     samples.

   - Hidden and Output units are often referred to as "neurodes"

   - An n-layer NN has n layers other than the Input layer
       (includes Hidden and Output).

   - NN is "feed-forward" since none of the weights cycle back to
     an input unit or output unit of a previous layer.


Defining a Network Topology

   - Specify the number of Input units
   - Specify the number of Hidden layers
   - Specify the number of Hidden units in each Hidden layer
   - Specify the number of Output units

   - Normalizing the input values for each attribute in the
     training set speeds up training.

Backpropagation

   - learns by iteratively,

     -  processing a set of training samples,

     -  comparing the network's prediction for each sample with
        the actual known class label.

     -  For each training sample, weights are modified to minimize
        mean-square error between the network's prediction and the
        actual class.

     -  These modifications are made in a "backwards" direction,
        from Output layer, through each Hidden layer to the Input
        layer

     -  The weights will (usually) eventually converge, and the
        learning process stops.


The Backpropagation Algorithm:

(1) Initialize all weights and biases in "network";
(2) while terminating condition is not satisfied {
(3)   for each training sample X in "samples" {
(4)       // Propagate the inputs forward:
(5)       for each hidden or output layer unit j {
(6)           Ij = SUM(i)[ wij*Oi ] + THETAj;
                  // compute the net input of unit j wrt
                  // the previous layer, i
(7)           Oj = 1/(1+e^(-Ij)); }  // compute the output of
                                     // each unit, j
(8)       // Backpropagate the errors:
(9)       for each unit j in the output layer
(10)         Errj = Oj(1-Oj)(Tj-Oj);   // compute the error
(11)      for each unit j in the hidden layers, from last to
          1st hidden layer
(12)         Errj = Oj(1-Oj)*SUM(k)[ Errk*wjk ];
          // compute the error with respect to the next higher
          // layer, k
(13)      for each weight wij in "network" {
(14)         DELTAwij = (l)*Errj*Oi;  // weight increment
(15)         wij = wij + DELTAwij; }  // weight update
(16)      for each bias THETAj in "network" {
(17)         DELTA(THETAj) = (l)*Errj;  // bias increment
(18)         THETAj = THETAj + DELTA(THETAj); }  // bias update
(19)      }}
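
One pass of steps (4) through (18) for a single training sample can be
sketched in plain Python as below. The sketch assumes a network with one
hidden layer and one output unit, computes every hidden error as
Oj(1-Oj)*SUM(k)[Errk*wjk], and applies every weight increment as
(l)*Errj*Oi. The 3-2-1 network, its weights, the sample and the learning
rate are hypothetical values chosen only for illustration, and the
function name backprop_step is likewise an assumption.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def backprop_step(x, target, w_ih, b_h, w_ho, b_o, rate):
    """One case update. w_ih[i][j]: input i -> hidden j; w_ho[j]: hidden j -> output."""
    n_h = len(b_h)
    # (4)-(7) propagate the inputs forward
    o_h = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(len(x))) + b_h[j])
           for j in range(n_h)]
    o_out = sigmoid(sum(w_ho[j] * o_h[j] for j in range(n_h)) + b_o)
    # (9)-(12) backpropagate the errors (using the weights before the update)
    err_out = o_out * (1 - o_out) * (target - o_out)
    err_h = [o_h[j] * (1 - o_h[j]) * err_out * w_ho[j] for j in range(n_h)]
    # (13)-(18) weight and bias updates
    new_w_ho = [w_ho[j] + rate * err_out * o_h[j] for j in range(n_h)]
    new_b_o = b_o + rate * err_out
    new_w_ih = [[w_ih[i][j] + rate * err_h[j] * x[i] for j in range(n_h)]
                for i in range(len(x))]
    new_b_h = [b_h[j] + rate * err_h[j] for j in range(n_h)]
    return o_out, err_out, err_h, new_w_ih, new_b_h, new_w_ho, new_b_o

# Hypothetical network and sample.
x, target = [1, 0, 1], 1
w_ih = [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]]
b_h = [-0.4, 0.2]
w_ho = [-0.3, -0.2]
b_o = 0.1
o_out, err_out, err_h, w_ih2, b_h2, w_ho2, b_o2 = backprop_step(
    x, target, w_ih, b_h, w_ho, b_o, rate=0.9)
print(round(o_out, 3), round(err_out, 4))  # 0.474 0.1312
```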


(1) The weights in the network are initialized to small
      random numbers (e.g., from -1.0 to 1.0, or -0.5 to 0.5).

Each unit has a "bias", also initialized to a small random number.


For the jth layer (as inputs to the jth layer, where the Input
    layer is 0) j=1..m (m+1 layers total, including input):

| w11 w12... w1n1 | 
| w21 w22... w2n1 |
| w31 w32... w3n1 |
|  .              | = W1
|  .              |
|  .              |
|wn01 wn02.. wn0n1|

| z(1)1 | 
| z(1)2 | 
| .     |  = Z1
| .     | 
| z(1)n1| 

etc.

(4)-(7) Net input to Hidden/Output unit,
    j: Ij=SUM(i)[wij*Oi]+zj where wij is the weight of the
    connection from unit, i, in previous layer to unit, j;
Oi is the output of unit i; zj is the bias of the unit
(the bias acts as a threshold, varying the activity of the unit)


Each unit takes its net input and applies an "activation function"

      - it symbolizes the activation of the neuron

      - the logistic or sigmoid function (or "squashing fctn",
        since it maps a large input domain into [0,1]) is used:
        given net input, Ij, to unit j, the output of unit j
        is: Oj = 1/(1+e^-Ij)

      - the logistic fctn is nonlinear and differentiable,
        allowing backprop algorithm to model classification
        problems that are linearly inseparable 


So the net input, Ij, to unit-j, given
  - output from previous-layer, unit-i of Oi,
  - connection weight, wij,
  - bias zj,
is, for all the units of a layer at once (the previous layer has
n(j-1) units and this layer has nj units):

                  | w11     w12    ... w1,nj     |
                  | w21     w22    ... w2,nj     |
(O1 O2..On(j-1)) *| w31     w32    ... w3,nj     | + ( z1 z2 ... znj )
                  |  .                           |
                  |  .                           |      = ( I1 ... Inj )
                  |  .                           |
                  | wn(j-1),1    ...  wn(j-1),nj |


and at each layer, 

Oj = f(Ij) = 1 / ( 1 + e^-( SUM(i)[wij*Oi] + zj ) )

We will write it using matrix notation as follows:



At layer j,
   outputs from the previous layer are O(j-1)
                        weights are Wj
                        biases  are Zj
                        inputs  are Ij
                        outputs are Oj (after applying activation
                                        fctn)


  O(j-1)*Wj+Zj =>  Ij  =>  Oj=f(Ij)
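
The layer-by-layer rule can be sketched in plain Python, treating the
previous layer's outputs as a row vector. The function names and the
small test network (weights, biases, input) are hypothetical and chosen
so that the result is easy to verify by hand.

```python
from math import exp

def layer_forward(o_prev, W, Z):
    """O(j-1)*Wj + Zj => Ij => Oj = f(Ij) for one layer.
    o_prev: row vector of previous-layer outputs; W[i][j]: weight unit i -> unit j."""
    I = [sum(o_prev[i] * W[i][j] for i in range(len(o_prev))) + Z[j]
         for j in range(len(Z))]
    return [1.0 / (1.0 + exp(-v)) for v in I]

def forward(x, layers):
    """Feed x through a list of (W, Z) pairs, one pair per non-input layer."""
    o = x
    for W, Z in layers:
        o = layer_forward(o, W, Z)
    return o

# Hypothetical 2-2-1 network: every net input works out to 0, so every
# unit outputs f(0) = 0.5.
hidden = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
output = ([[2.0], [-2.0]], [0.0])
print(forward([3.0, 3.0], [hidden, output]))  # [0.5]
```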
              

(8)-(18) The error is propagated backwards once the output of
  the Output layer is computed, to update weights and biases.
  For Output unit, Om, Errm=Om(1-Om)(Tm-Om);  Om is "actual"
  output and Tm is the "true" output based on the known class
  label of the training sample

Noting that for f(x)= 1/(1+e^-x),     f'(x)= e^-x / (1+e^-x)^2

   and f(x)*(1-f(x))= 1/(1+e^-x) * (1 - 1/(1+e^-x)) =
                                             e^-x / (1+e^-x)^2

we see that Om(1-Om) = f'(Im), the slope of the activation
  function at the unit's net input, so Errm = f'(Im)*(Tm-Om):
  the output difference, DELTA(y) = (Tm-Om), is scaled by the
  slope, y' = Om(1-Om) (a straight line assumption).
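
The identity f'(x) = f(x)*(1-f(x)) for the logistic function
f(x) = 1/(1+e^-x) can be confirmed numerically. This is only a sanity
check under illustrative names; the sample points are arbitrary.

```python
from math import exp

def f(x):
    """Logistic function."""
    return 1.0 / (1.0 + exp(-x))

def f_prime(x):
    """Closed-form derivative of the logistic function."""
    return exp(-x) / (1.0 + exp(-x)) ** 2

h = 1e-6
for x in (-2.0, 0.0, 0.5, 3.0):
    # closed form agrees with f(x)*(1-f(x)) ...
    assert abs(f_prime(x) - f(x) * (1.0 - f(x))) < 1e-12
    # ... and with a central-difference estimate of the slope
    assert abs((f(x + h) - f(x - h)) / (2 * h) - f_prime(x)) < 1e-8
print(f_prime(0.0))  # 0.25
```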

The error in a Hidden layer-j, use the weighted sum of the errors
  of the units connected to j from the next layer:
                               Errj = Oj(1-Oj)*SUM(k)[Errk*wjk]

 where wjk=weight of connection from unit-j to a unit-k in the
 next higher layer and Errk is the error of unit-k.

Weights are updated:                  DELTAwij = (l) * Errj * Oi
        and wij = wij + DELTAwij
        (Oi = output of unit-i, which feeds unit-j)
        where l=learning rate, a constant, typically in (0,1).

   - Backpropagation learns using a method of gradient descent
     to search for a set of weights that can model the given
     classification problem so as to minimize the mean squared
     distance between the network's class prediction and the
     actual class label of samples.

     The learning rate helps to avoid getting stuck at a local
     minimum in decision space; if too low, learning is very slow.
     If too high, thrashing between suboptimals can occur.
     A rule of thumb is to set the learning rate to 1/t
     where t=number of iterations through the training set so far.


Biases are updated:                   DELTA(zj) = (l) * Errj
        and zj = zj + DELTA(zj)
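
One case update (forward pass, error terms, then weight and bias
changes) can be sketched as below for a network with a single hidden
layer.  This is a minimal illustration under assumptions: the function
names, the 2-2-1 network and every number are invented for the example.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ih, z_h, w_ho, z_o):
    """Return (hidden outputs, output-layer outputs) for input vector x."""
    hid = [logistic(sum(x[i] * w_ih[i][j] for i in range(len(x))) + z_h[j])
           for j in range(len(z_h))]
    out = [logistic(sum(hid[j] * w_ho[j][k] for j in range(len(hid))) + z_o[k])
           for k in range(len(z_o))]
    return hid, out

def case_update(x, target, w_ih, z_h, w_ho, z_o, rate):
    """One case update: compute all errors first, then adjust in place."""
    hid, out = forward(x, w_ih, z_h, w_ho, z_o)
    # Output units: Err_k = O_k (1 - O_k) (T_k - O_k)
    err_o = [o * (1 - o) * (t - o) for o, t in zip(out, target)]
    # Hidden units: Err_j = O_j (1 - O_j) * SUM_k Err_k * w_jk (old weights)
    err_h = [hid[j] * (1 - hid[j]) *
             sum(err_o[k] * w_ho[j][k] for k in range(len(out)))
             for j in range(len(hid))]
    # w = w + rate * Err(receiving unit) * O(sending unit)
    for j in range(len(hid)):
        for k in range(len(out)):
            w_ho[j][k] += rate * err_o[k] * hid[j]
    for i in range(len(x)):
        for j in range(len(hid)):
            w_ih[i][j] += rate * err_h[j] * x[i]
    # z = z + rate * Err
    for k in range(len(out)):
        z_o[k] += rate * err_o[k]
    for j in range(len(hid)):
        z_h[j] += rate * err_h[j]
    return err_o, err_h

# Illustrative 2-2-1 network; every number here is invented.
w_ih = [[0.2, -0.3], [0.4, 0.1]]
z_h = [-0.4, 0.2]
w_ho = [[-0.3], [-0.2]]
z_o = [0.1]
x, target = [1.0, 0.0], [1.0]

before = forward(x, w_ih, z_h, w_ho, z_o)[1][0]
err_o, err_h = case_update(x, target, w_ih, z_h, w_ho, z_o, rate=0.9)
after = forward(x, w_ih, z_h, w_ho, z_o)[1][0]
```

With these numbers the output moves toward the target of 1 after the
update (before < after), which is the behaviour the update rules are
meant to produce.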

Here we are updating the weights and biases after presentation of
 each sample (case updating).  Alternatively, weight and bias
 updates (DELTAs) can be accumulated in variables so that
 updating can be applied after the entire training set has been
 presented (epoch updating).

(one iteration through the training set is an epoch)

In theory (mathematical) epoch updating is better, yet in
 practice, case updating is more common since it tends to yield
 more accurate results.


(2)-(3) Training stops when either

   - all DELTAwij in the previous epoch were so small as to be
     below some threshold or

   - the % of samples misclassified in the previous epoch is
     below some threshold or

   - a pre-specified number of epochs has expired.

(in practice several hundred thousand epochs may be required.)
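
The three stopping tests can be folded into one check that runs at the
end of each epoch.  A minimal sketch; the function name should_stop and
the default thresholds are assumptions, not values from the notes.

```python
def should_stop(max_abs_delta, misclassified_pct, epoch,
                delta_threshold=1e-4, error_threshold=1.0, max_epochs=100000):
    """Return the reason to stop after an epoch, or None to keep training.

    max_abs_delta:     largest |DELTAwij| seen in the epoch just finished
    misclassified_pct: % of samples misclassified in that epoch
    epoch:             number of epochs completed so far
    The three thresholds are illustrative defaults only.
    """
    if max_abs_delta < delta_threshold:
        return "weight changes below threshold"
    if misclassified_pct < error_threshold:
        return "misclassification below threshold"
    if epoch >= max_epochs:
        return "epoch limit reached"
    return None
```

A training loop would call should_stop once per epoch and break as soon
as it returns a reason.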

          Input          Hidden          Output
         .----.          .----.          .----.
 x1      | x1 |----------| X1 |----------| y1 |- > y1
         `----'-.      .-`----'-.      .-`----'
               \ `-..-' /      \ `-..-' /       
         .----..\-'  `-/..----..\-'  `-/..----.
 x2      | x2 |--\----/--| X2 |--\----/--| y2 |- > y2
         `----'\  \  /  /`----'\  \  /  /`----'
           .    \  \/  /   .    \  \/  /   .   
                 `./\.'          `./\.'         
           .      /\/\     .      /\/\     .     
                 / /\ \          / /\ \           
           .    /.'  `.\   .    /.'  `.\   .       
         .----./'      `\.----./'      `\.----.
 xI      | xI |----------| XJ |----------| yK |- > yK
         `----'   wij  zj`----'    Wjk ZK`----'


(x1..xI)*|w11..w1J|+|z1|=>f=>(X1..XJ)*|W11..W1K|+|Z1|=>f=>(y1..yK)   
         | .    . | |. |              |.    .  | |. |      
         |wI1..wIJ| |zJ|              |WJ1..WJK| |ZK|      

**************************************************************







Other Classification Methods

k-nearest Neighbor Classifiers (based on learning by analogy)

- an unknown sample is assigned to the most common class among
   its k-nearest neighbors in n-space.

- instance based.

- lazy or "as you go" learner (by contrast to decision trees where
  the classifier is constructed before new samples are considered)

- With respect to spatial data in REL organization, if B1 is the
  class label attribute, what should be meant by the k-nearest
  ngbrs?

- Let's assume there is one REL dataset for learning and the new
  samples are separate from it (e.g., for RGBY data, take the
  point of view that we use last year's dataset with RGB and Y
  to train and are interested in classifying this year's RGB data
  to predict the Y).
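
For comparison, the traditional (scan-based) k-nearest ngbr vote can be
sketched as follows, assuming a weighted Manhattan distance as the
measure of nearness.  The function names and the tiny training set are
made up for the illustration.

```python
from collections import Counter

def wm_dis(x, y, weights):
    """Weighted Manhattan distance over the non-class attributes."""
    return sum(w * abs(yi - xi) for w, xi, yi in zip(weights, x, y))

def knn_classify(train, x, k, weights):
    """train: list of (class_label, attribute_tuple); returns majority label."""
    ranked = sorted(train, key=lambda row: wm_dis(x, row[1], weights))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]

# Tiny made-up training set: (label, (B2, B3, B4))
train = [("a", (1, 1, 1)), ("a", (2, 1, 1)), ("a", (1, 2, 2)),
         ("b", (9, 9, 9)), ("b", (8, 9, 9))]
label = knn_classify(train, (2, 2, 1), k=3, weights=(1.0, 1.0, 1.0))
```

Nothing is built ahead of time here: all the work happens when the new
sample arrives, which is what makes the method a lazy learner.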



A Spatial k-nearest ngbr algorithm

Assume we have basic Ptrees for the training set.

We find the k-nearest ngbrs to a new sample, x, and then
   predict the class of x to be the majority class among those
   k ngbrs.

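So we will find the closest k (or more) training tuples, based on
  a weighted Manhattan distance on the non-class attribute values
  (e.g., if B1 is the Class label attribute,
  wm_dis(x,y) = SUM(i=2..n)[wi*|yi-xi|], where each weight wi >= 0).

Let Px be the Ptree of the training tuples that match x on all
  non-class attribute values, and rc(Px) its rootcount.

1. If rc(Px) >= k done.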
   (class label is the one that gives the max rootcount when its
   Ptree is ANDed with Px - i.e., we compute rc(Px^Pci) for each
   class label, ci and assign the one that gives max rootcount.)

2. If rc(Px) < k, remove the lowest-order bit from the
   highest-weight band value of x,
   (we will call the resulting tuple, x also - since it is just
    the original tuple x, with its Bi-value generalized one
    level up the value concept hierarchy to a 7-bit value instead
    of an 8-bit value).

Repeat 1 and 2 until rc(Px) >= k

(note, when we have removed the low order or 8th bit from all
 of the non-class-label attributes of x, we proceed to removing
 the 7th bit one attribute at a time, then the 6th bit and so
 forth.)

(note, we can decide to remove several bits at a time so as to
 reduce the complexity.  We may get a ngbr set that has many more
 than k ngbrs in it but that shouldn't be a problem.  If for some
 reason it seems important to get the smallest ngbr set that
 qualifies (closest to k) rather than ordering the attributes by
 "importance" we could calculate the ngbr set size for each
 attribute during each "bit removal pass" and pick the one that
 gives us the best ngbrset...  Lots of variations are possible.)

(note, while calculating the rc's above it would make sense to
 have an accumulator for the rc's for each attribute value
 for several of the passes (8-bit, 7-bit, ...  values).

 This can be done with a single scan parallel program (lots of
 accumulators however).  This gives us maximum flexibility in
 deciding the best ngbrset.  We could also be computing the
 Px^Pci rootcounts during this one single scan pass).

(Note, in the event that we get through 1-bit values without the
 ngbrset reaching size k (could that happen?  How?  And if so,
 what could be done about it?), we could resort to the
 traditional training set scan to classify that particular new
 sample.)


Example:
Training Dataset (B1 is the class label attribute and k=5):
X-Y  B1   B2   B3   B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
Consider the new sample:     x = ---- 1011 1000 1111

The basic PC_trees in PQ-list form:
PQ11:  23 3
PQ12:  1 31 32 33
PQ13:  pure
PQ14:  0 1 31 32 33

PQ21: 2 3
PQ22: 00 02
PQ23: 0 1 2 3
PQ24: 0 10 12 2

PQ31: 0 2
PQ32: 1 3
PQ33: null
PQ34: 11 13

PQ41: pure
PQ42: 01 2
PQ43: pure
PQ44: pure

Assume the weights order the bands from high-to-low B2,B3,B4
Consider the new sample:     x = ---- 1011 1000 1111
C = {0010 0011 0111 1010 1111} (class labels)
The needed PQ-seq's are:
PQ1,0010: 20 21 22
PQ1,0011: 0
PQ1,0111: 1
PQ1,1010: 23 30
PQ1,1111: 31 32 33

PQ2,1011: 2
PQ3,1000: 0 2 
PQ4,1111: 01 2

1. If rc(Px) >= k done.  (class is s.t.  rc(Px^Pci) is max.)

   Px: 2  rc(Px)=4  NOT >= k=5

2. If rc(Px) <  k, remove the loworder bit from the next band value...
   Take off the loworder bit from B2:

PQ2,101 : 2 3  (gives the same result for Px so do same with B3)
PQ3,100 : 0 2  (gives the same result)
PQ4,111 : 01 2  (gives the same result)

next loworder bit removal:
PQ2,10 : 2 3  (gives the same result for Px so do same with B3)
PQ3,10 : 0 2  (gives the same result)
PQ4,11 : 01 2  (gives the same result)

next loworder bit removal:
PQ2,1 : 2 3  (gives the same result for Px so do same with B3)
PQ3,1 : 0 2  (gives the same result)
PQ4,1 : pure  
------------
PQx:    2    (gives the same result)

next loworder bit removal:
PQ2,- : pure  (all 4 bits of the B2 value removed)
PQ3,1 : 0 2
PQ4,1 : pure  
------------
PQx:    0 2  has rc = 8 >= 5.

Since rc(Px) >= k, the class is s.t.  rc(Px^Pci) is max:

PQ1,0010: 20 21 22
PQx:      0 2
--------------
          20 21 22   rc= 3

PQ1,0011: 0
PQx:      0 2
--------------
          0          rc= 4

PQ1,0111: 1
PQx:      0 2
--------------
          null       rc= 0

PQ1,1010: 23 30
PQx:      0 2
--------------
          23         rc= 1

PQ1,1111: 31 32 33
PQx:      0 2
--------------
          null       rc= 0

Thus, the class label for x is 0011
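
The example can be replayed without Ptrees by treating each "bit
removal" as a prefix match on the band values.  This is only a sketch
under assumptions: plain Python lists stand in for the Ptrees and their
rootcounts, the names neighbor_set and classify are invented, and bits
are stripped from the bands in the order B2, B3, B4.

```python
from collections import Counter

# Training set from the example, in the order 0,0 ... 3,3: (B1, B2, B3, B4).
TABLE = """
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011
"""
train = [tuple(int(v, 2) for v in line.split())
         for line in TABLE.strip().splitlines()]

def neighbor_set(train, x, kept, width=4):
    """Tuples agreeing with x on the kept[i] high-order bits of each band."""
    return [row for row in train
            if all((row[i + 1] >> (width - kept[i])) == (x[i] >> (width - kept[i]))
                   for i in range(len(x)))]

def classify(train, x, k, width=4):
    kept = [width] * len(x)          # bits still compared per band
    nbrs = neighbor_set(train, x, kept, width)
    while len(nbrs) < k and any(kept):
        # Strip one low-order bit from the band keeping the most bits;
        # ties go to the earlier (higher-weight) band.
        i = kept.index(max(kept))
        kept[i] -= 1
        nbrs = neighbor_set(train, x, kept, width)
    votes = Counter(row[0] for row in nbrs)   # plays the role of rc(Px^Pci)
    return votes.most_common(1)[0][0], len(nbrs), votes

x = (0b1011, 0b1000, 0b1111)          # B2, B3, B4 of the new sample
label, size, votes = classify(train, x, k=5)
print(format(label, "04b"), size)
```

On this data the ngbr set stays at 4 tuples until every bit of B2 has
been removed, then jumps to 8, and the vote goes to class 0011 with
counts 4, 3 and 1 for 0011, 0010 and 1010, matching the rootcounts
above.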

*********Notes *********************************************
Problems?

1. Consider the problem of a sample that is positioned right
 near a quadrant boundary, so that it has large numbers of
 ngbrs which don't appear to be ngbrs in the Ptree.
  (This may not be a problem, since we are dealing
   with whole values.  The real problem is 2.)

2. For a value like, 0111, note that it is at the edge,
   not the middle of the intervals,
   [0110,0111],
   [0100,0111],
   which are the ngbrhds used when removing the first 2
   low-order bits (note that the same thing happens with 1111
   but it is inevitably at the edge of all ngbrhds,
   while 0111 is not.)

   Better, 1st "nbrd" be [0110,1000] = [6,8]
                    2nd, [0100,1001] = [4,9].

   Or even better, 1st: [0110,1000] = [6,8],
                   2nd: [0101,1001] = [5,9].

[0111,0111]  [7,7]
[0110,1000]  [6,8]
[0101,1001]  [5,9]

Question:
In removing a loworder bit, can it be accomplished by ORing? e.g.,

To get:
PQ2,101 : 2 3

can we just OR:

PQ2,1011: 2
PQ24':    11 13 3
OR---------------
          11 13 2 3

where PQ24' is the comp of PQ24: 0 10 12 2
apparently not!

Note:
P2,101  =  P2,1010 v P2,1011 = (P2,101 ^ P24') v P2,1011 =
                     = (P2,101 v P2,1011) ^ (P24' v P2,1011)
                     =  P2,101            ^ (P24' v P2,1011) 
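
The same point can be checked on the example's B2 column, using sets of
pixel positions as a stand-in for the Ptrees (an assumption made only
for this sketch; the helper name pixels is invented).

```python
# B2 column of the example training set, pixels in the order 0,0 ... 3,3.
B2 = [int(v, 2) for v in
      "0111 0011 0011 0010 0111 0011 0011 0010 "
      "1011 1011 1010 1010 1011 1011 1010 1010".split()]

def pixels(pred):
    """Stand-in for a Ptree: the set of pixel positions where pred holds."""
    return {i for i, v in enumerate(B2) if pred(v)}

p_1011 = pixels(lambda v: v == 0b1011)       # P2,1011
p_1010 = pixels(lambda v: v == 0b1010)       # P2,1010
p_101 = pixels(lambda v: v >> 1 == 0b101)    # P2,101
p24_comp = pixels(lambda v: v & 1 == 0)      # P24' (low-order bit is 0)

# Pixels that the plain OR with P24' wrongly adds to P2,101.
extra = (p_1011 | p24_comp) - p_101
```

Here extra holds positions 3 and 7, i.e. pixels 0,3 and 1,3 (11 and 13
in the quadrant numbering), whose B2 value 0010 ends in 0 but does not
start with 101.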

It's clear there is no way to construct, e.g., P21 from P2,11
 and the basic, P22 or its comp, since P2,11 is 1 where both
 P21 and P22 are 1.  Knowing where P22 is 1 doesn't tell me
 which of the pixels for which P22 is 0 have a 1 in P21.

That is to say, a 0 in P2,11 where P22 is also 0 tells me
 nothing about P21 at those pixels (it could be 0 or 1).

Therefore we need to retain all info on a subcube as we go to
 avoid further ANDing:

So, to answer the classification question (using our "nearest
 ngbr" like approach) we need to have filled in a cube:

Consider, again, the new sample:  x = ---- 1011 1000 1111 and
C = {0010 0011 0111 1010 1111} (class labels).

We need the cube bounded by all of 5 B1-values
 (the entire B1 dimension) and
P2,
   1011            [11,11]
   101   1100      [10,12]
   1001  1101      [ 9,13]
   1     111       [ 8,14]
   0111  1111      [ 7,15]
   0110            [ 6,15]
   0101            [ 5,15]
   01              [ 4,15]
   0011            [ 3,15]
   0010            [ 2,15]
   0001            [ 1,15]

Of these, the ones we see in the basic algorithm
 (removal of loworder bits) are:
P2,
   1011            [11,11]
   101             [10,11] not seen above
   10              [ 8,11] not seen above
   1               [ 8,15] not seen above

If we also include those needed to balance the intervals:
P2,
   1011            [11,11]
   101  1100       [10,12] not seen above
   10   111        [ 8,14] not seen above
   1               [ 8,15] not seen above

How far out should the intervals go before we stop
    (and consider the new sample an outlier - at which point
     we take the majority class of the ngbr-set, if there is
     one, else take the majority class of the sample space)?

   - One thought would stop after Radius =
                 ROOF{SQRT(|S|) / ROOF[SQRT(|S|/k)]}

     Rationale:  If the samples are uniformly distributed with
     duplicity=k, each duplicity group would be at an
     intersection of grid lines with the above spacing.

   - SQRT(|S|/k) = SQRT(16/5) = 1.79, roof is 2.
     so R = 4 / 2 = 2
P2,
   1011            [11,11]
   101  1100       [10,12]
   10   111        [ 8,14]

P3,
   1000            [ 8, 8]
        0111  1001 [ 7, 9]
        011   1010 [ 6,10]

P4,
   1111            [15,15]
        111   1111 [14,15]
        11    1111 [13,15]

then once the ngbrset is found, AND with the following
 to classify
P1,0010
   0011
   0111
   1010
   1111
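
The stopping radius suggested above is a one-line computation.  A
sketch, with stop_radius as an assumed name:

```python
import math

def stop_radius(n_samples, k):
    """ROOF{ SQRT(|S|) / ROOF[ SQRT(|S|/k) ] }."""
    return math.ceil(math.sqrt(n_samples) / math.ceil(math.sqrt(n_samples / k)))

radius = stop_radius(16, 5)   # |S| = 16 training tuples, k = 5
```

For the running example this reproduces R = 2.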



Misc Classification   















Cluster Analysis



What is Cluster Analysis?

- The process of grouping a set of physical or abstract objects
  into classes of similar objects.


A Powerpoint presentation on Clustering


What are some typical applications of clustering?

- Business: Help marketers discover distinct groups in their
  customer bases and characterize customer groups

- Biology: Derive plant and animal taxonomies;
  Categorize genes with similar functionality;
  Gain insight into structures inherent in populations;

- Land use: Identify areas of similar land use in an earth
  observation database;

- Insurance: Identify groups of houses in a city according to
  house type, value, geographic location;
  Identify policy holders with average claim costs

- WWW: Classify documents on the web for information discovery.

- Data Mining: Stand-alone tool to gain insight into the
  distribution of data,
  Observe characteristics of each cluster;

- Data clustering includes contributions from data mining,
  statistics, machine learning, spatial databases, biology,
  marketing.

- As a branch of statistics, cluster analysis has been studied
  extensively
     - focused mainly on distance-based cluster analysis,
     - tools are built into S-Plus, SPSS, SAS

- In machine learning, cluster analysis is an example of
  unsupervised learning.
     - does not rely on predefined classes or class-labeled
       training examples
     - form of learning by observation, rather than learning by
       example,

- In conceptual clustering, a group of objects forms a class
  only if it is describable by a concept.
     - differs from conventional clustering, which measures
       similarity based on distance.
     - Conceptual Clustering consists of two components:
       (1) it discovers the appropriate classes
       (2) it forms descriptions for each class, as in
           classification
     - The guideline of striving for high intraclass similarity
       and low interclass similarity still applies.

- In data mining, cluster analysis has focused on:
     - finding methods for efficient and effective cluster
       analysis in large databases,
     - scalability of clustering methods,
     - effectiveness of methods for clustering complex shapes
       and types of data,
     - high-dimensional clustering techniques,
     - methods for clustering mixed numerical and categorical
       data in large DBs

- The following are typical requirements of clustering in data
  mining:

   - Scalability (cluster larger datasets in reasonable time?)

   - Deal with different types of attributes
     (binary, categorical (nominal), ordinal, mixtures)

   - Discovery of clusters with arbitrary shape
     (Euclidean or Manhattan distance produces spherical clusters
      with similar size and density)
     What about arbitrary shapes?
     ("spatial clustering" deals with shaped clusters)

   - Minimal requirements for domain knowledge to determine input
     parameters:
     (parameters such as # of clusters may be hard to determine
      a priori)
     Cluster algorithm should be robust and insensitive wrt the
     inputs

   - Ability to deal with noisy data
     (a "noise" point is also called an "outlier")
     (insensitivity to outliers, missing data, unknowns,
      erroneous data..)

   - Insensitivity to order of input records

   - High dimensionality
     (human eye not good at judging cluster quality for more
      than 3 dimensions)

   - Constraint-based clustering
     (there may be side constraints as well as a "distance")
     ("spatial clustering" deals with side conditions also)

   - Interpretability and usability
     (users expect interpretable, comprehensible and usable
      results)

- Study of clustering methods proceeds as follows:
     present general categorization of clustering methods,
     study each method in detail, including methods based on
       partitioning, hierarchical, density-based, grid-based,
       model-based
     examine high-dimensionality and do outlier analysis



Types of Data That Occur in Cluster Analysis
 (and how to preprocess them)

- Suppose dataset to be clustered contains n objects, which may
  represent persons, houses, documents, countries, pixels,
  genes...

- Clustering algorithms typically operate on either a
  "data matrix" or a "dissimilarity matrix"

- Data Matrix (or object-by-variable structure or "two mode"):

   - represents n objects (persons?) by p variables (measurements
     or attributes) (such as height, weight, gender, race...).
     The structure is in the form of a relational table or
     n-by-p matrix (n objects, p variables)

   - in our spatial DM the objects are pixels and the attributes
     are bands.

        x11  x12  ...  x1p
        x21  x22  ...  x2p
         .    .         .
         :    :         :
        xn1  xn2  ...  xnp

   - This is the relational or table view of the data,
     R(K,A1,...,Ap) where K is a key id attribute to identify
     objects uniquely and each Ai is a column in the Matrix.

   - in spatial DM this is just the REL organization in which
     each row is a tuple corresponding to a particular pixel.

- Dissimilarity Matrix (or object-by-object structure or
  "one mode"):

   - Stores collection of proximities available for all pairs of
     n objects.  Often represented by an n-by-n table:

        0
        d(2,1)   0
          .      .
          :      :
        d(n,1)  d(n,2) ...  0

   - Where d(i,j) is measured difference or dissimilarity between
     objects i & j.

   - d(i,j) is a non-neg number close to 0 when objects are
     similar or "near".

   - in our precision ag example, a distance measure might be:
     the distance between two tuples, t and t', is
        |2*t.Y + t.SM + t.N - (2*t'.Y + t'.SM + t'.N)|
     This was used, essentially, by Kaushik Das in his thesis
     work.  We will look at his clustering software later
     (based on SOMs and NNs).

   - Many clustering algorithms operate on a dissimilarity
     matrix, but a data matrix can be transformed into a
     dissimilarity matrix.

   - in the spatial setting, this can be a prohibitively large
     matrix:
     - For a TM scene, it is ~(40,000,000)^2 /2 or
       800,000,000,000,000 cells (800 trillion!)


Interval-scaled Variables
 (continuous on a linear scale: weight,height,lat,lon,..)

- units can affect clustering results (inches versus meters)

- smaller units lead to larger ranges for that variable and
  therefore larger clustering effect for that variable.

- To avoid units effects, data should be standardized
   - convert to "unitless" measurements by:

     (1) Calculate the mean absolute deviation for a variable
         (attribute), f,
            sf=(|x1f-mf|+|x2f-mf|+...+|xnf-mf|)/n
            mf=mean of f = (x1f+..+xnf)/n

     (2) Calculate the standardized measurement, or z-score:
            zif=(xif-mf)/sf
            sf=mean absolute deviation (dis from mean is not
               squared)

     (3) median absolute deviation...

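
Steps (1) and (2) can be sketched as one small function.  The function
name and the sample heights are assumptions made for the illustration;
the point is that the z-scores no longer depend on the unit.

```python
def standardize(values):
    """z-scores using the mean absolute deviation s_f (not the std dev)."""
    n = len(values)
    m_f = sum(values) / n                          # mean of f
    s_f = sum(abs(v - m_f) for v in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are equal; nothing to standardize")
    return [(v - m_f) / s_f for v in values]

heights_in = [60.0, 66.0, 72.0]                # made-up heights in inches
heights_cm = [v * 2.54 for v in heights_in]    # the same heights in cm
z_in = standardize(heights_in)
z_cm = standardize(heights_cm)
```

Both lists come out as -1.5, 0, 1.5 (up to rounding), so inches versus
centimeters no longer matters to a distance computed on the z-scores.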
- Once standardized, similarity/dissimilarity calculated based
  on "distance":

     (1) Euclidean: d(i,j)=SQRT(|xi1-xj1|^2 +..+ |xip-xjp|^2)

     (2) Manhattan: d(i,j)=    (|xi1-xj1|   +..+ |xip-xjp|  )

         Both are reflexive ( d(i,i)=0 ),
                  symmetric ( d(i,j)=d(j,i) ) and
                  subtransitive ( satisfy triangle inequality
                                  d(i,j) <= d(i,h)+d(h,j) )

     (3) Minkowski (generalization of both):
            d(i,j)=(|xi1-xj1|^q +..+ |xip-xjp|^q)^(1/q)

     (4) weighted Minkowski:
            d(i,j)=(w1*|xi1-xj1|^q +..+ wp*|xip-xjp|^q)^(1/q)



A Categorization of Major Clustering Methods


Partitioning Methods

- Given a DB of n objects (tuples), a partitioning method
  constructs k partitions of the data, where each partition
  represents a cluster and k<=n

- ie, classify into k groups, that together satisfy:
     (1) each group must contain >=1 object,
     (2) each object must belong to exactly 1 group
     (partitions are mutually exclusive and collectively
      exhaustive)
     (can be relaxed to a fuzzy partition)

- Given k (# partitions to construct) create initial partition,
  then use iterative relocation technique that attempts to
  improve partitioning by moving objects.

- General criterion for good partitioning is that same-cluster
  objects are "close" and different-cluster objects are
  "far apart"

- To achieve global optimality would require exhaustive
  enumeration of all possibilities.

- Heuristics:
     (1) k-means algorithm, where each cluster is represented
         by the mean value of its objects
     (2) k-medoids algorithm, where each cluster is represented
         by an object near the center
         (center = 1st moment - minimizes the sum of the
          distances from it to its cluster mates.
          center = 2nd moment, etc.)

- Works well finding spherical clusters in small-medium sized
  datasets.


Hierarchical Methods

- Agglomerative (bottom-up) (starts with each object in its own
                             cluster)
  Divisive      (top-down)  (starts with all objects in one
                             cluster)

Agglomerative    step0      step1      step2      step3      step4
(AGNES)       -----+----------+----------+----------+----------+-->

                   a--------.
                             ab-----------------------------.
                   b--------'                                 abcde
                   c-------------------------------cde------'
                   d-------------------.          /
                                        de-------'
                   e-------------------'

Divisive         step4      step3      step2      step1      step0
(DIANA)       <----+----------+----------+----------+----------+----

AGNES (AGglomerative NESting) places each object in its own
  cluster initially.  2 clusters are merged iteratively according
  to some criterion, usually minimum cluster distance
  (see options for distance between 2 clusters below).

DIANA (DIvisive ANAlysis) all objects form one cluster initially.
  Clusters are split according to some principle, usually maximum
  pairwise cluster distance.

In either, user can specify desired number of clusters as a
  termination condition.

Four widely used cluster distances are:
  (where |p-q| is distance between objects)

  1. Minimum distance: Dmin(Ci,Cj)  = min(p in Ci, q in Cj)|p-q|
  2. Maximum distance: Dmax(Ci,Cj)  = max(p in Ci, q in Cj)|p-q|
  3. Mean    distance: Dmean(Ci,Cj) = |mi - mj|
  4. Average distance: Davg(Ci,Cj)  =
                       1/(ni*nj) SUM(p in Ci)SUM(q in Cj)|p-q|
  (mi = mean of Ci, ni = # of objects in Ci)

- Suffer from the fact that once a split or merge is done, it
  cannot be undone (result in error?).

- Improvements:
     (1) perform careful analysis of object linkages at each
         hierarchical clustering (CURE and Chameleon)
     (2) integrate hierarchical agglomerative and iterative
         relocation by 1st using a hierarchical agglomerative
         algorithm & then refining the result using iterative
         relocation (BIRCH)


Density-based methods (non-distance based).

- continue growing the given cluster as long as the density
  (# of objects or data points in the nghd) exceeds some
  threshold (DBSCAN, OPTICS)


Grid-based Methods (all clustering operations are performed on a
                    grid structure)

- fast processing independent of # of data objects, and dependent
  only on # cells in each dimension of the quantized space.
  (STING, CLIQUE, WaveCluster)


Model-based Methods (hypothesize a model for each of the
                     clusters, and find the best fit of the data
                     to the given model.)

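
The four cluster distances listed under Hierarchical Methods can be
sketched directly from their definitions.  This assumes a Minkowski
distance for |p-q| (Euclidean by default); the function names and the
two sample clusters are invented for the illustration.

```python
def dist(p, q, power=2):
    """Minkowski distance |p-q|; power=2 is Euclidean, power=1 Manhattan."""
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1.0 / power)

def mean_point(cluster):
    """Component-wise mean of the objects in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def d_min(ci, cj):
    return min(dist(p, q) for p in ci for q in cj)

def d_max(ci, cj):
    return max(dist(p, q) for p in ci for q in cj)

def d_mean(ci, cj):
    return dist(mean_point(ci), mean_point(cj))

def d_avg(ci, cj):
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

# Two made-up clusters lying on a line.
ci = [(0.0, 0.0), (1.0, 0.0)]
cj = [(4.0, 0.0), (6.0, 0.0)]
```

For these clusters Dmin is 3, Dmax is 6, and Dmean and Davg both come
out at 4.5; AGNES with the usual criterion would merge the pair of
clusters with the smallest Dmin.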
Should Ptrees lend themselves to grid-based clustering methods?
 (Since the recursive quadrantization is a gridding of the space.
  Or is the gridding usually on other than the key attribute?)

 However, if we grid (quadrantize) on the other attributes the
 resulting structure should serve the grid approach to clustering
 well.

 - Construct a Ptree of the Pcube?

   [Diagram: a 4x4x4 cube with axes Band1 (depth), Band2
    (vertical) and Band3 (horizontal), each labelled
    0 =00, 1 =01, 2 =10, 3 =11.  Arrows 0->1, 2->3, 4->5, etc.
    trace the order in which the cells are visited.]

 Gives a Ptree with fanout=8 (focus on the rootcounts of each
 tree only).

   Root
     P(0,0,0) P(0,0,1) P(0,1,0) P(0,1,1)
     P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)

        each with fanout=8 in turn, e.g., under P(1,0,0):

        P(10,00,00) P(10,00,01) P(10,01,00) P(10,01,01)
        P(11,00,00) P(11,00,01) P(11,01,00) P(11,01,01)

 We certainly can look for grid based clusters in this tree, but
 it is LARGE in general.

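
The rootcounts at depth 1 of such a tree are just tuple counts per
octant.  A minimal sketch, under the assumption that we grid on
(B1, B2, B3) of the 16-tuple training set used elsewhere in these
notes, with the high-order bit of each 4-bit value choosing the octant;
the name octant_counts is invented.

```python
from collections import Counter

# (B1, B2, B3) of the 16 training tuples, in the order 0,0 ... 3,3.
ROWS = """
0011 0111 1000
0011 0011 1000
0111 0011 0100
0111 0010 0101
0011 0111 1000
0011 0011 1000
0111 0011 0100
0111 0010 0101
0010 1011 1000
0010 1011 1000
1010 1010 0100
1111 1010 0100
0010 1011 1000
1010 1011 1000
1111 1010 0100
1111 1010 0100
"""
tuples = [tuple(int(v, 2) for v in line.split())
          for line in ROWS.strip().splitlines()]

def octant_counts(tuples, depth=1, width=4):
    """Count tuples per cell P(b1,b2,b3), keeping `depth` high-order bits."""
    shift = width - depth
    return Counter(tuple(v >> shift for v in t) for t in tuples)

counts = octant_counts(tuples)            # the 8 children of Root
finer = octant_counts(tuples, depth=2)    # the 64 cells one level down
```

Only 5 of the 8 octants are occupied here; P(0,0,0), P(0,0,1) and
P(1,1,0) hold 4 tuples each, P(0,1,1) holds 3 and P(1,1,1) holds 1.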
 If we are interested only in "dense clusters" we could place 1
 in a node only if the octant has more than, e.g., twice its
 share (i.e., at depth-1: more than 1/4 of total count) etc.

 - then we have a Boolean tree which should identify clusters.

 - how about a 1-bit iff the octant has more than its share??
   (compression not as good?)



Partitioning Methods (more detail)

- Given a database with n objects (tuples) and k=# of clusters to
  form, a partition algorithm organizes objects into k
  partitions, where each partition represents a cluster.

- The clusters are formed to optimize an objective partitioning
  criterion, often called a "similarity function", (e.g.,
  distance) so that objects within a cluster are similar and
  objects of different clusters are dissimilar in terms of the
  database attributes.


Classical Partitioning Methods: k-means and k-medoids

The most well-known and commonly used partitioning methods are
these.

Algorithm:  (k-means: based on the mean value of the objects in
             the cluster)

Input:   The number of clusters, k, and a database containing n
         objects.

Output:  A set of k clusters that minimize the squared-error
         criterion.

Method:

(1) arbitrarily choose k objects as the initial cluster centers.
(2) repeat
(3)    (re)assign each object to the cluster to which the object
       is most similar based on the mean value of the objects in
       the cluster;
       ( using E=SUM(i=1..k)[ SUM(p in Ci)[|p-mi|^2]]
         where mi=mean of Ci )
(4)    update the cluster means, i.e., calculate the mean value
       of the objects for each cluster;
(5) until no change;


P-trees might be used in applying k-means-CPM as follows.

Data:
X-Y  B1   B2   B3   B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011

If we consider only 4-bit values and the corresponding P-trees:

P1,0000 P1,0100 P1,1000 P1,1100 P1,0010 P1,0110 P1,1010 P1,1110
   0       0       0       0       3       0       2       0
                                0 0 3 0         0 0 1 1
                                    3             3 0

P1,0001 P1,0101 P1,1001 P1,1101 P1,0011 P1,0111 P1,1011 P1,1111
   0       0       0       0       4       4       0       3
                                4 0 0 0 0 4 0 0         0 0 0 3
--------------