Machine Learning (ML)

is an older term for Data Mining. It covered two, CLASSIFICATION and CLUSTERING,

of the three Data Mining areas: Association Rule Mining, Classification and Clustering.

A still older term, Artificial Intelligence (AI), included all of these and much more.







CLASSIFICATION is the central area of the three!







Given a (large) TRAINING SET T(A1, ..., An, C) with  CLASS    C
                                               and  FEATURES  A ≡ (A1,...,An)
C-CLASSIFICATION of an unclassified
sample, (a1,...,an) is just:

           SELECT    T.C
           FROM      T

           WHERE     T.A1 = a1
           AND       T.A2 = a2
           ...
           AND       T.An = an

           GROUP BY  T.C
           ORDER BY  COUNT(*) DESC
           LIMIT     1;
 
i.e., just a SELECTION, since C-Classification is assigning to (a1..an)
                             the most frequent C-value among the T-tuples with A=(a1..an).
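
The selection-and-vote above can be sketched in a few lines of Python. This is only an illustration: the function name c_classify, the tuple layout ((features), class) and the toy training set are assumptions, not part of the notes.

```python
from collections import Counter

def c_classify(training, sample):
    """Return the most frequent class among training tuples whose features
    equal `sample` exactly, or None when the equality selection is empty."""
    votes = Counter(c for features, c in training if features == sample)
    if not votes:
        return None  # empty selection: a near-neighbor search is needed instead
    return votes.most_common(1)[0][0]

# Hypothetical toy training set T(A1, A2, C), stored as ((A1, A2), C)
T = [((1, 0), "yes"), ((1, 0), "yes"), ((1, 0), "no"), ((0, 1), "no")]

print(c_classify(T, (1, 0)))  # yes  (2 votes against 1)
print(c_classify(T, (1, 1)))  # None (no exact match)
```

Returning None for an empty selection marks exactly the case where a fuzzy query is needed.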




But, if the EQUALITY SELECTION is empty,
     then we need a FUZZY QUERY to find NEAR NEIGHBORs (NNs)
                                instead of exact matches. 

That's Nearest Neighbor Classification (NNC).




If SQL had a good Nearest Neighbor Set operator, we would be done.
But it doesn't, so NNC is essentially building a good NEAR NEIGHBOR operator.







E.g.,

Medical Expert System (Ask a Nurse): Symptoms plus past diagnoses
                                     are collected into a table called CASES

For each undiagnosed new_symptoms,
CASES is searched for matches:             SELECT DIAGNOSIS
                                           FROM   CASES
                                           WHERE  CASES.SYMPTOMS = new_symptoms;
If     there is a predominant DIAGNOSIS,
Then   report it,

ElseIf there's no predominant DIAGNOSIS,
Then   Classify instead of Query, i.e.,
       find fuzzy matches (near nbrs)      SELECT DIAGNOSIS
                                           FROM   CASES
                                           WHERE  CASES.SYMPTOMS ≅ new_symptoms
Else   call your doctor in the morning

       That's exactly (Nearest Neighbor) Classification!!






CAR TALK radio show: Click and Clack the Tappet brothers have a vast
      TRAINING SET of car problems and solutions built from experience.

      They search that TRAINING SET for close matches to predict solutions
           based on previous successful cases.

      That's exactly (Nearest Neighbor) Classification!!






We all perform Nearest Neighbor Classification every day of our lives.

E.g., We learn when to apply specific programming/debugging techniques so that
      we can apply them to similar situations thereafter.






COMPUTERIZED NNC ≡ MACHINE LEARNING

                 (most clustering (which is just partitioning) is done as
                                   a simplifying prelude to classification).









Again, given a TRAINING SET, R(A1,..,An,C), with C=CLASSES and (A1..An)=FEATURES

Nearest Neighbor Classification (NNC) ≡

    selecting a set of R-tuples with similar features (to the unclassified sample)

            and then letting the corresponding class values vote.
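
A minimal sketch of "select similar tuples, then let the classes vote", assuming Euclidean distance as the similarity measure and k = 3 voters. Both choices, the name knn_classify and the toy set R are illustrative assumptions.

```python
from collections import Counter
import math

def knn_classify(training, sample, k=3):
    """Vote among the k training tuples closest to `sample` (Euclidean distance)."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], sample))[:k]
    votes = Counter(c for _, c in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set R(A1, A2, C), stored as ((A1, A2), C)
R = [((1.0, 1.0), "low"), ((1.2, 0.9), "low"),
     ((5.0, 5.1), "high"), ((5.2, 4.8), "high"), ((4.9, 5.3), "high")]

print(knn_classify(R, (1.1, 1.0)))  # low  (two "low" neighbors outvote one "high")
print(knn_classify(R, (5.0, 5.0)))  # high
```

No model is built here: every classification scans the training set, which is why the method is called lazy.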





Nearest Neighbor Classification won't work very well if
                 the   vote is inconclusive (close to a tie)
                 or if similar (near) is not well defined. In that case we

                 build a MODEL of the TRAINING SET
                                (at, possibly, great 1-time expense?)






When a MODEL is built first, the technique is called Eager classification,

whereas

model-less methods like Nearest Neighbor are called Lazy or Sample-based.









Eager Classifier models can be:

                              decision trees,
                              probabilistic models (Bayesian Classifier), 
                              Neural Networks,
                              Support Vector Machines, ...








How do you decide when an EAGER model is good enough to use?
How do you decide if a Nearest Neighbor Classifier is working well enough?




We have a TEST PHASE.

    Typically, we set aside some training tuples as a Test Set.
    (Then, of course, those Test tuples cannot be used in model building
                                    and cannot be used as nearest neighbors.)
   
If the classifier passes the test
(a high enough % of Test tuples are correctly
 classified by the classifier), it is accepted.
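
The test phase can be sketched as a holdout split plus an accuracy check. The 30% test fraction, the 0.9 acceptance threshold and all names below are assumptions chosen for illustration; the notes do not fix them.

```python
import random

def holdout_split(tuples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the tuples and set aside a fraction as the Test Set."""
    shuffled = list(tuples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (training, test)

def accuracy(classifier, test_set):
    """Fraction of Test tuples whose predicted class equals the known class."""
    correct = sum(1 for features, c in test_set if classifier(features) == c)
    return correct / len(test_set)

def passes_test(classifier, test_set, threshold=0.9):
    return accuracy(classifier, test_set) >= threshold

# Hypothetical data: one numeric feature, class given by its sign
data = [((x,), "pos" if x > 0 else "neg") for x in range(-10, 11) if x != 0]
training, test_set = holdout_split(data)
sign_rule = lambda features: "pos" if features[0] > 0 else "neg"
print(passes_test(sign_rule, test_set))  # True
```

The split returns disjoint lists, so Test tuples never take part in model building or serve as neighbors.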











EXAMPLE 1:

Computer Ownership TRAINING SET for predicting who owns a computer:

 Customer  Age   Salary   Job           Owns Computer
       1 |  24 | 55,000 | Programmer  | yes
       2 |  58 | 94,000 | Doctor      | no 
       3 |  48 | 14,000 | Laborer     | no 
       4 |  58 | 19,000 | Domestic    | no 
       5 |  28 | 18,000 | Construction| no 

A classifier might be built from this TRAINING SET (e.g., a decision tree) as follows:

                Age < 30
                /       \
              T           F
             /             \
      Salary > 50K          No
        /        \
      T            F
     /              \
 Yes                 No              
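
The tree above is small enough to write as nested tests. A sketch (the function name owns_computer is an assumption) that also checks the tree against the five training tuples:

```python
def owns_computer(age, salary):
    """Decision tree from the diagram: Age < 30, then Salary > 50K."""
    if age < 30:
        return "yes" if salary > 50_000 else "no"
    return "no"

# (Age, Salary, Owns Computer) for customers 1..5 from the table above
TRAINING = [(24, 55_000, "yes"), (58, 94_000, "no"), (48, 14_000, "no"),
            (58, 19_000, "no"), (28, 18_000, "no")]

print(all(owns_computer(age, sal) == owns for age, sal, owns in TRAINING))  # True
```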

It is easy to spot a pattern in this small dataset. For large
 datasets, however, it is impractical to construct a decision tree model by "sight".

Therefore we need a Model Building Algorithm, or training algorithm.








EXAMPLE 2:

PRECISION AG YIELD CLASSIFIER predicts YIELD of a field grid cell
 based on mid-year Blue, Green, Red, NearInfraRed reflectances from that cell.
 The TRAINING SET is R(CELL, Blue, Green, Red, NIR, YIELD) from previous year.

1st Separate out a Test Set.

2nd Build a CLASSIFIER MODEL (decision tree) from remaining TRAINING SET

3rd Test MODEL accuracy using the Test Set.  If it passes the test,

       then when an aerial photo is taken during the growing season,
            predict where low yield can be expected using the MODEL
            (then apply additional nutrients to those cells?)





TRAINING SET

 X  Y   Blue_____   Green____    Red_____    NIR_____   YIELD_
 0  0 | 0000 1001 | 1010 1111 | 0000 0110 | 1111 0101 | medium  
 0  1 | 0000 1011 | 1011 0100 | 0000 0101 | 1111 0111 | medium  
 0  2 | 0000 1011 | 1011 0101 | 0000 0100 | 1111 0111 | high   
 0  3 | 0000 0111 | 1011 0111 | 0000 0011 | 1111 1000 | high  
 0  4 | 0000 0111 | 1011 1011 | 0000 0001 | 1111 1001 | high 
 0  6 | 0000 1000 | 1011 1111 | 0000 0000 | 1111 1011 | high
 1  0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium  
 2  1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium 
 3  2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium
 4  3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high 
 5  4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high
 6  6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high    





Separate out as TEST SET

 X  Y   Blue_____   Green____    Red_____    NIR_____   YIELD_
 1  0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium  
 2  1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium 
 3  2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium
 4  3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high 
 5  4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high
 6  6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high
 
                           
                           


TRAIN a Classifier with the remainder (a decision tree)

REMAINDER of the TRAINING SET

 X  Y   Blue_____   Green____    Red_____    NIR_____   YIELD_
 0  0 | 0000 1001 | 1010 1111 | 0000 0110 | 1111 0101 | medium 
 0  1 | 0000 1011 | 1011 0100 | 0000 0101 | 1111 0111 | medium  
 0  2 | 0000 1011 | 1011 0101 | 0000 0100 | 1111 0111 | high 
 0  3 | 0000 0111 | 1011 0111 | 0000 0011 | 1111 1000 | high  
 0  4 | 0000 0111 | 1011 1011 | 0000 0001 | 1111 1001 | high   
 0  6 | 0000 1000 | 1011 1111 | 0000 0000 | 1111 1011 | high    



                    ____________________________________
                   /                   |                \
                 /                     |                  \
               /                       |                    \
             /                         |                      \
   NIR ≤ 01000000          01000000 < NIR ≤ 11110111        11110111 < NIR
 ^ Red ≥ 00100000        ^ 00100000 > Red ≥ 00000101      ^ 00000101 > Red
       /                               |                           \
     /                                 |                             \
YIELD = low                      YIELD = medium                   YIELD = high

                                             




TEST Classifier            

TEST SET

 X  Y   Blue_____   Green____   Red_____    NIR_____    YIELD_   PREDICTED YIELD 
 1  0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium  | medium
 2  1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium  | medium
 3  2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium  | medium
 4  3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high    | high
 5  4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high    | high
 6  6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high    | high

Tests out to be 100% correct (Gets an A grade!).
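
The test can be replayed in code. This sketch assumes that "^" in the tree means AND and that reflectances compare as unsigned 8-bit integers; predict_yield is a hypothetical name. A sample matching no branch gets None.

```python
def predict_yield(red, nir):
    """Three-branch tree from the diagram ("^" read as logical AND)."""
    if nir <= 0b01000000 and red >= 0b00100000:
        return "low"
    if 0b01000000 < nir <= 0b11110111 and 0b00100000 > red >= 0b00000101:
        return "medium"
    if nir > 0b11110111 and red < 0b00000101:
        return "high"
    return None  # not covered by any branch

# (Red, NIR, YIELD) for the six Test Set cells
TEST_SET = [(0b00000111, 0b11110100, "medium"), (0b00000110, 0b11110110, "medium"),
            (0b00000101, 0b11110110, "medium"), (0b00000010, 0b11111000, "high"),
            (0b00000010, 0b11111000, "high"),   (0b00000001, 0b11111010, "high")]

correct = sum(predict_yield(red, nir) == y for red, nir, y in TEST_SET)
print(correct / len(TEST_SET))  # 1.0
```

Only Red and NIR appear because the tree never tests Blue or Green.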





USE Classifier Model (decision tree) to classify:

New Data: R,G,B,NIR from an aerial image taken on ~4th of July:


 X  Y   Blue_____   Green____  Red______   NIR______
 0  6 | 0001 1100 | 1011 1110 | 0000 0001 | 1111 1110
                              
                    ___________________ ===================  
                   /                   |                    \\ 
                 /                     |                     \\ 
               /                       |                      \\ 
             /                         |                       \\ 
   NIR ≤ 01000000          01000000 < NIR ≤ 11110111        1111 0111 < NIR
 ^ Red ≥ 00100000        ^ 00100000 > Red ≥ 00000101      ^ 0000 0101 > Red
       /                               |                          \\
     /                                 |                           \\ 
YIELD = low                      YIELD = medium                   YIELD = high


                                             
















Preparing Data for Classification


  Data Cleaning (of noise and missing values)

     Remove Noise (or reduce noise) by "smoothing"

     Fill in missing values (with most common or some statistical value)

                 NOTE: Even Noise and Missing Value management can be done by a
                       Nearest Neighbor Vote!  (called interpolation)

     Feature Extraction to eliminate irrelevant attributes (e.g., in the PA example,
                 eliminate Blue, Green since they're irrelevant to the decision).









Ways of Comparing Different Classification Methods

     Predictive Accuracy (predicting the class label of new data)

     Speed (computation costs for generating and using the model)

     Robustness (does it give almost the same predictions when
                 the Training Sets are almost the same?)

     Scalability (Model construction efficiency - massive datasets)










More Detail on Some Classification Methods:






K-Nearest-Neighbor Classification   







Decision Tree Models for EAGER CLASSIFICATION:

                     each internal node is a test on a feature attribute (composite?),

                     each test outcome is assigned a link to the next level
                                   (outcome=a value or range of values or...)
                     each leaf represents a class (or distribution of classes)



Unknown samples are classified by testing their feature attributes against the tree.


The leaf arrived at holds the class prediction for that sample.

     Some branches may represent noise or outliers (and should be pruned?)


The ID3 algorithm for inducing a decision tree from training tuples is:


   1. The tree starts as a single node containing the entire TRAINING SET.

   2. If all TRAINING TUPLES have the same class, this node is a leaf. DONE.

   3. Otherwise, use a measure, information gain, as a heuristic for
      selecting the best decision attribute for that node.

   4. A branch is created for each value [interval of values] of that test attribute
      and the TRAINING SET is partitioned accordingly.

   5. Recurse on 2, 3, 4 for each partition until a Stopping Condition is true.






Possible Stopping Conditions:

All samples for a given node belong to the same class (label with that class).

There are no remaining candidate decision attributes (label with plurality class).

Some other stopping rule.
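
The five steps and the first two stopping conditions can be sketched as a short recursive function. This is a simplified illustration, not a full ID3: it handles only categorical attributes, does no pruning, and stores the tree as nested dicts (an assumed representation). The toy data is Example 1 with Age and Salary discretized by hand into "young" and "high_salary", which is also an assumption.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting (rows, labels) on attribute `attr`."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    expected = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(labels) - expected

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                 # step 2: one class left, make a leaf
        return labels[0]
    if not attrs:                             # no candidates left: plurality class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))        # step 3
    node = {"attr": best, "branches": {}}
    remaining = [a for a in attrs if a != best]
    for value in set(row[best] for row in rows):                  # step 4
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = id3([r for r, _ in pairs],      # step 5
                                      [l for _, l in pairs], remaining)
    return node

def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attr"]]]
    return tree

rows = [{"young": True,  "high_salary": True},  {"young": False, "high_salary": True},
        {"young": False, "high_salary": False}, {"young": False, "high_salary": False},
        {"young": True,  "high_salary": False}]
labels = ["yes", "no", "no", "no", "no"]
tree = id3(rows, labels, ["young", "high_salary"])
print(tree)
```

On this toy data both attributes have the same gain at the root, and max() keeps the first listed, so the result has the same shape as the hand-built tree of Example 1.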










Information Gain as an Attribute Selection Measure

           Minimizes expected number of tests needed to classify an object
           and guarantees simple tree (not necessarily the simplest)


At any stage, let

S          = a TRAINING SUBSET with s = |S| samples.

C1,...,Cm  = the distinct classes in S, and si = |S∩Ci|.




EXPECTED INFORMATION needed to classify a sample given S as TRAINING SET is:

I(s1..sm) = -SUM(i=1..m)[ pi*log2(pi) ],     pi = si/s = |S∩Ci|/|S|


Choosing A, with values a1,...,av, as decision attribute partitions S into
A1,...,Av, where Aj = {t in S : t(A)=aj}. The expected information still
needed after that split (the entropy of the partition) is

E(A) = SUM(j=1..v)[ (s1j+..+smj)/s * I(s1j..smj) ]     where sij = |Aj∩Ci|




Gain(A) = I(s1..sm) - E(A)

   - expected reduction of info required to classify
     after splitting via A-values.

The algorithm above computes the information gain of each
 attribute and selects the one with the highest information gain
 as the test attribute.

Branches are created for each value of that attribute and samples
 are partitioned accordingly.
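
The three quantities can be computed directly from class counts. In this sketch, table[j][i] holds the number of samples of class Ci that fall in the j-th subset of the split. The function names and the 2x2 example table are assumptions made for illustration.

```python
import math

def info(counts):
    """Expected information I for a list of class counts (0*log2(0) taken as 0)."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

def expected_info_after_split(table):
    """E(A): weighted average of I over the subsets produced by attribute A."""
    s = sum(sum(subset) for subset in table)
    return sum(sum(subset) / s * info(subset) for subset in table)

def gain(table):
    n_classes = len(table[0])
    class_totals = [sum(subset[i] for subset in table) for i in range(n_classes)]
    return info(class_totals) - expected_info_after_split(table)

# Hypothetical split into two subsets: [2 of C1, 2 of C2] and [4 of C1, 0 of C2]
table = [[2, 2], [4, 0]]
print(round(info([6, 2]), 3))                        # 0.811
print(round(expected_info_after_split(table), 3))    # 0.5
print(round(gain(table), 3))                         # 0.311
```

The pure second subset contributes nothing to E(A), which is why attributes that produce pure subsets score high.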



Pruning
=======

When a decision tree is built, many of the branches will reflect
 anomalies in the training data due to noise or outliers.

Tree pruning methods address this problem of "overfitting" the
 data (classifying situations that are erroneous or accidental).

Such methods typically use statistical measures to remove the
 least reliable branches, resulting in faster classification and
 an improvement in the ability of the tree to correctly classify
 independent test data.




Extracting Classification Rules from Decision Trees

One rule for each path from root to leaf.

Each (attr,value) pair along the path forms a conjunct in the antecedent.
   
The leaf holds the class prediction, or consequent.

Rules may be easier for humans to understand.
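
A sketch of rule extraction by walking every root-to-leaf path. The nested-dict tree layout and the IF/THEN text format are assumptions; the example tree is the computer-ownership tree of Example 1.

```python
def extract_rules(tree, path=()):
    """Return one (conditions, class) pair per root-to-leaf path."""
    if not isinstance(tree, dict):          # leaf: the consequent
        return [(path, tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += extract_rules(subtree, path + ((tree["attr"], value),))
    return rules

def format_rule(conditions, cls):
    antecedent = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"IF {antecedent} THEN class = {cls}"

tree = {"attr": "Age<30", "branches": {
            "T": {"attr": "Salary>50K", "branches": {"T": "Yes", "F": "No"}},
            "F": "No"}}

for conditions, cls in extract_rules(tree):
    print(format_rule(conditions, cls))
# IF Age<30 = T AND Salary>50K = T THEN class = Yes
# IF Age<30 = T AND Salary>50K = F THEN class = No
# IF Age<30 = F THEN class = No
```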

    

More on Decision Tree Induction (powerpoint Introduction)   







Note that the notion of "near" requires a distance or similarity measure to exist.
What are some of them?

Metrics (distance functions on feature space)   
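
Four commonly used distance functions, sketched for equal-length feature tuples. The selection is illustrative, not a list taken from the notes. The two sample points are the pixels at X-Y = 0,0 and 2,0 of the training data in the example below, as (B1, B2, B3, B4).

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def max_distance(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

x, y = (3, 7, 8, 11), (2, 11, 8, 15)
print(manhattan(x, y), round(euclidean(x, y), 3), max_distance(x, y), hamming(x, y))
# 9 5.745 4 3
```

Which tuples count as "near" depends on the metric chosen, so the choice can change the vote.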







The example:

Training Data:


Band B1:      Band B2:      Band B3:      Band B4:
 3  3  7  7    7  3  3  2    8  8  4  5   11 15 11 11
 3  3  7  7    7  3  3  2    8  8  4  5   11 11 11 11
 2  2 10 15   11 11 10 10    8  8  4  4   15 15 11 11
 2 10 15 15   11 11 10 10    8  8  4  4   15 15 11 11

S:  
X-Y  B1   B2   B3   B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011


Suppose that B1 is the class label attribute (e.g., Yield)
Then the class labels are 2, 3, 7, 10, 15 (C1,..,C5).

We need to know the count of the number of pixels (rows in
 the table above) that contain each value in each attribute.

We also need to know the count of pixels that contain pairs of
 values, one from a descriptive attribute and the other from the
 class label attribute.

Moreover we may wish to focus on only a portion of the dataset
 (some part of the field) before making those count calculations.

The Ptree structure is perfect for providing those counts.
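
A sketch of the counting idea only, not of the compressed P-tree structure itself: split a band into bit planes, then report the root count and the 1-bit counts of the four quadrants. The quadrant order (upper-left, upper-right, lower-left, lower-right) is an assumption inferred from the counts listed in the basic P-tree tables below. A real P-tree would keep subdividing mixed quadrants.

```python
B1 = [[3, 3, 7, 7],
      [3, 3, 7, 7],
      [2, 2, 10, 15],
      [2, 10, 15, 15]]

def bit_plane(band, bit, width=4):
    """Bit number `bit` (1 = most significant) of every pixel value."""
    shift = width - bit
    return [[(v >> shift) & 1 for v in row] for row in band]

def quadrant_counts(plane):
    """(root count, [UL, UR, LL, LR] quadrant counts) of a square bit plane."""
    h = len(plane) // 2
    quads = [sum(plane[r][c] for r in range(r0, r0 + h) for c in range(c0, c0 + h))
             for r0 in (0, h) for c0 in (0, h)]
    return sum(quads), quads

for bit in (1, 2, 3, 4):
    print(f"P1,{bit}:", quadrant_counts(bit_plane(B1, bit)))
# P1,1: (5, [0, 0, 1, 4])
# P1,2: (7, [0, 4, 0, 3])
# P1,3: (16, [4, 4, 4, 4])
# P1,4: (11, [4, 4, 0, 3])
```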


B11  B12  B13  B14
0000 0011 1111 1111
0000 0011 1111 1111
0011 0001 1111 0001
0111 0011 1111 0011

BASIC_PTREES_band1___________________
P1,1      P1,2      P1,3      P1,4
5         7         16        11
0 0 1 4   0 4 0 3             4 4 0 3
    3           0                   0  <-where "different" bit is

VALUE_PTREES_band1___________________ (2-bit precision, 3, etc):
P1(00)    P1(01)    P1(10)    P1(11)
7         4         2         3    
4 0 3 0   0 4 0 0   0 0 1 1   0 0 0 3
    3                   3 0         0

P1(000) P1(010) P1(100) P1(110) P1(001)  P1(011)  P1(101)  P1(111)
0       0       0       0       7        4        2        3      
                                4 0 3 0  0 4 0 0  0 0 1 1  0 0 0 3 
                                    3                 3 0        0

P1(0000 P1(0100 P1(1000 P1(1100 P1(0010  P1(0110  P1(1010  P1(1110
0       0       0       0       3        0        2        0      
                                0 0 3 0           0 0 1 1        
                                    3                 3 0      
P1(0001 P1(0101 P1(1001 P1(1101 P1(0011  P1(0111  P1(1011  P1(1111
0       0       0       0       4        4        0        3      
                                4 0 0 0  0 4 0 0           0 0 0 3 
                                                                 0 


B21  B22  B23  B24
0000 1000 1111 1110 
0000 1000 1111 1110    
1111 0000 1111 1100  
1111 0000 1111 1100 
                   
BASIC_PTREES_band2___________________
P2,1      P2,2      P2,3        P2,4
8         2         16          10
0 0 4 4   2 0 0 0               4 2 4 0
          02                      02        <-positions of the
                                              two 1-bits
VALUE_PTREES_band2___________________
P2(00)    P2(01)    P2(10)      P2(11)
6         2         8           0      
2 4 0 0   2 0 0 0   0 0 4 4               
13        02

P2(000  P2(010  P2(100  P2(110 P2(001   P2(011   P2(101   P2(111
0       0       0       0      6        2        8        0      
                               2 4 0 0  2 0 0 0  0 0 4 4          
                               13       02

P2(0000 P2(0100 P2(1000 P2(1100 P2(0010  P2(0110  P2(1010  P2(1110
0       0       0       0       2        0        4        0      
                                0 2 0 0           0 0 0 4         
                                  13            
P2(0001 P2(0101 P2(1001 P2(1101 P2(0011  P2(0111  P2(1011  P2(1111
0       0       0       0       4        2        4        0      
                                2 2 0 0  2 0 0 0  0 0 4 0         
                                1302     02

B31  B32  B33  B34
1100 0011 0000 0001                           
1100 0011 0000 0001                         
1100 0011 0000 0000                        
1100 0011 0000 0000                       

BASIC_PTREES_band3___________________
P3,1      P3,2      P3,3      P3,4
8         8         0         2 
4 0 4 0   0 4 0 4             0 2 0 0
                                13

VALUE_PTREES_band3___________________
P3(00)    P3(01)    P3(10)    P3(11)
0         8         8         0      
          0 4 0 4   4 0 4 0               

P3(000) P3(010)   P3(100)  P3(110) P3(001) P3(011) P3(101) P3(111)
0       8         8        0       0       0       0       0      
        0 4 0 4   4 0 4 0                                                 

P3(0000 P3(0100   P3(1000  P3(1100 P3(0010 P3(0110 P3(1010 P3(1110
0       6         8        0       0       0       0       0      
        0 2 0 4   4 0 4 0                                                  
          02                                                               
P3(0001 P3(0101  P3(1001 P3(1101 P3(0011 P3(0111 P3(1011 P3(1111
0       2        0       0       0       0       0       0      
        0 2 0 0                                                            
          13


B41  B42  B43  B44
1111 0100 1111 1111         
1111 0000 1111 1111          
1111 1100 1111 1111            
1111 1100 1111 1111             

BASIC_PTREES_band4___________________
P4,1      P4,2      P4,3      P4,4
16        5         16        16
          1 0 4 0                    
          1                         

VALUE_PTREES_band4___________________
P4(00     P4(01     P4(10     P4(11
0         0         11        5      
                    3 4 0 4   1 0 4 0     
                    1         1

P4(000  P4(010  P4(100  P4(110  P4(001  P4(011  P4(101   P4(111
0       0       0       0       0       0       11       5      
                                                3 4 0 4  1 0 4 0
                                                1        1

P4(0000 P4(0100 P4(1000 P4(1100 P4(0010 P4(0110 P4(1010 P4(1110
0       0       0       0       0       0       0       0      
                                                                             
P4(0001 P4(0101 P4(1001 P4(1101 P4(0011 P4(0111 P4(1011   P4(1111
0       0       0       0       0       0       11        5      
                                                3 4 0 4   1 0 4 0
                                                1         1



Suppose we take this relation as training set (4-bit values).
Let B1 be the class label attribute.
Then the classes are:
                      { C1,C2,C3,C4,C5 } =
                      {  2, 3, 7,10,15 } where Ci={ci}.

The ID3 alg for inducing a decision tree from training samples:
S:  
X-Y  B1   B2   B3   B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011

1. Tree starts as one node representing the training samples, S.

2. If all samples are in same class (same B1-value)
   then S becomes a leaf with that class label.     [Not the case here!]

3. Else, use entropy-based, "information gain" as a heuristic for
   selecting the first decision attribute.


Take B2 = {a1,a2,a3,a4,a5} = { 2, 3, 7,10,11 }
           as the first candidate attribute.

        Aj={t:t(B2)=aj}, where a1=0010, a2=0011, a3=0111,
                               a4=1010, a5=1011.

        sij is number of samples of class, Ci, in a subset, Aj.

     so sij = rc( P1(ci)^P2(aj) ), where ci is in {2,3,7,10,15}
                                   and   aj is in {2,3,7,10,11}.

             ++---------+----------+----------+----------+--------
             || P2(2)   | P2(3)    | P2(7)    |P2(10)    |P2(11)      
             || 2       |  4       |  2       |  4       | 4
--.----------|| 0 2 0 0 |  2 2 0 0 |  2 0 0 0 |  0 0 0 4 | 0 0 4 0
ci|  P1(ci)  ||   13    |  1302    |  02      |          |    
==+==========++=========+==========+==========+==========+========
 2|  3       || 0       |  0       |  0       |  0       | 3      
  |  0 0 3 0 ||         |          |          |          | 0 0 3 0
  |      3   ||         |          |          |          |     3  
--+----------++---------+----------+----------+----------+--------
 3|  4       || 0       |  2       |  2       |  0       | 0      
  |  4 0 0 0 ||         |  2 0 0 0 |  2 0 0 0 |          |        
  |          ||         |  13      |  02      |          |        
--+----------++---------+----------+----------+----------+--------
 7|  4       || 2       |  2       |  0       |  0       | 0      
  |  0 4 0 0 || 0 2 0 0 |  0 2 0 0 |          |          |        
  |          ||   13    |    02    |          |          |        
--+----------++---------+----------+----------+----------+--------
10|  2       || 0       |  0       |  0       |  1       | 1      
  |  0 0 1 1 ||         |          |          |  0 0 0 1 | 0 0 1 0
  |      3 0 ||         |          |          |        0 |     3  
--+----------++---------+----------+----------+----------+--------
15|  3       || 0       |  0       |  0       |  3       | 0      
  |  0 0 0 3 ||         |          |          |  0 0 0 3 |        
  |        0 ||         |          |          |        0 |        
--+----------++---------+----------+----------+----------+--------
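
The root counts in the table can be reproduced by ANDing one bit vector per value. In this sketch a flat 16-bit integer stands in for each value P-tree, which is a simplifying assumption: real P-trees are compressed quadrant trees. Pixels are listed row by row from the band grids above.

```python
B1 = [3, 3, 7, 7,   3, 3, 7, 7,   2, 2, 10, 15,     2, 10, 15, 15]
B2 = [7, 3, 3, 2,   7, 3, 3, 2,   11, 11, 10, 10,   11, 11, 10, 10]

def value_mask(band, value):
    """Integer bit vector with a 1 at each pixel position where band == value."""
    mask = 0
    for position, v in enumerate(band):
        if v == value:
            mask |= 1 << position
    return mask

def root_count(mask):
    return bin(mask).count("1")

classes = [2, 3, 7, 10, 15]   # B1 values c1..c5
values = [2, 3, 7, 10, 11]    # B2 values a1..a5

# s[i][j] = rc( P1(ci) ^ P2(aj) )
s = [[root_count(value_mask(B1, c) & value_mask(B2, a)) for a in values]
     for c in classes]
for c, row in zip(classes, s):
    print(c, row)
# 2 [0, 0, 0, 0, 3]
# 3 [0, 2, 2, 0, 0]
# 7 [2, 2, 0, 0, 0]
# 10 [0, 0, 0, 1, 1]
# 15 [0, 0, 0, 3, 0]
```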


EXPECTED INFO needed to classify the sample is:

I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],

m  = 5
s  = 16
si = 3,4,4,2,3   (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16

I  = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
     +3/16*lg2(3/16))

   = -(  -.453          -.5             -.5             -.375
         -.453       )

   = -( -2.281)      =       2.281    



ENTROPY based on the partition into subsets by B2 is

E(B2)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ]   where

Ij = I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj| 
           
Since m=5, the sij's are:
j=1        j=2        j=3        j=4        j=5  
---        ---        ---        ---        ---
0          0          0          0          3       <-- s1j
0          2          2          0          0       <-- s2j
2          2          0          0          0       <-- s3j
0          0          0          1          1       <-- s4j
0          0          0          3          0       <-- s5j
---        ---        ---        ---        ---
2          4          2          4          4       <- s1j+..+s5j



j=1        j=2        j=3        j=4        j=5  
---        ---        ---        ---        ---
2          4          2          4          4       <- |Aj|

where Aj's are the rootcounts of P2(aj)'s.



Therefore,
j=1        j=2        j=3        j=4        j=5  
---        ---        ---        ---        ---
0          0          0          0          .75    <-  p1j
0          .5         1          0          0      <-  p2j
1          .5         0          0          0      <-  p3j
0          0          0          .25        .25    <-  p4j
0          0          0          .75        0      <-  p5j

and

j=1        j=2        j=3        j=4        j=5  
---        ---        ---        ---        ---
0*         0          0          0          -.311 <- p1j*log2(p1j)
0          -.5        0          0          0     <- p2j*log2(p2j)
0          -.5        0          0          0     <- p3j*log2(p3j)
0          0          0          -.5        -.5   <- p4j*log2(p4j)
0          0          0          -.311      0     <- p5j*log2(p5j)
--         ---        ---        -----      ----
0          1          0          .811       .811  <- I(s1j..s5j)
2          4          2          4          4     <- s1j+..+s5j

so that,

0          .25        0          .203       .203  (s1j+..+s5j)*
                                                   I(s1j..s5j)/16

                                           .656  E(B2)
                                          2.281  I(s1..sm)
                            GAIN(B2) - >  1.625  I(s1..sm)-E(B2)


NOTE: ONE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)

Footnote * (If pij = 0 why is pij*log2(pij) = 0?
            Hint: L'Hospital's Rule)



Continuing with B3
---------------------------------------------------------------

Take B3 = {a1,a2,a3} = {4,5,8} as the 2nd candidate attribute.

        Aj={t:t(B3)=aj}, where a1=0100, a2=0101, a3=1000.

        sij is number of samples of class, Ci, in a subset, Aj.

     so sij = rc( P1(ci)^P3(aj) ), where ci is in {2,3,7,10,15}
                                   and   aj is in {4,5,8}.

             ++---------+----------+----------+
             || P3(4)   | P3(5)    | P3(8)    |
             || 6       |  2       |  8       |
-------------|| 0 2 0 4 |  0 2 0 0 |  4 0 4 0 |
ci|  P1(ci)  ||   02    |    13    |          |
==+==========++=========+==========+==========+
 2|  3       || 0       |  0       |  3       |
  |  0 0 3 0 ||         |          |  0 0 3 0 |
  |      3   ||         |          |      3   |
--+----------++---------+----------+----------+
 3|  4       || 0       |  0       |  4       |
  |  4 0 0 0 ||         |          |  4 0 0 0 |
  |          ||         |          |          |
--+----------++---------+----------+----------+
 7|  4       || 2       |  2       |  0       |
  |  0 4 0 0 || 0 2 0 0 |  0 2 0 0 |          |
  |          ||   02    |    13    |          |
--+----------++---------+----------+----------+
10|  2       || 1       |  0       |  1       |
  |  0 0 1 1 || 0 0 0 1 |          |  0 0 1 0 |
  |      3 0 ||       0 |          |      3   |
--+----------++---------+----------+----------+
15|  3       || 3       |  0       |  0       |
  |  0 0 0 3 || 0 0 0 3 |          |          |
  |        0 ||       0 |          |          |
--+----------++---------+----------+----------+


EXPECTED INFO needed to classify the sample is the same as above:

I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],

m  = 5
s  = 16
si = 3,4,4,2,3   (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16

I  = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
     +3/16*lg2(3/16))

   = -(  -.453          -.5             -.5             -.375
         -.453       )

   = -( -2.281)      =       2.281    



ENTROPY based on the partition into subsets by B3 is





E(B3)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ]   where

    I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj| 
           
The sij's are:
j=1       j=2        j=3
---       ---        ---
0          0          3      <-- s1j
0          0          4      <-- s2j
2          2          0      <-- s3j
1          0          1      <-- s4j
3          0          0      <-- s5j
---        ---        ---        
6          2          8      <- s1j+..+s5j

6          2          8      <- |Aj| (divisors)

0          0          .375   <-  p1j
0          0          .5     <-  p2j
.333       1          0      <-  p3j
.167       0          .125   <-  p4j
.5         0          0      <-  p5j

0          0          -.531  <- p1j*log2(p1j)
0          0          -.5    <- p2j*log2(p2j)
-.528      0          0      <- p3j*log2(p3j)
-.431      0          -.375  <- p4j*log2(p4j)
-.5        0          0      <- p5j*log2(p5j)
--         ---        ---    
1.459      0          1.406 <- I(s1j..s5j)=- sum of above
6          2          8      <- s1j+..+s5j

.547       0          .703   <- (s1j+..+s5j)*I(s1j..s5j)/16

                     1.250  <-  E(B3) (sum of above)
                     2.281  <-  I(s1..sm)
          GAIN(B3) > 1.031  <-  I(s1..sm) - E(B3)



Continuing with B4=A={a1..av} used to classify S into {A1..Av},
---------------------------------------------------------------

Take B4 = {a1,a2} = {11,15} as the 3rd candidate attribute.

        Aj={t:t(B4)=aj}, where a1=1011, a2=1111

        sij is number of samples of class, Ci, in a subset, Aj.

     so sij = rc( P1(ci)^P4(aj) ), where ci is in {2,3,7,10,15}
                                   and   aj is in {11,15}.

             ++---------+----------+
             || P4(11)  | P4(15)   |
             || 11      |  5       |
-------------|| 3 4 0 4 |  1 0 4 0 |
ci|  P1(ci)  || 1       |  1       | 
==+==========++=========+==========+
 2|  3       || 0       |  3       |
  |  0 0 3 0 ||         |  0 0 3 0 |
  |      3   ||         |      3   |
--+----------++---------+----------+
 3|  4       || 3       |  1       |
  |  4 0 0 0 || 3 0 0 0 |  1 0 0 0 |
  |          || 1       |  1       |
--+----------++---------+----------+
 7|  4       || 4       |  0       |
  |  0 4 0 0 || 0 4 0 0 |          |
  |          ||         |          |
--+----------++---------+----------+
10|  2       || 1       |  1       |
  |  0 0 1 1 || 0 0 0 1 |  0 0 1 0 |
  |      3 0 ||       0 |      3   |
--+----------++---------+----------+
15|  3       || 3       |  0       |
  |  0 0 0 3 || 0 0 0 3 |          |
  |        0 ||       0 |          |
--+----------++---------+----------+


EXPECTED INFO needed to classify the sample is the same as above:

I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],

m  = 5
s  = 16
si = 3,4,4,2,3   (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16

I  = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
     +3/16*lg2(3/16))

   = -(  -.453          -.5             -.5             -.375
         -.453       )

   = -( -2.281)      =       2.281    



ENTROPY based on the partition into subsets by B4 is





E(B4)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ]   where

    I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj| 
           
The sij's are:
j=1       j=2        
---       ---       
0          3          <-- s1j
3          1          <-- s2j
4          0          <-- s3j
1          1          <-- s4j
3          0          <-- s5j
---        ---        
11         5          <- s1j+..+s5j

11         5          <- |Aj| (divisors)

0          .6         <-  p1j
.273       .2         <-  p2j
.364       0          <-  p3j
.091       .2         <-  p4j
.273       0          <-  p5j

0          -.442      <- p1j*log2(p1j)
-.511      -.464      <- p2j*log2(p2j)
-.531      0          <- p3j*log2(p3j)
-.315      -.464      <- p4j*log2(p4j)
-.511      0          <- p5j*log2(p5j)
--         ---        
1.868      1.37       <- I(s1j..s5j)= - sum of above
11         5          <- s1j+..+s5j

1.284      .428       <- (s1j+..+s5j)*I(s1j..s5j)/16

               1.712  <-  E(B4) (sum of above)
               2.281  <-  I(s1..sm)
    GAIN(B4) >  .568  <-  I(s1..sm) - E(B4)
and
    GAIN(B3) > 1.331  <-  I(s1..sm) - E(B3)
    GAIN(B2) > 1.750  <-  I(s1..sm) - E(B2)


Thus we select B2 as the first level decision attribute.
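
The B4 figures above can be checked with a short sketch.  It works
on the 16 training tuples as plain rows rather than on P-trees (an
assumption made only to keep the example self-contained); the names
ROWS, info, expected_info and gain are illustrative, not part of
the original notes.

```python
from collections import Counter
from math import log2

# Training tuples (B1, B2, B3, B4); B1 is the class label.
ROWS = [tuple(line.split()) for line in """\
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011""".splitlines()]


def info(labels):
    """I(s1..sm) = -SUM(i)[ pi*log2(pi) ] over the class counts."""
    s = len(labels)
    return -sum(c / s * log2(c / s) for c in Counter(labels).values())


def expected_info(rows, attr, cls=0):
    """E(attr): size-weighted info of the subsets Aj induced by attr."""
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r[cls])
    return sum(len(a) / len(rows) * info(a) for a in subsets.values())


def gain(rows, attr, cls=0):
    """GAIN(attr) = I(s1..sm) - E(attr)."""
    return info([r[cls] for r in rows]) - expected_info(rows, attr, cls)


i_all = info([r[0] for r in ROWS])   # I(s1..sm), about 2.281
e_b4 = expected_info(ROWS, 3)        # E(B4), about 1.712
gain_b4 = gain(ROWS, 3)              # GAIN(B4), about 0.568
```

Calling gain(ROWS, 1) or gain(ROWS, 2) evaluates the other two
candidate attributes in the same way.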


NOTE: WE GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)


4. Branches are created for each value of B2 and the samples are
      partitioned accordingly.  (If a partition is empty, generate
      a leaf and label it with the most common class, here C2,
      whose label is 0011.)
                                                             
       .--- B2=0000 - > C2:0011                                       
       |--- B2=0001 - > C2:0011                              
       |--- B2=0010 - > Sample_Set_1                         
       |--- B2=0011 - > Sample_Set_2                         
       |--- B2=0100 - > C2:0011                              
       |--- B2=0101 - > C2:0011                              
       |--- B2=0110 - > C2:0011                              
  B2 --|--- B2=0111 - > Sample_Set_3                         
       |--- B2=1000 - > C2:0011                              
       |--- B2=1001 - > C2:0011                              
       |--- B2=1010 - > Sample_Set_4                         
       |--- B2=1011 - > Sample_Set_5                         
       |--- B2=1100 - > C2:0011                              
       |--- B2=1101 - > C2:0011                              
       |--- B2=1110 - > C2:0011                              
       `--- B2=1111 - > C2:0011                              

Sample_Set_1
X-Y  B1   B3   B4
0,3 0111 0101 1011
1,3 0111 0101 1011

Sample_Set_2
X-Y  B1   B3   B4
0,1 0011 1000 1111
0,2 0111 0100 1011
1,1 0011 1000 1011
1,2 0111 0100 1011

Sample_Set_3
X-Y  B1   B3   B4
0,0 0011 1000 1011
1,0 0011 1000 1011

Sample_Set_4
X-Y  B1   B3   B4
2,2 1010 0100 1011
2,3 1111 0100 1011
3,2 1111 0100 1011
3,3 1111 0100 1011

Sample_Set_5
X-Y  B1   B3   B4
2,0 0010 1000 1111
2,1 0010 1000 1111
3,0 0010 1000 1111
3,1 1010 1000 1111
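
As a rough illustration of step 4, the sketch below partitions the
training tuples on B2 and turns every empty partition into a leaf
carrying the most common class.  It uses plain rows instead of
P-trees, and the helper name branch_on is hypothetical.  Ties for
the most common class are broken by first occurrence, which happens
to give 0011 here.

```python
from collections import Counter, defaultdict

# Training tuples (B1, B2, B3, B4); B1 is the class label.
ROWS = [tuple(line.split()) for line in """\
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011""".splitlines()]


def branch_on(rows, attr, cls=0, bits=4):
    """Return {attr value: sample list, or a class label for an empty branch}."""
    parts = defaultdict(list)
    for r in rows:
        parts[r[attr]].append(r)
    # Most common class overall; used to label empty partitions.
    majority = Counter(r[cls] for r in rows).most_common(1)[0][0]
    branches = {}
    for v in range(2 ** bits):
        key = format(v, "0{}b".format(bits))
        branches[key] = parts[key] if key in parts else majority
    return branches


branches = branch_on(ROWS, attr=1)   # attr=1 is B2
```

Here branches["0011"] holds the four tuples of Sample_Set_2, while
branches["0000"] is just the leaf label "0011".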


NOTE: WE DON'T NEED TO LIST OUT THE SAMPLE_SETS IN ORDER TO CONTINUE


5. The Algorithm recurses to form decision tree for the samples at
   each partition.  Once an attribute is the decision attribute at
   a node, it is not considered further.

6. Stop when:
   a. All samples for a given node belong to the same class or
   b. no remaining attributes
         (label leaf with majority class among the samples)

We note all samples belong to the same class for nodes:
   Sample_Set_1, B2=0010, have class, C3:0111.
   Sample_Set_3, B2=0111, have class, C2:0011.
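
Steps 1 to 6 can be put together in a compact recursive sketch.
This is only an illustration on plain rows (no P-trees); the names
id3, NAMES and ROWS are hypothetical, empty branches are simply
omitted rather than labeled, and ties in gain are broken by
attribute order.

```python
from collections import Counter
from math import log2

NAMES = ("B1", "B2", "B3", "B4")      # B1 is the class label
ROWS = [tuple(line.split()) for line in """\
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011""".splitlines()]


def info(labels):
    s = len(labels)
    return -sum(c / s * log2(c / s) for c in Counter(labels).values())


def gain(rows, attr, cls=0):
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r[cls])
    e = sum(len(a) / len(rows) * info(a) for a in subsets.values())
    return info([r[cls] for r in rows]) - e


def id3(rows, attrs, cls=0):
    """Return a class label (leaf) or (attribute name, {value: subtree})."""
    labels = [r[cls] for r in rows]
    # Stop: one class only, or no attributes left (use majority class).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, cls))
    rest = [a for a in attrs if a != best]   # not considered further
    parts = {}
    for r in rows:
        parts.setdefault(r[best], []).append(r)
    return (NAMES[best], {v: id3(p, rest, cls) for v, p in parts.items()})


tree = id3(ROWS, attrs=[1, 2, 3])
```

The root of the resulting tree tests B2, the branches B2=0010 and
B2=0111 are leaves, and the branch B2=0011 is split further on B3.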


NOTE: ONE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
  One can determine that these Sample_Sets contain only one
  B1 value (class label) from the Ptrees already computed:

             ++---------+----------+----------+----------+--------
             || P2(2)   | P2(3)    | P2(7)    |P2(10)    |P2(11)      
             || 2       |  4       |  2       |  4       | 4
--.----------|| 0 2 0 0 |  2 2 0 0 |  2 0 0 0 |  0 0 0 4 | 0 0 4 0
ci|  P1(ci)  ||   13    |  1302    |  02      |          |    
==+==========++=========+==========+==========+==========+========
 2|  3       || 0       |  0       |  0       |  0       | 3      
  |  0 0 3 0 ||         |          |          |          | 0 0 3 0
  |      3   ||         |          |          |          |     3  
--+----------++---------+----------+----------+----------+--------
 3|  4       || 0       |  2       |  2       |  0       | 0      
  |  4 0 0 0 ||         |  2 0 0 0 |  2 0 0 0 |          |        
  |          ||         |  13      |  02      |          |        
--+----------++---------+----------+----------+----------+--------
 7|  4       || 2       |  2       |  0       |  0       | 0      
  |  0 4 0 0 || 0 2 0 0 |  0 2 0 0 |          |          |        
  |          ||   13    |    02    |          |          |        
--+----------++---------+----------+----------+----------+--------
10|  2       || 0       |  0       |  0       |  1       | 1      
  |  0 0 1 1 ||         |          |          |  0 0 0 1 | 0 0 1 0
  |      3 0 ||         |          |          |        0 |     3  
--+----------++---------+----------+----------+----------+--------
15|  3       || 0       |  0       |  0       |  3       | 0      
  |  0 0 0 3 ||         |          |          |  0 0 0 3 |        
  |        0 ||         |          |          |        0 |        
--+----------++---------+----------+----------+----------+--------



   Thus the decision tree becomes:

       .--- B2=0000 - > C2:0011                                       
       |--- B2=0001 - > C2:0011                              
       |--- B2=0010 - > C3:0111                              
       |--- B2=0011 - > Sample_Set_2                         
       |--- B2=0100 - > C2:0011                              
       |--- B2=0101 - > C2:0011                              
       |--- B2=0110 - > C2:0011                              
  B2 --|--- B2=0111 - > C2:0011                              
       |--- B2=1000 - > C2:0011                              
       |--- B2=1001 - > C2:0011                              
       |--- B2=1010 - > Sample_Set_4                         
       |--- B2=1011 - > Sample_Set_5                         
       |--- B2=1100 - > C2:0011                              
       |--- B2=1101 - > C2:0011                              
       |--- B2=1110 - > C2:0011                              
       `--- B2=1111 - > C2:0011                              

Sample_Set_2 (for B2=0011)
X-Y  B1   B3   B4
0,1 0011 1000 1111
0,2 0111 0100 1011
1,1 0011 1000 1011
1,2 0111 0100 1011

Sample_Set_4 (for B2=1010)
X-Y  B1   B3   B4
2,2 1010 0100 1011
2,3 1111 0100 1011
3,2 1111 0100 1011
3,3 1111 0100 1011

Sample_Set_5 (for B2=1011)
X-Y  B1   B3   B4
2,0 0010 1000 1111
2,1 0010 1000 1111
3,0 0010 1000 1111
3,1 1010 1000 1111





Recursing the algorithm on Sample_Set_2 (B2=0011):


1. Subtree starts as single node, S = Sample_Set_2 (determined
   by B2=0011, so that ANDing with P2(3) gives correct counts).

2. Not all samples are in the same class (same B1-value),

3. So, use entropy-based measure, "information gain" as a
   heuristic for selecting the attribute that will best separate
   the samples into individual classes

NOTE: WE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
We don't have to rescan the training_set to form the leaf
subsample_sets.  We can just use the P-trees, ANDed with P2(3),
for those samples (see 1. above).


Revising from 4. onward then (and expressing SubSampleSets as
revised P-trees):

       .--- B2=0000 - > C2:0011                                       
       |--- B2=0001 - > C2:0011                              
       |--- B2=0010 - > C3:0111                              
       |--- B2=0011 - > Sample_Set_2                         
       |--- B2=0100 - > C2:0011                              
       |--- B2=0101 - > C2:0011                              
       |--- B2=0110 - > C2:0011                              
  B2 --|--- B2=0111 - > C2:0011                              
       |--- B2=1000 - > C2:0011                              
       |--- B2=1001 - > C2:0011                              
       |--- B2=1010 - > Sample_Set_4                         
       |--- B2=1011 - > Sample_Set_5                         
       |--- B2=1100 - > C2:0011                              
       |--- B2=1101 - > C2:0011                              
       |--- B2=1110 - > C2:0011                              
       `--- B2=1111 - > C2:0011                              


For Sample_Set_2 (for B2=0011=3) (only 2 classes have count>0)
-------------++-------+-------+----------++---------+---------.
 \           ||P3(4)  |P3(5)  |P3(8)     ||P4(11)   |P4(15)   |
   \         ||6      |2      |  8       ||11       |5        |
    `-------.||0 2 0 4|0 2 0 0|  4 0 4 0 ||3 4 0 4  |1 0 4 0  |
             \|  02   |  13   |          ||1        |1        |
ci|P1(ci)^P2(3)=======+=======+==========++=========+=========|
 3|  2       ||0      |0      |  2       ||1        |1        |
  |  2 0 0 0 ||       |       |  2 0 0 0 ||1 0 0 0  |1 0 0 0  |
  |  13      ||       |       |  13      ||3        |1        |
--+----------++-------+-------+----------++---------+---------|
 7|  2       ||2      |0      |  0       ||2        |0        |
  |  0 2 0 0 ||0 2 0 0|       |          ||0 2 0 0  |         |
  |    02    ||  02   |       |          ||  02     |         |
--+----------++-------+-------+----------++---------+---------

EXPECTED INFO needed to classify the sample:
I = I(s1,s2) = -SUM(i=1,2)[ pi * log2(pi) ],
m  = 2      s  = 4    (s = rc(P2(3)), the size of Sample_Set_2)
si = 2,2   (rootcounts for the class labels, rc(P1(ci)^P2(3))'s)
pi = si/s = 1/2, 1/2
I  = -(2/4*lg2(2/4) + 2/4*lg2(2/4))
   = -(  -.5          -.5        )    =    1.0

________________________________________________________

ENTROPY based on the partition into subsets by B3 is

Take B3 = {a1,a2,a3} = {4,5,8} as the 1st candidate attribute.

   Aj={t:t(B3)=aj}, where a1=0100, a2=0101, a3=1000,
   sij is number of samples of class, Ci, in a subset, Aj.
   sij=rc(P1(ci)^P2(3)^P3(aj))  where  ci in {3,7}, aj in {4,5,8}

E(B3)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ]   where

    I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj| 
           
The sij's are:
j=1       j=2        j=3
---       ---        ---
0          0          2      <-- s1j
2          0          0      <-- s2j
---        ---        ---        
2          0          2      <- s1j+s2j

2          0          2      <- |Aj| (divisors)

0          undefined  1      <-  p1j
1          undefined  0      <-  p2j

(the undefined terms are dropped; 0*log2(0) is taken as 0)

0                     0      <- p1j*log2(p1j)
0                     0      <- p2j*log2(p2j)
--         ---        ---    
0                     0      <- I(s1j,s2j)=- sum of above
2          0          2      <- s1j+s2j

0          0          0      <- (s1j+s2j)*I(s1j,s2j)/4

                      0     <-  E(B3) (sum of above)
                      1.0   <-  I(s1..sm)
          GAIN(B3) =  1.0   <-  I(s1..sm) - E(B3)



Continuing with B4=A={a1..av} used to classify S into {A1..Av},
---------------------------------------------------------------



Take B4 = {a1,a2} = {11,15} as the 2nd candidate attribute.
        Aj={t:t(B4)=aj}, where a1=1011, a2=1111
        sij is number of samples of class, Ci, in a subset, Aj.
     so sij = rc( P1(ci)^P2(3)^P4(aj) ), where ci is in {3,7}
                                        and   aj is in {11,15}.

The sij's are:
j=1       j=2        
---       ---        
1          1              <-- s1j
2          0              <-- s2j
---        ---                
3          1              <- s1j+s2j

3          1              <- |Aj| (divisors)

.333       1              <-  p1j
.667       0              <-  p2j

-.528      0              <- p1j*log2(p1j)
-.390      0              <- p2j*log2(p2j)
--         ---
.918       0              <- I(s1j,s2j)=- sum of above
3          1              <- s1j+s2j

.689       0              <- (s1j+s2j)*I(s1j,s2j)/4

                      .689  <-  E(B4) (sum of above)
                      1.0   <-  I(s1..sm)
          GAIN(B4) =  .311  <-  I(s1..sm) - E(B4)

          GAIN(B3) =  1.0   <-  I(s1..sm) - E(B3)


So B3 is the decision attribute and so forth.

Note that no database scan has been needed at all!





ID3 DTI   





Bayesian Classification

Bayesian classifiers are statistical classifiers

7.4.1 Bayes Theorem

Let X be a data sample whose class label is unknown.
Let H be a hypothesis (i.e., X belongs to class, C).
P(H|X) is the posterior probability of H given X.
P(H) is the prior probability of H.

Bayes Theorem:
P(H|X) = P(X|H)P(H)/P(X)

7.4.2 Naive Bayesian Classification


1. Each data sample is represented by a feature vector,
X=(x1,..,xn), depicting the measurements made on the sample from
A1,..,An, resp.


2. Given classes, C1,...,Cm, the naive Bayesian Classifier
 predicts that an unknown data sample, X (with no class label),
 belongs to the class, Cj, having the highest posterior
 probability conditioned on X (called the maximum posteriori
 hypothesis), i.e., P(Cj|X) > P(Ci|X) for all i not equal to j.


P(Cj|X) = P(X|Cj)P(Cj)/P(X)



3. P(X) is constant for all classes so we maximize P(X|Cj)P(Cj).
  If we assume equal likelihood of classes, maximize P(X|Cj);
  else P(Ci) is estimated as si/s.

      From the PC-cube we see that s is the overall tuple count
      and si is the rootcount of DRollup[Bcube->C]i

      (thus, assuming C=Bn, it is rc(VPCtree[Ci]) = rc(PCn1* AND
       ... AND PCnm*), where Ci is an m-bit string and there is
        a * for each 0 bit in the string)



4. To reduce the computational complexity of calculating all
 P(X|Cj)'s the naive assumption of conditional independence of
 values is often made (therefore the name "Naive Bayesian"),
 thus, P(X|Ci)=P(x1|Ci)*..*P(xn|Ci).

For categorical attributes, P(xk|Ci)=sixk/si  where sixk= # of
 training samples of class, Ci, having Ak-value xk
 (PCn,Ci ^ PCk,xk, which is one AND program).

For continuous attributes, use Gaussian distribution to estimate
  P(xk|Ci).

  Once the P(xk|Ci)'s are estimated, the model is "trained".



Example:
Consider the training set, S, where B1 is the class label attribute
S:  
 B1   B2   B3   B4
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011

__C1___   __C2___   __C3___   __C4___   __C5___
P1,0010   P1,0011   P1,0111   P1,1010   P1,1111
3         4         4         2         3      
0 0 3 0   4 0 0 0   0 4 0 0   0 0 1 1   0 0 0 3 

                       
P2,0010   P2,0011   P2,0111   P2,1010   P2,1011   
2         4         2         4         4     
0 2 0 0   2 2 0 0   2 0 0 0   0 0 0 4   0 0 4 0          

s1x2=0010 s1x2=0011 s1x2=0111 s1x2=1010 s1x2=1011
    0         0         0         0         1     <-- s1x2/s1
    0         .5        .5        0         0     <-- s2x2/s2
    .5        .5        0         0         0     <-- s3x2/s3
    0         0         0         .5        .5    <-- s4x2/s4
    0         0         0         1         0     <-- s5x2/s5


__C1___   __C2___   __C3___   __C4___   __C5___
P1,0010   P1,0011   P1,0111   P1,1010   P1,1111
3         4         4         2         3      
0 0 3 0   4 0 0 0   0 4 0 0   0 0 1 1   0 0 0 3 

P3,0100   P3,0101   P3,1000                                                  
6         2         8      
0 2 0 4   0 2 0 0   4 0 4 0                                                  

s1x3=0100 s1x3=0101 s1x3=1000                       
    0         0         1                         <-- s1x3/s1
    0         0         1                         <-- s2x3/s2
    .5        .5        0                         <-- s3x3/s3
    .5        0         .5                        <-- s4x3/s4
    1         0         0                         <-- s5x3/s5
                                                                             

__C1___   __C2___   __C3___   __C4___   __C5___
P1,0010   P1,0011   P1,0111   P1,1010   P1,1111
3         4         4         2         3      
0 0 3 0   4 0 0 0   0 4 0 0   0 0 1 1   0 0 0 3 

P4,1011   P4,1111
11        5      
3 4 0 4   1 0 4 0

s1x4=1011 s1x4=1111
    0         1                                   <-- s1x4/s1
    .75       .25                                 <-- s2x4/s2
    1         0                                   <-- s3x4/s3
    .5        .5                                  <-- s4x4/s4
    1         0                                   <-- s5x4/s5
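
The tables above can be reproduced with a small sketch that counts
directly over the training rows instead of ANDing P-trees (an
assumption made for self-containment).  The function name
train_naive_bayes and the key layout (class, attribute index,
value) are illustrative choices; value combinations that never
occur are simply absent and should be read as probability 0.

```python
from collections import Counter
from fractions import Fraction

# Training tuples (B1, B2, B3, B4); B1 is the class label.
ROWS = [tuple(line.split()) for line in """\
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011""".splitlines()]


def train_naive_bayes(rows, cls=0):
    """Return (priors si/s, conditionals sixk/si) as exact fractions."""
    s = len(rows)
    class_counts = Counter(r[cls] for r in rows)          # the si's
    priors = {c: Fraction(n, s) for c, n in class_counts.items()}
    joint = Counter()                                     # the sixk's
    for r in rows:
        for k, value in enumerate(r):
            if k != cls:
                joint[(r[cls], k, value)] += 1
    cond = {key: Fraction(n, class_counts[key[0]])
            for key, n in joint.items()}
    return priors, cond


priors, cond = train_naive_bayes(ROWS)
```

For example, cond[("0011", 3, "1011")] is the .75 entry in the
s2x4/s2 row of the last table.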



5. In order to classify an unknown sample, X, P(X|Ci)P(Ci) is
 evaluated for each i, then X is assigned to the class for which
 it is maximum.  (  Evaluate, P(x1|Ci)*..*P(xn|Ci) * P(Ci)  )

s1x2=0010 s1x2=0011 s1x2=0111 s1x2=1010 s1x2=1011
    0         0         0         0         1     <-- s1x2/s1
    0         .5        .5        0         0     <-- s2x2/s2
    .5        .5        0         0         0     <-- s3x2/s3
    0         0         0         .5        .5    <-- s4x2/s4
    0         0         0         1         0     <-- s5x2/s5
s1x3=0100 s1x3=0101 s1x3=1000                       
    0         0         1                         <-- s1x3/s1
    0         0         1                         <-- s2x3/s2
    .5        .5        0                         <-- s3x3/s3
    .5        0         .5                        <-- s4x3/s4
    1         0         0                         <-- s5x3/s5
s1x4=1011 s1x4=1111
    0         1                                   <-- s1x4/s1
    .75       .25                                 <-- s2x4/s2
    1         0                                   <-- s3x4/s3
    .5        .5                                  <-- s4x4/s4
    1         0                                   <-- s5x4/s5

                           sixk/si's:    si/s  P(X|Ci)*P(Ci)
         x2   x3   x4     ------------   ----  -------------
Take X= 0011 1000 1011    0     1    0   3/16      0
                         1/2    1   3/4  4/16     3/32
                         1/2    0    1   4/16      0
                          0    1/2  1/2  2/16      0
                          0     0    1   3/16      0


So X is classified as C2.

So we see that, once the conditional probabilities, sixk/si, are
 derived from the P-trees, any new sample can be classified
 instantly.
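
A minimal sketch of step 5, again counting over plain rows rather
than P-trees.  The function name classify is hypothetical; exact
fractions are used so the score for C2 comes out as 3/32.

```python
from collections import Counter
from fractions import Fraction

# Training tuples (B1, B2, B3, B4); B1 is the class label.
ROWS = [tuple(line.split()) for line in """\
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011""".splitlines()]


def classify(x, rows, cls=0):
    """Return (best class, {class: P(x1|Ci)*..*P(xn|Ci)*P(Ci)})."""
    s = len(rows)
    feature_idx = [k for k in range(len(rows[0])) if k != cls]
    scores = {}
    for c, si in Counter(r[cls] for r in rows).items():
        score = Fraction(si, s)                        # P(Ci) = si/s
        for k, xk in zip(feature_idx, x):
            sixk = sum(1 for r in rows if r[cls] == c and r[k] == xk)
            score *= Fraction(sixk, si)                # P(xk|Ci)
        scores[c] = score
    return max(scores, key=scores.get), scores


label, scores = classify(("0011", "1000", "1011"), ROWS)
```

With X = 0011 1000 1011 the only non-zero score belongs to class
0011 (C2), matching the table above.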



How effective are Naive Bayesian Classifiers?

  - In theory, Bayesian classifiers have the minimum error rate in
    comparison to all other classifiers.

  - In practice this is not always true, because the assumptions
    (e.g., class conditional independence) may not be valid.

  - Various empirical studies have found Naive Bayesian
    Classifiers to be comparable to decision tree and neural
    network classifiers in many domains.

  - They also provide a theoretical justification for other
    classifiers that do not explicitly use Bayes Theorem
    (e.g., under certain assumptions it can be shown that NN and
    curve-fitting algorithms (e.g., ID3) output the "maximum
    posteriori hypothesis", as does the Naive Bayesian Classifier).




7.4.3 Bayesian Belief Networks (to handle cases where the naive
      assumption doesn't hold)

   - The Naive Assumption of "class conditional independence"
     (given the class label of a sample, the values of the
      attributes are conditionally independent of one another)
      allows use of the simplifying formula:
      P(X|Ci)=P(x1|Ci)*..*P(xn|Ci).  When the assumption holds,
      the Naive Bayesian Classifier is the most accurate
      classifier of all.

   - In practice dependencies can exist between attributes
     (variables).

      - In spatial datasets, one approach would be to select out
        attributes that are independent and then use Naive
        Bayesian Classifiers.  (e.g., select RIR and leave out
        G since there is correlation between them)

   - However, with PC-trees we have a way to calculate P(X|Ci)
     directly.

     It is the AND of the tuple PC-tree for X with the value
     PC-tree for Ci (noting that X is a tuple in Rel[X]
     not Rel - eliminating Coord and C)


   - Bayesian Belief Networks specify the joint conditional
     probabilities and allow class conditional independence
     to be defined between subsets of attributes (variables)
     namely those subsets that are conditionally
     independent of one another.

      - Note that the notion of functional dependence in
        normalizing relations
        is a specification of conditional dependence.


A Belief Network (or Bayesian Belief Network or Bayesian
  Network or Probabilistic Network) is composed of two
  components,


1.  a directed acyclic graph (nodes=random variables, which may
    be actual attributes or "hidden variables" such as a medical
    syndrome in medical data; edges=probabilistic dependencies)

      - each variable is conditionally independent of its
        non-descendants, given its parents.

2.  a Conditional Probability Table (CPT) for each variable,
    Z, specifying all P(Z|Parents(Z)).


7.4'  Non-Naive Bayesian Classifier (New section, shortcut to
      Bayesian Belief Net for spatial data with Ptrees).

We can use Bayesian Classification directly, without the
   simplifying Naive assumption that
   P(X|Ci)=P(x1|Ci)*..*P(xn|Ci), since we can compute the actual
   P(X|Ci) directly (in fact it is a simpler program than the
   above) as:  rc( TPC(X) ^ VPC(Ci) ) / si.

We do not need Bayesian belief networks to estimate these numbers!
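
As a sketch of the non-naive computation, P(X|Ci) can be estimated
by counting the training tuples of class Ci whose whole feature
tuple equals X.  Plain row counting stands in here for the
rootcount of the PC-tree AND (an assumption of this example), and
the name classify_joint is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Training tuples (B1, B2, B3, B4); B1 is the class label.
ROWS = [tuple(line.split()) for line in """\
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011""".splitlines()]


def classify_joint(x, rows, cls=0):
    """Return (best class, {class: P(X|Ci)*P(Ci)}) without the naive assumption."""
    s = len(rows)
    x = tuple(x)
    scores = {}
    for c, si in Counter(r[cls] for r in rows).items():
        # Number of class-c tuples that agree with X on every feature.
        matches = sum(
            1 for r in rows
            if r[cls] == c
            and tuple(v for k, v in enumerate(r) if k != cls) == x)
        scores[c] = Fraction(matches, si) * Fraction(si, s)
    return max(scores, key=scores.get), scores


label, scores = classify_joint(("0011", "1000", "1011"), ROWS)
```

For X = 0011 1000 1011 exactly one training tuple of class 0011
matches, so the score is (1/4)*(4/16) = 1/16 and all other classes
score 0.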



Bayesian Classifiers   



Classification by Backpropagation

   - A Neural network is a set of connected input/output units
      - Each connection has a weight
      - In the learning phase, adjusts weights to learn to
        predict class of input samples.

   - Backpropagation is a particular Neural Network learning alg
      - It operates on a "multilayer feed-forward network"

Multi-layer Feed-Forward Neural Network


          Input          Hidden          Output
          layer          layer           layer

         .----.          .----.          .----.
 x1      |    |----------|    |----------|    |- >
         `----'-.      .-`----'-.      .-`----'
               \ `-..-' /      \ `-..-' /       
         .----..\-'  `-/..----..\-'  `-/..----.
 x2      |    |--\----/--|    |--\----/--|    |- >
         `----'\  \  /  /`----'\  \  /  /`----'
           .    \  \/  /   .    \  \/  /   .   
                 `./\.'          `./\.'         
           .      /\/\     .      /\/\     .     
                 / /\ \          / /\ \           
           .    /.'  `.\   .    /.'  `.\   .       
         .----./'      `\.----./'      `\.----.
 xi      |    |----------|    |----------|    |- >
         `----'   wij    `----'    wjk   `----'
                           Oj              Ok

   - Inputs correspond to attributes from training samples.

   - Weighted outputs of the Input units are fed to the Hidden
     units (there may be more than one Hidden layer).

   - Weighted outputs of last Hidden layer's units are fed to the
     Output units.

   - Output units emit the network's prediction for the given
     samples.

   - Hidden and Output units are often referred to as "neurodes"

   - An n-layer NN has n layers other than the Input layer
       (includes Hidden and Output).

   - NN is "feed-forward" since none of the weights cycle back to
     an input unit or output unit of a previous layer.


Defining a Network Topology

   - Specify the number of Input units
   - Specify the number of Hidden layers
   - Specify the number of Hidden units in each Hidden layer
   - Specify the number of Output units

   - Normalizing the input values for each attribute in the
     training set speeds up training.

Backpropagation

   - learns by iteratively,

     -  processing a set of training samples,

     -  comparing the network's prediction for each sample with
        the actual known class label.

     -  For each training sample, weights are modified to minimize
        mean-square error between the network's prediction and the
        actual class.

     -  These modifications are made in a "backwards" direction,
        from Output layer, through each Hidden layer to the Input
        layer.

     -  The weights will (usually) eventually converge, and the
        learning process stops.


The Backpropagation Algorithm:

(1) Initialize all weights and biases in "network";
(2) while terminating condition is not satisfied {
(3)   for each training sample X in "samples" {
(4)       // Propagate the inputs forward:
(5)       for each hidden or output layer unit j {
(6)           Ij = SUM(i)[ wij*Oi ] + THETAj;
                  // compute the net input of unit j wrt the
                  // previous layer, i
(7)           Oj = 1/(1+e^(-Ij)); }  // compute the output of
                                     // each unit, j
(8)       // Backpropagate the errors:
(9)       for each unit j in the output layer
(10)         Errj = Oj(1-Oj)(Tj-Oj);  // compute the error
(11)      for each unit j in the hidden layers, from last to
          1st hidden layer
(12)         Errj = Oj(1-Oj)*SUM(k)[ Errk*wjk ];
                  // compute the error with respect to the next
                  // higher layer, k
(13)      for each weight wij in "network" {
(14)         DELTAwij = (l)*Errj*Oi;  // weight increment
(15)         wij = wij + DELTAwij; }  // weight update
(16)      for each bias THETAj in "network" {
(17)         DELTA(THETAj) = (l)*Errj;  // bias increment
(18)         THETAj = THETAj + DELTA(THETAj); }  // bias update
(19)      }}
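
One pass of steps (4)-(18) for a single training sample might look
like the following sketch.  It assumes sigmoid units throughout and
case updating; the function names, the list-of-(W, theta) layout,
and the small 3-2-1 network with its weights, biases and learning
rate 0.9 are arbitrary illustrative choices, not values from these
notes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, layers):
    """layers: list of (W, theta); W[i][j] links unit i to unit j."""
    outputs = [list(x)]
    for W, theta in layers:
        prev = outputs[-1]
        outputs.append([
            sigmoid(sum(W[i][j] * prev[i] for i in range(len(prev)))
                    + theta[j])
            for j in range(len(theta))])
    return outputs


def backprop_step(x, target, layers, rate):
    """Update weights and biases in place for one sample (case updating)."""
    outputs = forward(x, layers)
    # Output layer: Errj = Oj(1-Oj)(Tj-Oj)
    errs = [[o * (1 - o) * (t - o) for o, t in zip(outputs[-1], target)]]
    # Hidden layers, last to first: Errj = Oj(1-Oj)*SUM(k)[Errk*wjk]
    for n in range(len(layers) - 1, 0, -1):
        W_next, o_hid = layers[n][0], outputs[n]
        errs.insert(0, [
            o_hid[j] * (1 - o_hid[j])
            * sum(errs[0][k] * W_next[j][k] for k in range(len(errs[0])))
            for j in range(len(o_hid))])
    # wij = wij + (l)*Errj*Oi  and  THETAj = THETAj + (l)*Errj
    for n, (W, theta) in enumerate(layers):
        prev = outputs[n]
        for j in range(len(theta)):
            for i in range(len(prev)):
                W[i][j] += rate * errs[n][j] * prev[i]
            theta[j] += rate * errs[n][j]
    return outputs, errs


layers = [
    ([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]], [-0.4, 0.2]),  # input -> hidden
    ([[-0.3], [-0.2]], [0.1]),                              # hidden -> output
]
outputs, errs = backprop_step([1, 0, 1], [1], layers, rate=0.9)
```

With these numbers the network output before the update is about
0.474, the output error is about 0.131, and the first
hidden-to-output weight moves from -0.3 to about -0.261.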


(1) The weights in the network are initialized to small
      random numbers (e.g., from -1.0 to 1.0, or from -0.5 to 0.5).

Each unit has a "bias", also initialized to a small random num.


For the jth layer (as inputs to the jth layer, where the Input
    layer is 0) j=1..m (m+1 layers total, including input):

| w11 w12... w1n1 | 
| w21 w22... w2n1 |
| w31 w32... w3n1 |
|  .              | = W1
|  .              |
|  .              |
|wn01 wn02.. wn0n1|

| z(1)1 | 
| z(1)2 | 
| .     |  = Z1
| .     | 
| z(1)n1| 

etc.

(4)-(7) Net input to Hidden/Output unit, j:
    Ij=SUM(i)[wij*Oi]+zj, where wij is the weight of the
    connection from unit, i, in the previous layer to unit, j;
    Oi is the output of unit i; zj is the bias of unit j
    (THETAj in the algorithm; a threshold that varies the
    activity of the unit).


Each unit takes its net input and applies an "activation
function" to it,

      - symbolizing the activation of the neuron

      - the logistic or sigmoid function (or "squashing" fctn,
        since it maps a large input domain into [0,1]) is used:
        given net input, Ij, to unit j, the output of unit j
        is: O'j = 1/(1+e^-Ij)

      - the logistic fctn is nonlinear and differentiable,
        allowing the backprop algorithm to model classification
        problems that are linearly inseparable


So the output, O'j, of unit-j, given
  - output from previous-layer, unit-i of Oi,
  - connection weight, wij,
  - bias zj, is:

               | w11 w12 ... w1nj |
               | w21 w22 ... w2nj |
               | w31 w32 ... w3nj |
(O1 O2..Onj-1) |  .               | + ( z1 z2 ... znj )
               |  .               |
               |  .               |       = ( I1 ... Inj )
               |wnj-11...wnj-1,nj |


and at each layer,

Oj = f(Ij) = 1 / ( 1 + e^-( SUM(i)[wij*Oi] + zj ) )

We will write it using matrix notation as follows:



At layer j, the
  outputs from the previous layer are O(j-1)
                     weights are Wj
                     biases  are Zj
                     inputs  are Ij
                     outputs are Oj (after applying activation
                                     fctn)


  O(j-1)*Wj+Zj =>  Ij  =>  Oj=f(Ij)
              

(8)-(18) The error is propagated backwards once the output of
  the Output layer is computed, to update weights and biases.
  For Output unit, Om, Errm=Om(1-Om)(Tm-Om);  Om is "actual"
  output and Tm is the "true" output based on the known class
  label of the training sample

Noting that for f(x)= 1/(1+e^-x),     f'(x)= e^-x / (1+e^-x)^2

   and f(x)*(1-f(x))= 1/(1+e^-x) * (1 - 1/(1+e^-x)) =
                                             e^-x / (1+e^-x)^2

we see that we are just using a straight line assumption as to
  the input DELTA value since DELTA(x) = y' * DELTA(y),
  where y'=Om(1-Om) and  DELTA(y) = (Tm-Om)
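
The identity f'(x) = f(x)*(1-f(x)) for the logistic function
f(x) = 1/(1+e^-x) can be checked numerically with a short sketch
(the function names are illustrative):

```python
import math


def f(x):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))


def f_prime_closed(x):
    """Closed form of the derivative: e^-x / (1+e^-x)^2."""
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2


def f_prime_numeric(x, h=1e-6):
    """Central-difference approximation of the derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


checks = [(x, f(x) * (1 - f(x)), f_prime_closed(x), f_prime_numeric(x))
          for x in (-2.0, 0.0, 1.5)]
```

Each tuple in checks holds x followed by three values that agree up
to rounding; at x = 0 all three are 0.25.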

For the error of a unit j in a Hidden layer, use the weighted sum
  of the errors of the units connected to j in the next layer:
                               Errj = Oj(1-Oj)*SUM(k)[Errk*wjk]

 where wjk=weight of connection from unit-j to a unit-k in the
 next higher layer and Errk is the error of unit-k.

Weights are updated:                  DELTAwij = (l) * Errj * Oi
        and wij = wij + DELTAwij
        where l=learning rate, a constant, typically in (0,1).

   - Backpropagation learns using a method of gradient descent
     to search for a set of weights that can model the given
     classification problem so as to minimize the mean squared
     distance between the network's class prediction and the
     actual class label of samples.

     The learning rate helps to avoid getting stuck at a local
     minimum in decision space; if too low, learning is very slow.
     If too high, thrashing between suboptimals can occur.
     A rule of thumb is to set the learning rate to 1/t
     where t=number of iterations through the training set so far.


Biases are updated:                   DELTA(zj) = (l) * Errj
        and zj = zj + DELTA(zj)

Here we are updating the weights and biases after presentation of
 each sample (case updating).  Alternatively, weight and bias
 updates (DELTAs) can be accumulated in variables so that
 updating can be applied after the entire training set has been
 presented (epoch updating).

(one iteration through the training set is an epoch)

In theory (mathematical) epoch updating is better, yet in
 practice, case updating is more common since it tends to yield
 more accurate results.


(2)-(3) Training stops when either

   - all DELTAwij in the previous epoch were so small as to be
     below some threshold or

   - the % of samples misclassified in the previous epoch is
     below some threshold or

   - a pre-specified number of epochs has expired.

(in practice several hundred thousand epochs may be required.)
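
The three stopping conditions can be combined into one test that is
evaluated at the end of every epoch.  This is only a sketch: the
function name and the default thresholds are hypothetical and would
need tuning for a real problem.

```python
def training_should_stop(max_abs_delta_w, misclassified_fraction, epoch,
                         delta_threshold=1e-4, error_threshold=0.01,
                         max_epochs=100000):
    """Return True if any of the three stopping conditions holds.

    max_abs_delta_w        largest |DELTAwij| seen in the previous epoch
    misclassified_fraction fraction of samples misclassified in that epoch
    epoch                  number of epochs completed so far
    """
    return (max_abs_delta_w < delta_threshold
            or misclassified_fraction < error_threshold
            or epoch >= max_epochs)


stop_small_updates = training_should_stop(1e-6, 0.30, epoch=12)
stop_low_error = training_should_stop(0.05, 0.005, epoch=12)
keep_going = training_should_stop(0.05, 0.30, epoch=12)
```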

          Input          Hidden          Output
         .----.          .----.          .----.
 x1      | x1 |----------| X1 |----------| y1 |- > y1
         `----'-.      .-`----'-.      .-`----'
               \ `-..-' /      \ `-..-' /       
         .----..\-'  `-/..----..\-'  `-/..----.
 x2      | x2 |--\----/--| X2 |--\----/--| y2 |- > y2
         `----'\  \  /  /`----'\  \  /  /`----'
           .    \  \/  /   .    \  \/  /   .   
                 `./\.'          `./\.'         
           .      /\/\     .      /\/\     .     
                 / /\ \          / /\ \           
           .    /.'  `.\   .    /.'  `.\   .       
         .----./'      `\.----./'      `\.----.
 xI      | xI |----------| XJ |----------| yK |- > yK
         `----'   wij  zj`----'    Wjk ZK`----'


(x1..xI)*|w11..w1J|+|z1|=>f=>(X1..XJ)*|W11..W1K|+|Z1|=>f=>(y1..yK)   
         | .    . | |. |              |.    .  | |. |      
         |wI1..wIJ| |zJ|              |WJ1..WJK| |ZK|      

**************************************************************







Other Classification Methods

k-nearest Neighbor Classifiers (based on learning by analogy)

- an unknown sample is assigned to the most common class among
   its k-nearest neighbors in n-space.

- instance based.

- lazy or "as you go" learner (by contrast to decision trees where
  the classifier is constructed before new samples are considered)

- With respect to spatial data in REL organization, if B1 is the
  class label attribute, what should be meant by the k-nearest
  ngbrs?

- Let's assume there is one REL dataset for learning and the new
  samples are separate from it (e.g., for RGBY data, take the
  point of view that we use last year's dataset with RGB and Y
  to train and are interested in classifying this year's RGB data
  to predict the Y).
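
For comparison, the traditional (scan-based) k-nearest neighbor
classifier described in the bullets above can be sketched as follows.
The names knn_classify and manhattan are illustrative only, and the
unweighted Manhattan distance is just one possible choice.

```python
from collections import Counter

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def knn_classify(training, x, k, dist=manhattan):
    """training: list of (features, label) pairs.

    Lazy learner: nothing is built ahead of time; the training
    set is scanned when a new sample x arrives.
    """
    nearest = sorted(training, key=lambda t: dist(t[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]   # majority class among the k
```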



A Spatial k-nearest ngbr algorithm

Assume we have basic Ptrees for the training set.

We find the k-nearest ngbrs to a new sample, x, and then
   predict the class of x to be the majority class among those
   k ngbrs.

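So we will find the closest k (or more) training tuples, based on
  a weighted Manhattan distance on the non-class attribute values
  (e.g., if B1 is the Class label attribute,
  wm_dis(x,y) = SUM(i=2..n)[wi*|yi-xi|], where each wi >= 0 is the
  weight of band Bi).

Let Px be the Ptree of the training tuples that match x on every
  non-class attribute value.

1. If rc(Px) >= k done.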
   (class label is the one that gives the max rootcount when its
   Ptree is ANDed with Px - i.e., we compute rc(Px^Pci) for each
   class label, ci and assign the one that gives max rootcount.)

2. If rc(Px) < k, remove the lowest-order bit from the
   highest-weight band value of x,
   (we will call the resulting tuple, x also - since it is just
    the original tuple x, with its Bi-value generalized one
    level up the value concept hierarchy to a 7-bit value instead
    of an 8-bit value).

Repeat 1 and 2 until rc(Px) >= k

(note, when we have removed the low order or 8th bit from all
 of the non-class-label attributes of x, we proceed to removing
 the 7th bit one attribute at a time, then the 6th bit and so
 forth.)
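
Steps 1 and 2, with the bit-removal order from the note above, can be
sketched without Ptrees by letting a plain list scan stand in for
rc(Px).  The function name generalized_neighbors and its arguments
are assumptions made for illustration.

```python
def generalized_neighbors(training, x, k, order, bits=8):
    """training: list of tuples of non-class attribute values (ints).
    x:     the new sample, same layout.
    order: attribute indices, highest weight first.
    Returns (indices of the ngbr set, bits dropped per attribute).
    """
    dropped = [0] * len(x)

    def matches(t):
        # t agrees with x on the bits of each attribute still kept
        return all((t[i] >> dropped[i]) == (x[i] >> dropped[i])
                   for i in range(len(x)))

    while True:
        nbrs = [j for j, t in enumerate(training) if matches(t)]
        if len(nbrs) >= k:              # step 1: rc(Px) >= k, done
            return nbrs, dropped
        level = min(dropped)
        if level == bits:               # nothing left to remove
            return nbrs, dropped
        # step 2: drop one more low-order bit, one attribute at a
        # time, finishing a bit level before starting the next
        nxt = next(i for i in order if dropped[i] == level)
        dropped[nxt] += 1
```

The class of x would then be the majority class among the returned
tuples.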

(note, we can decide to remove several bits at a time so as to
 reduce the complexity.  We may get a ngbr set that has many more
 than k ngbrs in it but that shouldn't be a problem.  If for some
 reason it seems important to get the smallest ngbr set that
 qualifies (closest to k) rather than ordering the attributes by
 "importance" we could calculate the ngbr set size for each
 attribute during each "bit removal pass" and pick the one that
 gives us the best ngbrset...  Lots of variations are possible.)

(note, while calculating the rc's above it would make sense to
 have an accumulator for the rc's of each attribute value
 for several of the passes (8-bit, 7-bit, ...  values).

 This can be done with a single scan parallel program (lots of
 accumulators however).  This gives us maximum flexibility in
 deciding the best ngbrset.  We could also be computing the
 Px^Pci rootcounts during this one single scan pass).

(Note, in the event that we get through 1-bit values without the
 ngbrset reaching size k (could that happen?  How?  And if so,
 what could be done about it?), we could resort to the
 traditional training set scan to classify that particular new
 sample.)
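
The traditional training set scan mentioned as a fallback could use
the weighted Manhattan distance directly.  A minimal sketch, with
hypothetical names wm_dis and scan_k_nearest; x, y and w hold only
the non-class attributes and their weights.

```python
def wm_dis(x, y, w):
    """Weighted Manhattan distance SUM_i wi*|yi - xi|."""
    return sum(wi * abs(yi - xi) for wi, xi, yi in zip(w, x, y))

def scan_k_nearest(training, x, k, w):
    """Indices of the k training tuples closest to x (one full scan)."""
    ranked = sorted(range(len(training)),
                    key=lambda j: wm_dis(x, training[j], w))
    return ranked[:k]
```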


Example:
Training Dataset (B1 is the class label attribute and k=5):
X-Y  B1   B2   B3   B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
Consider the new sample:  x = ---- 1011 1000 1111

The basic PC_trees in PQ-list form:
PQ11:  23 3
PQ12:  1 31 32 33
PQ13:  pure
PQ14:  0 1 31 32 33

PQ21: 2 3
PQ22: 00 02
PQ23: 0 1 2 3
PQ24: 0 10 12 2

PQ31: 0 2
PQ32: 1 3
PQ33: null
PQ34: 11 13

PQ41: pure
PQ42: 01 2
PQ43: pure
PQ44: pure

Assume the weights order the bands from high-to-low B2,B3,B4
Consider the new sample:  x = ---- 1011 1000 1111
C = {0010 0011 0111 1010 1111} (class labels)
The needed PQ-seq's are:
PQ1,0010: 20 21 22
PQ1,0011: 0
PQ1,0111: 1
PQ1,1010: 23 30
PQ1,1111: 31 32 33

PQ2,1011: 2
PQ3,1000: 0 2 
PQ4,1111: 01 2

1. If rc(Px) >= k done.  (class is s.t.  rc(Px^Pci) is max.)

   Px: 2  rc(Px)=4  NOT >= k=5

2. If rc(Px) <  k, remove the loworder bit from the next band value...
   Take off the loworder bit from B2:

PQ2,101 : 2 3  (gives the same result for Px so do same with B3)
PQ3,100 : 0 2  (gives the same result)
PQ4,111 : 01 2  (gives the same result)

next loworder bit removal:
PQ2,10 : 2 3  (gives the same result for Px so do same with B3)
PQ3,10 : 0 2  (gives the same result)
PQ4,11 : 01 2  (gives the same result)

next loworder bit removal:
PQ2,1 : 2 3  (gives the same result for Px so do same with B3)
PQ3,1 : 0 2  (gives the same result)
PQ4,1 : pure  
------------
PQx:    2    (gives the same result)

next loworder bit removal:
PQ2,- : pure  (last B2 bit removed, so B2 no longer restricts Px)
PQ3,1 : 0 2
PQ4,1 : pure  
------------
PQx:    0 2  has rc = 8 >= 5.

Since rc(Px) >= k, the class is the ci s.t. rc(Px^Pci) is max:

PQ1,0010: 20 21 22
PQx:      0 2
--------------
          20 21 22   rc= 3

PQ1,0011: 0
PQx:      0 2
--------------
          0          rc= 4

PQ1,0111: 1
PQx:      0 2
--------------
          null       rc= 0

PQ1,1010: 23 30
PQx:      0 2
--------------
          23         rc= 1

PQ1,1111: 31 32 33
PQx:      0 2
--------------
          null       rc= 0

Thus, the class label for x is 0011
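
The rootcounts in this example can be checked with a short script in
which each Ptree is replaced by a 16-bit mask (one bit per training
tuple), ANDing is the & operator, and rc is a popcount.  This is only
a stand-in for the PQ-list arithmetic; the names ptree and rc are
illustrative.

```python
# Training table from the example, columns B1 B2 B3 B4, rows 0,0 .. 3,3
TABLE = """0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011"""
DATA = [tuple(int(b, 2) for b in row.split())
        for row in TABLE.split("\n") if row]

def ptree(pred):
    """Bit j is 1 when training tuple j satisfies pred."""
    return sum(1 << j for j, t in enumerate(DATA) if pred(t))

def rc(p):
    """Rootcount: the number of 1-bits."""
    return bin(p).count("1")

# Exact match on x = ---- 1011 1000 1111
exact = ptree(lambda t: t[1:] == (0b1011, 0b1000, 0b1111))

# Final Px of the example: B2 unrestricted, high bit of B3 and B4 = 1
px = ptree(lambda t: t[2] >> 3 == 1 and t[3] >> 3 == 1)

counts = {}
for c in sorted({t[0] for t in DATA}):
    counts[format(c, "04b")] = rc(px & ptree(lambda t: t[0] == c))
label = max(counts, key=counts.get)
print(rc(exact), rc(px), counts, label)
```

The script reports rc = 4 for the exact match, rc = 8 for the final
Px, and class counts 3, 4, 0, 1, 0, so the label is 0011 as above.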

*********Notes *********************************************
Problems?

1. Consider the problem of samples that are positioned in
 large numbers right near a quadrant boundary, so that
 they have ngbrs which don't appear to be ngbrs in the Ptree.
  (This may not be a problem, since we are dealing
   with whole values.  The real problem is 2.)

2. For a value like, 0111, note that it is at the edge,
   not the middle of the intervals,
   [0110,0111],
   [0100,0111],
   which are the ngbrhds used when removing the first 2
   low-order bits (note that the same thing happens with 1111
   but it is inevitably at the edge of all ngbrhds,
   while 0111 is not.)

   Better: let the 1st ngbrhd be [0110,1000] = [6,8]
           and the 2nd           [0100,1001] = [4,9].

   Or even better, 1st: [0110,1000] = [6,8],
                   2nd: [0101,1001] = [5,9].

[0111,0111]  [7,7]
[0110,1000]  [6,8]
[0101,1001]  [5,9]
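
The two kinds of ngbrhd can be compared with a small sketch.  The
function names are made up here; bits=4 matches the 4-bit values of
the example.

```python
def truncation_interval(v, dropped):
    """Interval of values sharing v's bits once `dropped`
    low-order bits are removed."""
    lo = (v >> dropped) << dropped
    return lo, lo + (1 << dropped) - 1

def balanced_interval(v, r, bits=4):
    """Interval of radius r centered on v, clipped to the range
    of a `bits`-bit value."""
    return max(0, v - r), min((1 << bits) - 1, v + r)
```

For v = 0111, truncation_interval gives (6, 7) and (4, 7), with v
sitting at the edge, while balanced_interval gives (6, 8) and (5, 9),
with v in the middle.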

Question:
In removing a loworder bit, can it be accomplished by ORing? e.g.,

To get:
PQ2,101 : 2 3

can we just OR:

PQ2,1011: 2
PQ24':    11 13 3
OR---------------
          11 13 2 3

where PQ24' is the comp of PQ24: 0 10 12 2
apparently not!

Note:
P2,101  =  P2,1010 v P2,1011 = (P2,101 ^ P24') v P2,1011 =
                     = (P2,101 v P2,1011) ^ (P24' v P2,1011)
                     =  P2,101            ^ (P24' v P2,1011) 
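
The ORing question and the identity in the note can be checked on the
B2 column of the example, again with bit masks standing in for the
Ptrees (an assumption made for illustration, as are the names).

```python
# B2 values of the 16 training tuples, rows 0,0 .. 3,3
B2 = [7, 3, 3, 2, 7, 3, 3, 2, 11, 11, 10, 10, 11, 11, 10, 10]
ALL = (1 << len(B2)) - 1

def mask(pred):
    """Bit j is 1 when the B2 value of tuple j satisfies pred."""
    return sum(1 << j for j, v in enumerate(B2) if pred(v))

p2_1011 = mask(lambda v: v == 0b1011)
p2_1010 = mask(lambda v: v == 0b1010)
p2_101 = mask(lambda v: v >> 1 == 0b101)   # low-order bit removed
p24_comp = ALL & ~mask(lambda v: v & 1 == 1)   # complement of P24

or_attempt = p2_1011 | p24_comp            # the OR tried above
print(or_attempt == p2_101)                # False: picks up extra tuples
print((p2_1010 | p2_1011) == p2_101)       # True
print(p2_101 == p2_101 & (p24_comp | p2_1011))   # True
```

The OR with P24' also turns on tuples whose B2 value is 0010, which
is why it does not give P2,101.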

It's clear there is no way to construct, e.g., P21 from P2,11
 and the basic, P22 or its comp, since P2,11 is 1 where both
 P21 and P22 are 1.  Knowing where P22 is 1 doesn't tell me
 which of the pixels for which P22 is 0 have a 1 in P21.

That is to say, a 0 in P2,11 where P22 is also 0 tells me
 nothing about P21 at those pixels (it could be 0 or 1).

Therefore we need to retain all info on a subcube as we go to
 avoid further ANDing:

So, to answer the classification question (using our "nearest
 ngbr" like approach) we need to have filled in a cube:

Consider, again, the new sample:  x = ---- 1011 1000 1111 and
C = {0010 0011 0111 1010 1111} (class labels).

We need the cube bounded by all of 5 B1-values
 (the entire B1 dimension) and
P2,
   1011            [11,11]
   101   1100      [10,12]
   1001  1101      [ 9,13]
   1     111       [ 8,14]
   0111  1111      [ 7,15]
   0110            [ 6,15]
   0101            [ 5,15]
   01              [ 4,15]
   0011            [ 3,15]
   0010            [ 2,15]
   0001            [ 1,15]

Of these, the ones we see in the basic algorithm
 (removal of loworder bits) are:
P2,
   1011            [11,11]
   101             [10,11] not seen above
   10              [ 8,11] not seen above
   1               [ 8,15] not seen above

If we also include those needed to balance the intervals:
P2,
   1011            [11,11]
   101  1100       [10,12] not seen above
   10   111        [ 8,14] not seen above
   1               [ 8,15] not seen above

How far out should the intervals go before we stop
    (and consider the new sample an outlier - at which point
     we take the majority class of the ngbr-set, if there is
     one, else take the majority class of the sample space)?

   - One thought would stop after Radius =
                 ROOF{SQRT(|S|) / ROOF[SQRT(|S|/k)]}

     Rationale:  If the samples are uniformly distributed with
     duplicity=k, each duplicity group would be at an
     intersection of grid lines with the above spacing.

   - SQRT(|S|/k) = SQRT(16/5) = 1.79, roof is 2.
     so R = 4 / 2 = 2
P2,
   1011            [11,11]
   101  1100       [10,12]
   10   111        [ 8,14]

P3,
   1000            [ 8, 8]
        0111  1001 [ 7, 9]
        011   1010 [ 6,10]

P4,
   1111            [15,15]
        111   1111 [14,15]
        11    1111 [13,15]

then once the ngbrset is found, AND with the following
 to classify
P1,0010
   0011
   0111
   1010
   1111
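
The stopping radius suggested above, Radius =
ROOF{SQRT(|S|) / ROOF[SQRT(|S|/k)]}, can be computed as follows.  The
function name stop_radius is hypothetical.

```python
import math

def stop_radius(n_samples, k):
    """ROOF{ SQRT(|S|) / ROOF[ SQRT(|S|/k) ] }."""
    groups_per_side = math.ceil(math.sqrt(n_samples / k))
    return math.ceil(math.sqrt(n_samples) / groups_per_side)

print(stop_radius(16, 5))   # |S| = 16, k = 5 as in the example
```

For the example this gives 2, the radius used for the intervals
above.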



Misc Classification