Machine Learning (ML)
is an older term for Data Mining; it covered 2, CLASSIFICATION and CLUSTERING,
of the 3 Data Mining areas: Association Rule Mining, Classification and Clustering.
A still older term, Artificial Intelligence (AI), included all of these and much more.
CLASSIFICATION is the central area of the three!
Given a (large) TRAINING SET T(A1, ..., An, C) with CLASS C
and FEATURES A ≡ (A1,...,An)
C-CLASSIFICATION of an unclassified
sample, (a1,...,an) is just:
SELECT T.C, COUNT(*) AS votes
FROM T
WHERE T.A1 = a1
AND T.A2 = a2
...
AND T.An = an
GROUP BY T.C
ORDER BY votes DESC;
i.e., just a SELECTION (the top row holds the answer), since C-Classification is assigning to (a1..an)
the most frequent C-value among the tuples of T with A=(a1..an).
But, if the EQUALITY SELECTION is empty,
then we need a FUZZY QUERY to find NEAR NEIGHBORs (NNs)
instead of exact matches.
That's Nearest Neighbor Classification (NNC).
If SQL had a good Nearest Neighbor Set operator, we would be done.
But it doesn't, so NNC is essentially building a good NEAR NEIGHBOR operator.
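As a rough illustration of the exact-match vote described above, here is a minimal Python sketch. The function name `classify_exact` and the toy training set are hypothetical (made up for illustration); a `None` result marks the empty-selection case where a near-neighbor search would be needed.

```python
from collections import Counter

def classify_exact(training, sample):
    """Return the most frequent class among tuples whose features equal
    `sample`, or None when the equality selection is empty."""
    votes = Counter(c for *features, c in training
                    if tuple(features) == tuple(sample))
    if not votes:
        return None  # no exact match: fall back to a near-neighbor search
    return votes.most_common(1)[0][0]

# Hypothetical toy training set T(A1, A2, C); values are made up.
T = [
    ("sunny", "hot", "no"),
    ("sunny", "hot", "no"),
    ("sunny", "hot", "yes"),
    ("rainy", "mild", "yes"),
]
print(classify_exact(T, ("sunny", "hot")))   # no
print(classify_exact(T, ("rainy", "cold")))  # None
```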
E.g.,
Medical Expert System (Ask a Nurse): Symptoms plus past diagnoses
are collected into a table called CASES
For each undiagnosed new_symptoms,
CASES is searched for matches: SELECT DIAGNOSIS
FROM CASES
WHERE CASES.SYMPTOMS = new_symptoms;
If there is a predominant DIAGNOSIS,
Then report it,
ElseIf there's no predominant DIAGNOSIS,
Then Classify instead of Query, i.e.,
find fuzzy matches (near nbrs) SELECT DIAGNOSIS
FROM CASES
WHERE CASES.SYMPTOMS ≅ new_symptoms
Else call your doctor in the morning
That's exactly (Nearest Neighbor) Classification!!
CAR TALK radio show: Click and Clack the Tappet brothers have a vast
TRAINING SET of car problems and solutions built from experience.
They search that TRAINING SET for close matches to predict solutions
based on previous successful cases.
That's exactly (Nearest Neighbor) Classification!!
We all perform Nearest Neighbor Classification every day of our lives.
E.g., We learn when to apply specific programming/debugging techniques so that
we can apply them to similar situations thereafter.
COMPUTERIZED NNC ≡ MACHINE LEARNING
(most clustering (which is just partitioning) is done as
a simplifying prelude to classification).
Again, given a TRAINING SET, R(A1,..,An,C), with C=CLASSES and (A1..An)=FEATURES
Nearest Neighbor Classification (NNC) ≡
selecting a set of R-tuples with similar features (to the unclassified sample)
and then letting the corresponding class values vote.
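The select-then-vote idea can be sketched as a k-nearest-neighbor vote. This is only one possible reading: the Euclidean distance, the choice k=3, and the toy tuples in `R` are assumptions for illustration, not something the notes prescribe.

```python
from collections import Counter
import math

def knn_classify(training, sample, k=3):
    """training: list of (feature_tuple, class_label).
    Select the k tuples nearest to `sample` and let their classes vote."""
    by_distance = sorted(training, key=lambda row: math.dist(row[0], sample))
    votes = Counter(cls for _, cls in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set R(A1, A2, C).
R = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((0.8, 1.1), "a"),
     ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
print(knn_classify(R, (1.1, 1.0)))       # a
print(knn_classify(R, (4.9, 5.1), k=1))  # b
```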
Nearest Neighbor Classification won't work very well if
the vote is inconclusive (close to a tie)
or if similar (near) is not well defined. In that case we
build a MODEL of the TRAINING SET
(at, possibly, great 1-time expense).
When a MODEL is built first the technique is called Eager classification,
whereas
model-less methods like Nearest Neighbor are called Lazy or Sample-based.
Eager classifier models can be:
decision trees,
probabilistic models (Bayesian Classifier),
Neural Networks,
Support Vector Machines, ...
How do you decide when an EAGER model is good enough to use?
How do you decide if a Nearest Neighbor Classifier is working well enough?
We have a TEST PHASE.
Typically, we set aside some training tuples as a Test Set
(then, of course, those Test tuples cannot be used in model building
and cannot be used as nearest neighbors).
If the classifier passes the test
(a high enough % of Test tuples are correctly
classified by the classifier) it is accepted.
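A minimal sketch of such a test phase follows. The 25% split, the 90% acceptance threshold, the toy rows and the stand-in classifier are all assumptions chosen for illustration.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and set aside a fraction as the Test Set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]  # (remaining training rows, test rows)

def accuracy(classifier, test_rows):
    """Fraction of test tuples whose predicted class equals the true class."""
    correct = sum(1 for features, cls in test_rows if classifier(features) == cls)
    return correct / len(test_rows)

# Hypothetical data: the class is "big" when the single feature is >= 10.
rows = [((x,), "big" if x >= 10 else "small") for x in range(20)]
train, test = holdout_split(rows)

# Stand-in for a model built from `train` only.
classifier = lambda f: "big" if f[0] >= 10 else "small"

acc = accuracy(classifier, test)
accepted = acc >= 0.9  # acceptance threshold is an assumption
print(len(train), len(test), acc, accepted)
```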
EXAMPLE 1:
Computer Ownership TRAINING SET for predicting who owns a computer:
Customer Age Salary Job Owns Computer
1 | 24 | 55,000 | Programmer | yes
2 | 58 | 94,000 | Doctor | no
3 | 48 | 14,000 | Laborer | no
4 | 58 | 19,000 | Domestic | no
5 | 28 | 18,000 | Construction| no
A classifier might be built from this TRAINING SET (e.g., a decision tree) as follows:
Age < 30
/ \
T F
/ \
Salary > 50K No
/ \
T F
/ \
Yes No
It is easy to determine a pattern in this small dataset; however, for large
datasets it is impossible to construct a decision tree model by "sight".
Therefore we need a Model Building Algorithm (or training algorithm).
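For illustration, the decision tree drawn above for Example 1 can be written as a small function and checked against the five training tuples. The function name `owns_computer` is hypothetical; Job is unused because the tree never tests it.

```python
def owns_computer(age, salary):
    """Hand-built tree from Example 1: Age < 30, then Salary > 50K."""
    if age < 30:
        return "yes" if salary > 50_000 else "no"
    return "no"

# (Age, Salary, Owns Computer) for customers 1..5 of the training set.
training = [
    (24, 55_000, "yes"),
    (58, 94_000, "no"),
    (48, 14_000, "no"),
    (58, 19_000, "no"),
    (28, 18_000, "no"),
]
predictions = [owns_computer(age, salary) for age, salary, _ in training]
print(predictions)  # ['yes', 'no', 'no', 'no', 'no']
```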
EXAMPLE 2:
PRECISION AG YIELD CLASSIFIER predicts YIELD of a field grid cell
based on mid-year Blue, Green, Red, NearInfraRed reflectances from that cell.
The TRAINING SET is R(CELL, Blue, Green, Red, NIR, YIELD) from previous year.
1st Separate out a Test Set.
2nd Build a CLASSIFIER MODEL (decision tree) from remaining TRAINING SET
3rd Test MODEL accuracy using the Test Set. If it passes the test,
then when an aerial photo is taken during the growing season,
predict where low yield can be expected using the MODEL
(then apply additional nutrients to those cells?)
TRAINING SET
X Y Blue_____ Green____ Red_____ NIR_____ YIELD_
0 0 | 0000 1001 | 1010 1111 | 0000 0110 | 1111 0101 | medium
0 1 | 0000 1011 | 1011 0100 | 0000 0101 | 1111 0111 | medium
0 2 | 0000 1011 | 1011 0101 | 0000 0100 | 1111 0111 | high
0 3 | 0000 0111 | 1011 0111 | 0000 0011 | 1111 1000 | high
0 4 | 0000 0111 | 1011 1011 | 0000 0001 | 1111 1001 | high
0 6 | 0000 1000 | 1011 1111 | 0000 0000 | 1111 1011 | high
1 0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium
2 1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium
3 2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium
4 3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high
5 4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high
6 6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high
Separate out as TEST SET
X Y Blue_____ Green____ Red_____ NIR_____ YIELD_
1 0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium
2 1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium
3 2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium
4 3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high
5 4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high
6 6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high
TRAIN a Classifier with the remainder (a decision tree)
REMAINDER of the TRAINING SET
X Y Blue_____ Green____ Red_____ NIR_____ YIELD_
0 0 | 0000 1001 | 1010 1111 | 0000 0110 | 1111 0101 | medium
0 1 | 0000 1011 | 1011 0100 | 0000 0101 | 1111 0111 | medium
0 2 | 0000 1011 | 1011 0101 | 0000 0100 | 1111 0111 | high
0 3 | 0000 0111 | 1011 0111 | 0000 0011 | 1111 1000 | high
0 4 | 0000 0111 | 1011 1011 | 0000 0001 | 1111 1001 | high
0 6 | 0000 1000 | 1011 1111 | 0000 0000 | 1111 1011 | high
____________________________________
/ | \
/ | \
/ | \
/ | \
NIR ≤ 01000000 01000000 < NIR ≤ 11110111 11110111 < NIR
^ Red ≥ 00100000 ^ 00100000 > Red ≥ 00000101 ^ 00000101 > Red
/ | \
/ | \
YIELD = low YIELD = medium YIELD = high
TEST Classifier
TEST SET
X Y Blue_____ Green____ Red_____ NIR_____ YIELD_ PREDICTED YIELD
1 0 | 0001 1101 | 1010 1110 | 0000 0111 | 1111 0100 | medium | medium
2 1 | 0000 1111 | 1011 0101 | 0000 0110 | 1111 0110 | medium | medium
3 2 | 0001 1111 | 1011 0111 | 0000 0101 | 1111 0110 | medium | medium
4 3 | 0001 1111 | 1011 0110 | 0000 0010 | 1111 1000 | high | high
5 4 | 0001 1111 | 1011 1010 | 0000 0010 | 1111 1000 | high | high
6 6 | 0001 1111 | 1011 1110 | 0000 0001 | 1111 1010 | high | high
Tests out to be 100% correct (Gets an A grade!).
USE Classifier Model (decision tree) to classify:
New Data: R,G,B,NIR from an aerial image taken on ~4th of July:
X Y Blue_____ Green____ Red______ NIR______
0 6 | 0001 1100 | 1011 1110 | 0000 0001 | 1111 1110
___________________ ===================
/ | \\
/ | \\
/ | \\
/ | \\
NIR ≤ 01000000 01000000 < NIR ≤ 11110111 1111 0111 < NIR
^ Red ≥ 00100000 ^ 00100000 > Red ≥ 00000101 ^ 0000 0101 > Red
/ | \\
/ | \\
YIELD = low YIELD = medium YIELD = high
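The tree above can be sketched as a function, reading "^" as AND. The name `predict_yield` is hypothetical, and returning `None` when no branch's pair of conditions holds is an assumption, since the diagram does not say what happens then.

```python
def predict_yield(red, nir):
    """Three-way decision tree on the NIR and Red reflectances (8-bit values)."""
    if nir <= 0b01000000 and red >= 0b00100000:
        return "low"
    if 0b01000000 < nir <= 0b11110111 and 0b00100000 > red >= 0b00000101:
        return "medium"
    if nir > 0b11110111 and red < 0b00000101:
        return "high"
    return None  # no branch applies (behavior assumed, not given in the notes)

# (Red, NIR, YIELD) for the six TEST SET cells.
test_set = [
    (0b00000111, 0b11110100, "medium"),
    (0b00000110, 0b11110110, "medium"),
    (0b00000101, 0b11110110, "medium"),
    (0b00000010, 0b11111000, "high"),
    (0b00000010, 0b11111000, "high"),
    (0b00000001, 0b11111010, "high"),
]
correct = sum(predict_yield(r, n) == y for r, n, y in test_set)
print(correct, "of", len(test_set))              # 6 of 6

# New cell (X=0, Y=6): Red = 0000 0001, NIR = 1111 1110.
print(predict_yield(0b00000001, 0b11111110))     # high
```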
Preparing Data for Classification
Data Cleaning (of noise and missing values)
Remove Noise (or reduce noise) by "smoothing"
Fill in missing values (with most common or some statistical value)
NOTE: Even Noise and Missing Value management can be done by a
Nearest Neighbor Vote! (called interpolation)
Feature Extraction to eliminate irrelevant attributes (e.g., in the PA example,
eliminate Blue, Green since they're irrelevant to the decision).
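As a small sketch of the "fill in missing values with the most common value" step, assuming missing entries are marked with `None` (the marker and the function name `fill_missing` are hypothetical):

```python
from collections import Counter

def fill_missing(values, missing=None):
    """Replace missing entries by the most common value that is present."""
    present = [v for v in values if v is not missing]
    most_common = Counter(present).most_common(1)[0][0]
    return [most_common if v is missing else v for v in values]

print(fill_missing(["a", None, "b", "a"]))  # ['a', 'a', 'b', 'a']
```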
Ways of Comparing Different Classification Methods
Predictive Accuracy (predicting the class label of new data)
Speed (computation costs for generating and using the model)
Robustness (does it give almost the same predictions when
the Training Sets are almost the same?)
Scalability (Model construction efficiency - massive datasets)
More Detail on Some Classification Methods:
K-Nearest-Neighbor Classification
Decision Tree Models for EAGER CLASSIFICATION:
each internal node is a test on a feature attribute (composite?),
each test outcome is assigned a link to the next level
(outcome=a value or range of values or...)
each leaf represents a class (or distribution of classes)
Unknown samples are classified by testing their feature attributes against the tree.
The leaf arrived at holds the class prediction for that sample.
Some branches may represent noise or outliers (and should be pruned?)
The ID3 algorithm for inducing a decision tree from training tuples is:
1. The tree starts as a single node containing the entire TRAINING SET.
2. If all TRAINING TUPLES have the same class, this node is a leaf. DONE.
3. Otherwise, use a measure, information gain, as a heuristic for
selecting the best decision attribute for that node
4. A branch is created for each value [interval of values] of that test attribute
and the TRAINING SET is partitioned accordingly.
5. Recurse on 2,3,4 until the Stopping Condition is true.
Possible Stopping Conditions:
All samples for a given node belong to the same class (label with that class)
There are no remaining candidate decision attributes (label with plurality class).
Some other stopping rule.
Information Gain as an Attribute Selection Measure
Minimizes expected number of tests needed to classify an object
and guarantees simple tree (not necessarily the simplest)
At any stage, let
S be a TRAINING SUBSET containing s = |S| samples,
C1,...,Cm be the distinct classes in S, and si = |S∩Ci| be the number of samples of class Ci.
EXPECTED INFORMATION needed to classify a sample given S as TRAINING SET is:
I(s1,...,sm) = -SUM(i=1..m)[ pi*log2(pi) ], pi = si/s
Choosing A (with values a1,...,av) as decision attribute partitions S into
subsets A1,...,Av, where Aj = {t in S : t(A)=aj} and sij = |Aj∩Ci|.
The ENTROPY (expected info still needed after the split) is
E(A) = SUM(j=1..v)[ (s1j+...+smj)/s * I(s1j,...,smj) ]
Gain(A) = I(s1,...,sm) - E(A)
- expected reduction of info required to classify
after splitting via A-values.
The algorithm above computes the information gain of each
attribute and selects the one with the highest information gain
as the test attribute.
Branches are created for each value of that attribute and samples
are partitioned accordingly.
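The information-gain measure can be sketched directly from class counts. The function names and the two tiny count tables are hypothetical; terms with a zero count are skipped, following the usual convention that 0*log2(0) = 0.

```python
from math import log2

def info(counts):
    """Expected information I(s1..sm) for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def entropy_after_split(columns):
    """E(A): columns[j] holds the class counts (s1j..smj) of subset Aj."""
    s = sum(sum(col) for col in columns)
    return sum(sum(col) / s * info(col) for col in columns if sum(col) > 0)

def gain(class_counts, columns):
    """Gain(A) = I(s1..sm) - E(A)."""
    return info(class_counts) - entropy_after_split(columns)

# Hypothetical counts: two classes with 2 samples each.
print(gain([2, 2], [[2, 0], [0, 2]]))  # 1.0  (attribute separates the classes)
print(gain([2, 2], [[1, 1], [1, 1]]))  # 0.0  (attribute tells us nothing)
```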
Pruning
=======
When a decision tree is built, many of the branches will reflect
anomalies in the training data due to noise or outliers.
Tree pruning methods address this problem of "overfitting" the
data (classifying situations that are erroneous or accidental).
Such methods typically use statistical measures to remove the
least reliable branches, resulting in faster classification and
an improvement in the ability of the tree to correctly classify
independent test data.
Extracting Classification Rules from Decision Trees
One rule for each path from root to leaf.
Each (attr,value) along path forms a conjunction in the antecedent
Leaf holds class prediction or consequent.
May be easier for humans to understand rules.
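A minimal sketch of rule extraction, using the Example 1 tree as input. The nested-tuple representation of the tree and the wording of the generated rules are assumptions made for illustration.

```python
def extract_rules(tree, path=()):
    """tree is either a class label (leaf) or (test, {outcome: subtree}).
    Yields one IF ... THEN rule per root-to-leaf path."""
    if not isinstance(tree, tuple):
        yield "IF " + " AND ".join(path or ("TRUE",)) + f" THEN class = {tree}"
        return
    test, branches = tree
    for outcome, subtree in branches.items():
        yield from extract_rules(subtree, path + (f"({test}) is {outcome}",))

# The Example 1 tree: Age < 30, then Salary > 50K.
tree = ("Age < 30", {True: ("Salary > 50K", {True: "Yes", False: "No"}),
                     False: "No"})
rules = list(extract_rules(tree))
for rule in rules:
    print(rule)
```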
More on Decision Tree Induction (powerpoint Introduction)
Note that the notion of "near" requires a distance or similarity measure to exist.
What are some of them?
Metrics (distance functions on feature space)
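Three commonly used distance functions are sketched below as examples; the notes do not name specific metrics, so this selection (Euclidean, Manhattan, Hamming on bit strings) is an assumption. The sample vectors are arbitrary.

```python
def euclidean(x, y):
    """Straight-line (L2) distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming_bits(a, b):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")

x, y = (3, 7, 8, 11), (3, 3, 8, 15)
print(manhattan(x, y))               # 8
print(round(euclidean(x, y), 3))     # 5.657
print(hamming_bits(0b0111, 0b0011))  # 1
```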
The example:
Training Data:
Band B1:     | Band B2:     | Band B3: | Band B4:
 3  3  7  7  |  7  3  3  2  | 8 8 4 5  | 11 15 11 11
 3  3  7  7  |  7  3  3  2  | 8 8 4 5  | 11 11 11 11
 2  2 10 15  | 11 11 10 10  | 8 8 4 4  | 15 15 11 11
 2 10 15 15  | 11 11 10 10  | 8 8 4 4  | 15 15 11 11
S:
X-Y B1 B2 B3 B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
Suppose that B1 is the class label attribute (e.g., Yield)
Then the class labels are 2, 3, 7, 10, 15 (C1,..,C5).
We need to know the count of the number of pixels (rows in
the table above) that contain each value in each attribute.
We also need to know the count of pixels that contain pairs of
values, one from a descriptive attribute and the other from the
class label attribute.
Moreover we may wish to focus on only a portion of the dataset
(some part of the field) before making those count calculations.
The Ptree structure is perfect for providing those counts.
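As a rough check of the counts a Ptree provides, the sketch below derives the four bit planes of band B1 and their root counts from the raster. It keeps only flat bit lists and counts; the quadrant structure of a real Ptree is deliberately left out, and the helper names are hypothetical.

```python
from collections import Counter

# Band B1 in raster order (row by row), taken from the training data.
B1 = [3, 3, 7, 7,
      3, 3, 7, 7,
      2, 2, 10, 15,
      2, 10, 15, 15]

def bit_plane(values, bit, width=4):
    """Bit 1 is the most significant bit, matching B11..B14."""
    return [(v >> (width - bit)) & 1 for v in values]

# Root counts of the basic bit planes B11, B12, B13, B14.
basic_root_counts = [sum(bit_plane(B1, b)) for b in range(1, 5)]
print(basic_root_counts)            # [5, 7, 16, 11]

# Root counts per value at 2-bit precision (high two bits) and at 4 bits.
two_bit_counts = Counter(v >> 2 for v in B1)
value_counts = Counter(B1)
print(sorted(two_bit_counts.items()))  # [(0, 7), (1, 4), (2, 2), (3, 3)]
print(sorted(value_counts.items()))    # [(2, 3), (3, 4), (7, 4), (10, 2), (15, 3)]
```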
B11 B12 B13 B14
0000 0011 1111 1111
0000 0011 1111 1111
0011 0001 1111 0001
0111 0011 1111 0011
BASIC_PTREES_band1___________________
P1,1 P1,2 P1,3 P1,4
5 7 16 11
0 0 1 4 0 4 0 3 4 4 0 3
3 0 0 <-where "different" bit is
VALUE_PTREES_band1___________________ (2-bit precision, 3, etc):
P1(00) P1(01) P1(10) P1(11)
7 4 2 3
4 0 3 0 0 4 0 0 0 0 1 1 0 0 0 3
3 3 0 0
P1(000) P1(010) P1(100) P1(110) P1(001) P1(011) P1(101) P1(111)
0 0 0 0 7 4 2 3
4 0 3 0 0 4 0 0 0 0 1 1 0 0 0 3
3 3 0 0
P1(0000 P1(0100 P1(1000 P1(1100 P1(0010 P1(0110 P1(1010 P1(1110
0 0 0 0 3 0 2 0
0 0 3 0 0 0 1 1
3 3 0
P1(0001 P1(0101 P1(1001 P1(1101 P1(0011 P1(0111 P1(1011 P1(1111
0 0 0 0 4 4 0 3
4 0 0 0 0 4 0 0 0 0 0 3
0
B21 B22 B23 B24
0000 1000 1111 1110
0000 1000 1111 1110
1111 0000 1111 1100
1111 0000 1111 1100
BASIC_PTREES_band2___________________
P2,1 P2,2 P2,3 P2,4
8 2 16 10
0 0 4 4 2 0 0 0 4 2 4 0
02 02 <-positions of the
two 1-bits
VALUE_PTREES_band2___________________
P2(00) P2(01) P2(10) P2(11)
6 2 8 0
2 4 0 0 2 0 0 0 0 0 4 4
13 02
P2(000 P2(010 P2(100 P2(110 P2(001 P2(011 P2(101 P2(111
0 0 0 0 6 2 8 0
2 4 0 0 2 0 0 0 0 0 4 4
13 02
P2(0000 P2(0100 P2(1000 P2(1100 P2(0010 P2(0110 P2(1010 P2(1110
0 0 0 0 2 0 4 0
0 2 0 0 0 0 0 4
13
P2(0001 P2(0101 P2(1001 P2(1101 P2(0011 P2(0111 P2(1011 P2(1111
0 0 0 0 4 2 4 0
2 2 0 0 2 0 0 0 0 0 4 0
1302 02
B31 B32 B33 B34
1100 0011 0000 0001
1100 0011 0000 0001
1100 0011 0000 0000
1100 0011 0000 0000
BASIC_PTREES_band3___________________
P3,1 P3,2 P3,3 P3,4
8 8 0 2
4 0 4 0 0 4 0 4 0 2 0 0
13
VALUE_PTREES_band3___________________
P3(00) P3(01) P3(10) P3(11)
0 8 8 0
0 4 0 4 4 0 4 0
P3(000) P3(010) P3(100) P3(110) P3(001) P3(011) P3(101) P3(111)
0 8 8 0 0 0 0 0
0 4 0 4 4 0 4 0
P3(0000 P3(0100 P3(1000 P3(1100 P3(0010 P3(0110 P3(1010 P3(1110
0 6 8 0 0 0 0 0
0 2 0 4 4 0 4 0
02
P3(0001 P3(0101 P3(1001 P3(1101 P3(0011 P3(0111 P3(1011 P3(1111
0 2 0 0 0 0 0 0
0 2 0 0
13
B41 B42 B43 B44
1111 0100 1111 1111
1111 0000 1111 1111
1111 1100 1111 1111
1111 1100 1111 1111
BASIC_PTREES_band4___________________
P4,1 P4,2 P4,3 P4,4
16 5 16 16
1 0 4 0
1
VALUE_PTREES_band4___________________
P4(00 P4(01 P4(10 P4(11
0 0 11 5
3 4 0 4 1 0 4 0
1 1
P4(000 P4(010 P4(100 P4(110 P4(001 P4(011 P4(101 P4(111
0 0 0 0 0 0 11 5
3 4 0 4 1 0 4 0
1 1
P4(0000 P4(0100 P4(1000 P4(1100 P4(0010 P4(0110 P4(1010 P4(1110
0 0 0 0 0 0 0 0
P4(0001 P4(0101 P4(1001 P4(1101 P4(0011 P4(0111 P4(1011 P4(1111
0 0 0 0 0 0 11 5
3 4 0 4 1 0 4 0
1 1
Suppose we take this relation as training set (4-bit values).
Let B1 be the class label attribute.
Then the classes are:
{ C1,C2,C3,C4,C5 } =
{ 2, 3, 7,10,15 } where Ci={ci}.
The ID3 alg for inducing a decision tree from training samples:
S:
X-Y B1 B2 B3 B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
1. Tree starts as one node representing the training samples, S.
2. If all samples are in same class (same B1-value)
then S becomes a leaf with that class label. [Not true!]
3. Else, use entropy-based, "information gain" as a heuristic for
selecting the first decision attribute.
Take B2 = {a1,a2,a3,a4,a5} = { 2, 3, 7,10,11 }
as the first candidate attribute.
Aj={t:t(B2)=aj}, where a1=0010, a2=0011, a3=0111,
a4=1010, a5=1011.
sij is number of samples of class, Ci, in a subset, Aj.
so sij = rc( P1(ci)^P2(aj) ), where ci is in {2,3,7,10,15}
and aj is in {2,3,7,10,11}.
++---------+----------+----------+----------+--------
|| P2(2) | P2(3) | P2(7) |P2(10) |P2(11)
|| 2 | 4 | 2 | 4 | 4
--.----------|| 0 2 0 0 | 2 2 0 0 | 2 0 0 0 | 0 0 0 4 | 0 0 4 0
ci| P1(ci) || 13 | 1302 | 02 | |
==+==========++=========+==========+==========+==========+========
2| 3 || 0 | 0 | 0 | 0 | 3
| 0 0 3 0 || | | | | 0 0 3 0
| 3 || | | | | 3
--+----------++---------+----------+----------+----------+--------
3| 4 || 0 | 2 | 2 | 0 | 0
| 4 0 0 0 || | 2 0 0 0 | 2 0 0 0 | |
| || | 13 | 02 | |
--+----------++---------+----------+----------+----------+--------
7| 4 || 2 | 2 | 0 | 0 | 0
| 0 4 0 0 || 0 2 0 0 | 0 2 0 0 | | |
| || 13 | 02 | | |
--+----------++---------+----------+----------+----------+--------
10| 2 || 0 | 0 | 0 | 1 | 1
| 0 0 1 1 || | | | 0 0 0 1 | 0 0 1 0
| 3 0 || | | | 0 | 3
--+----------++---------+----------+----------+----------+--------
15| 3 || 0 | 0 | 0 | 3 | 0
| 0 0 0 3 || | | | 0 0 0 3 |
| 0 || | | | 0 |
--+----------++---------+----------+----------+----------+--------
EXPECTED INFO needed to classify the sample is:
I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],
m = 5
s = 16
si = 3,4,4,2,3 (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16
I = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
+3/16*lg2(3/16))
= -( -.453 -.5 -.5 -.375
-.453 )
= -( -2.281) = 2.281
ENTROPY based on the partition into subsets by B2 is
E(B2)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ] where
Ij = I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj|
Since m=5, the sij's are:
j=1 j=2 j=3 j=4 j=5
--- --- --- --- ---
0 0 0 0 3 <-- s1j
0 2 2 0 0 <-- s2j
2 2 0 0 0 <-- s3j
0 0 0 1 1 <-- s4j
0 0 0 3 0 <-- s5j
--- --- --- --- ---
2 4 2 4 4 <- s1j+..+s5j
j=1 j=2 j=3 j=4 j=5
--- --- --- --- ---
2 4 2 4 4 <- |Aj|
where Aj's are the rootcounts of P2(aj)'s.
Therefore,
j=1 j=2 j=3 j=4 j=5
--- --- --- --- ---
0 0 0 0 .75 <- p1j
0 .5 1 0 0 <- p2j
1 .5 0 0 0 <- p3j
0 0 0 .25 .25 <- p4j
0 0 0 .75 0 <- p5j
and
j=1 j=2 j=3 j=4 j=5
--- --- --- --- ---
0* 0 0 0 -.311 <- p1j*log2(p1j)
0 -.5 0 0 0 <- p2j*log2(p2j)
0 -.5 0 0 0 <- p3j*log2(p3j)
0 0 0 -.5 -.5 <- p4j*log2(p4j)
0 0 0 -.311 0 <- p5j*log2(p5j)
-- --- --- ----- ----
0 1 0 .811 .811 <- I(s1j..s5j) = - sum of above
2 4 2 4 4 <- s1j+..+s5j
so that,
0 .25 0 .203 .203 (s1j+..+s5j)*
I(s1j..s5j)/16
.656 E(B2)
2.281 I(s1..sm)
GAIN(B2) -> 1.625 I(s1..sm)-E(B2)
NOTE: ONE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
Footnote * (If pij = 0 why is p1j*log2(p1j) = 0?
Hint: L'Hospital's Rule)
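The counts sij = rc( P1(ci)^P2(aj) ) used above can be sketched with plain bit vectors: each value Ptree is flattened to a 16-bit mask (one bit per pixel), ANDing is `&`, and the root count is the number of 1-bits. This flattening is an assumption that ignores the quadrant compression of real Ptrees; the helper names are hypothetical.

```python
# Bands B1 (class label) and B2 in raster order, from the training data.
B1 = [3, 3, 7, 7,   3, 3, 7, 7,   2, 2, 10, 15,   2, 10, 15, 15]
B2 = [7, 3, 3, 2,   7, 3, 3, 2,   11, 11, 10, 10, 11, 11, 10, 10]

def value_mask(band, value):
    """Bit i is 1 exactly when pixel i of the band holds `value`."""
    return sum(1 << i for i, v in enumerate(band) if v == value)

def rc(mask):
    """Root count: number of 1-bits in the mask."""
    return bin(mask).count("1")

classes = [2, 3, 7, 10, 15]    # c1..c5
b2_values = [2, 3, 7, 10, 11]  # a1..a5

# s[i][j] = rc( P1(ci) ^ P2(aj) )
s = [[rc(value_mask(B1, c) & value_mask(B2, a)) for a in b2_values]
     for c in classes]
for row in s:
    print(row)
```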
Continuing with B3
---------------------------------------------------------------
Take B3 = {a1,a2,a3} = {4,5,8} as the 2nd candidate attribute.
Aj={t:t(B3)=aj}, where a1=0100, a2=0101, a3=1000,
sij is number of samples of class, Ci, in a subset, Aj.
so sij = rc( P1(ci)^P3(aj) ), where ci is in {2,3,7,10,15}
and aj is in {4,5,8}.
++---------+----------+----------+
|| P3(4) | P3(5) | P3(8) |
|| 6 | 2 | 8 |
-------------|| 0 2 0 4 | 0 2 0 0 | 4 0 4 0 |
ci| P1(ci) || 02 | 13 | |
==+==========++=========+==========+==========+
2| 3 || 0 | 0 | 3 |
| 0 0 3 0 || | | 0 0 3 0 |
| 3 || | | 3 |
--+----------++---------+----------+----------+
3| 4 || 0 | 0 | 4 |
| 4 0 0 0 || | | 4 0 0 0 |
| || | | |
--+----------++---------+----------+----------+
7| 4 || 2 | 2 | 0 |
| 0 4 0 0 || 0 2 0 0 | 0 2 0 0 | |
| || 02 | 13 | |
--+----------++---------+----------+----------+
10| 2 || 1 | 0 | 1 |
| 0 0 1 1 || 0 0 0 1 | | 0 0 1 0 |
| 3 0 || 0 | | 3 |
--+----------++---------+----------+----------+
15| 3 || 3 | 0 | 0 |
| 0 0 0 3 || 0 0 0 3 | | |
| 0 || 0 | | |
--+----------++---------+----------+----------+
EXPECTED INFO needed to classify the sample is the same as above:
I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],
m = 5
s = 16
si = 3,4,4,2,3 (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16
I = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
+3/16*lg2(3/16))
= -( -.453 -.5 -.5 -.375
-.453 )
= -( -2.281) = 2.281
ENTROPY based on the partition into subsets by B3 is
E(B3)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ] where
I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj|
The sij's are:
j=1 j=2 j=3
--- --- ---
0 0 3 <-- s1j
0 0 4 <-- s2j
2 2 0 <-- s3j
1 0 1 <-- s4j
3 0 0 <-- s5j
--- --- ---
6 2 8 <- s1j+..+s5j
6 2 8 <- |Aj| (divisors)
0 0 .375 <- p1j
0 0 .5 <- p2j
.333 1 0 <- p3j
.167 0 .125 <- p4j
.5 0 0 <- p5j
0 0 -.531 <- p1j*log2(p1j)
0 0 -.5 <- p2j*log2(p2j)
-.528 0 0 <- p3j*log2(p3j)
-.431 0 -.375 <- p4j*log2(p4j)
-.5 0 0 <- p5j*log2(p5j)
-- --- ---
1.459 0 1.406 <- I(s1j..s5j)=- sum of above
6 2 8 <- s1j+..+s5j
.547 0 .703 <- (s1j+..+s5j)*I(s1j..s5j)/16
1.250 <- E(B3) (sum of above)
2.281 <- I(s1..sm)
GAIN(B3) > 1.031 <- I(s1..sm) - E(B3)
Continuing with B4=A={a1..av} used to classify S into {A1..Av},
---------------------------------------------------------------
Take B4 = {a1,a2} = {11,15} as the 3rd candidate attribute.
Aj={t:t(B4)=aj}, where a1=1011, a2=1111
sij is number of samples of class, Ci, in a subset, Aj.
so sij = rc( P1(ci)^P4(aj) ), where ci is in {2,3,7,10,15}
and aj is in {11,15}.
++---------+----------+
|| P4(11) | P4(15) |
|| 11 | 5 |
-------------|| 3 4 0 4 | 1 0 4 0 |
ci| P1(ci) || 1 | 1 |
==+==========++=========+==========+
2| 3 || 0 | 3 |
| 0 0 3 0 || | 0 0 3 0 |
| 3 || | 3 |
--+----------++---------+----------+
3| 4 || 3 | 1 |
| 4 0 0 0 || 3 0 0 0 | 1 0 0 0 |
| || 1 | 1 |
--+----------++---------+----------+
7| 4 || 4 | 0 |
| 0 4 0 0 || 0 4 0 0 | |
| || | |
--+----------++---------+----------+
10| 2 || 1 | 1 |
| 0 0 1 1 || 0 0 0 1 | 0 0 1 0 |
| 3 0 || 0 | 3 |
--+----------++---------+----------+
15| 3 || 3 | 0 |
| 0 0 0 3 || 0 0 0 3 | |
| 0 || 0 | |
--+----------++---------+----------+
EXPECTED INFO needed to classify the sample is the same as above:
I = I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ],
m = 5
s = 16
si = 3,4,4,2,3 (rootcounts for the class labels, rc(P1(ci))'s)
pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16
I = -(3/16*lg2(3/16)+4/16*lg2(4/16)+4/16*lg2(4/16)+2/16*lg2(2/16)
+3/16*lg2(3/16))
= -( -.453 -.5 -.5 -.375
-.453 )
= -( -2.281) = 2.281
ENTROPY based on the partition into subsets by B4 is
E(B4)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ] where
I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj|
The sij's are:
j=1 j=2
--- ---
0 3 <-- s1j
3 1 <-- s2j
4 0 <-- s3j
1 1 <-- s4j
3 0 <-- s5j
--- ---
11 5 <- s1j+..+s5j
11 5 <- |Aj| (divisors)
0 .6 <- p1j
.273 .2 <- p2j
.364 0 <- p3j
.091 .2 <- p4j
.273 0 <- p5j
0 -.442 <- p1j*log2(p1j)
-.511 -.464 <- p2j*log2(p2j)
-.531 0 <- p3j*log2(p3j)
-.315 -.464 <- p4j*log2(p4j)
-.511 0 <- p5j*log2(p5j)
-- ---
1.868 1.37 <- I(s1j..s5j)= - sum of above
11 5 <- s1j+..+s5j
1.284 .428 <- (s1j+..+s5j)*I(s1j..s5j)/16
1.712 <- E(B4) (sum of above)
2.281 <- I(s1..sm)
GAIN(B4) > .568 <- I(s1..sm) - E(B4)
and
GAIN(B3) > 1.031 <- I(s1..sm) - E(B3)
GAIN(B2) > 1.625 <- I(s1..sm) - E(B2)
Thus we select B2 as the first level decision attribute.
NOTE: WE GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
4. Branches are created for each value of B2 and samples are
partitioned accordingly (If a partition is empty, generate a
leaf and label it with the most common class, C2,
labeled with 0011).
.--- B2=0000 - > C2:0011
|--- B2=0001 - > C2:0011
|--- B2=0010 - > Sample_Set_1
|--- B2=0011 - > Sample_Set_2
|--- B2=0100 - > C2:0011
|--- B2=0101 - > C2:0011
|--- B2=0110 - > C2:0011
B2 --|--- B2=0111 - > Sample_Set_3
|--- B2=1000 - > C2:0011
|--- B2=1001 - > C2:0011
|--- B2=1010 - > Sample_Set_4
|--- B2=1011 - > Sample_Set_5
|--- B2=1100 - > C2:0011
|--- B2=1101 - > C2:0011
|--- B2=1110 - > C2:0011
`--- B2=1111 - > C2:0011
Sample_Set_1
X-Y B1 B3 B4
0,3 0111 0101 1011
1,3 0111 0101 1011
Sample_Set_2
X-Y B1 B3 B4
0,1 0011 1000 1111
0,2 0111 0100 1011
1,1 0011 1000 1011
1,2 0111 0100 1011
Sample_Set_3
X-Y B1 B3 B4
0,0 0011 1000 1011
1,0 0011 1000 1011
Sample_Set_4
X-Y B1 B3 B4
2,2 1010 0100 1011
2,3 1111 0100 1011
3,2 1111 0100 1011
3,3 1111 0100 1011
Sample_Set_5
X-Y B1 B3 B4
2,0 0010 1000 1111
2,1 0010 1000 1111
3,0 0010 1000 1111
3,1 1010 1000 1111
NOTE: WE DON'T NEED TO LIST OUT THE SAMPLE_SETS IN ORDER TO CONTINUE
5. The Algorithm recurses to form decision tree for the samples at
each partition. Once an attribute is the decision attribute at
a node, it is not considered further.
6. Stop when:
a. All samples for a given node belong to the same class or
b. no remaining attributes
(label leaf with majority class among the samples)
We note all samples belong to the same class for nodes:
Sample_Set_1, B2=0010, have class, C3:0111.
Sample_Set_3, B2=0111, have class, C2:0011.
NOTE: ONE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
One can determine that these Sample_Sets contain only one
B1 value (class label) from the Ptrees already computed:
++---------+----------+----------+----------+--------
|| P2(2) | P2(3) | P2(7) |P2(10) |P2(11)
|| 2 | 4 | 2 | 4 | 4
--.----------|| 0 2 0 0 | 2 2 0 0 | 2 0 0 0 | 0 0 0 4 | 0 0 4 0
ci| P1(ci) || 13 | 1302 | 02 | |
==+==========++=========+==========+==========+==========+========
2| 3 || 0 | 0 | 0 | 0 | 3
| 0 0 3 0 || | | | | 0 0 3 0
| 3 || | | | | 3
--+----------++---------+----------+----------+----------+--------
3| 4 || 0 | 2 | 2 | 0 | 0
| 4 0 0 0 || | 2 0 0 0 | 2 0 0 0 | |
| || | 13 | 02 | |
--+----------++---------+----------+----------+----------+--------
7| 4 || 2 | 2 | 0 | 0 | 0
| 0 4 0 0 || 0 2 0 0 | 0 2 0 0 | | |
| || 13 | 02 | | |
--+----------++---------+----------+----------+----------+--------
10| 2 || 0 | 0 | 0 | 1 | 1
| 0 0 1 1 || | | | 0 0 0 1 | 0 0 1 0
| 3 0 || | | | 0 | 3
--+----------++---------+----------+----------+----------+--------
15| 3 || 0 | 0 | 0 | 3 | 0
| 0 0 0 3 || | | | 0 0 0 3 |
| 0 || | | | 0 |
--+----------++---------+----------+----------+----------+--------
Thus the decision tree becomes:
.--- B2=0000 - > C2:0011
|--- B2=0001 - > C2:0011
|--- B2=0010 - > C3:0111
|--- B2=0011 - > Sample_Set_2
|--- B2=0100 - > C2:0011
|--- B2=0101 - > C2:0011
|--- B2=0110 - > C2:0011
B2 --|--- B2=0111 - > C2:0011
|--- B2=1000 - > C2:0011
|--- B2=1001 - > C2:0011
|--- B2=1010 - > Sample_Set_4
|--- B2=1011 - > Sample_Set_5
|--- B2=1100 - > C2:0011
|--- B2=1101 - > C2:0011
|--- B2=1110 - > C2:0011
`--- B2=1111 - > C2:0011
Sample_Set_2 (for B2=0011)
X-Y B1 B3 B4
0,1 0011 1000 1111
0,2 0111 0100 1011
1,1 0011 1000 1011
1,2 0111 0100 1011
Sample_Set_4 (for B2=1010)
X-Y B1 B3 B4
2,2 1010 0100 1011
2,3 1111 0100 1011
3,2 1111 0100 1011
3,3 1111 0100 1011
Sample_Set_5 (for B2=1011)
X-Y B1 B3 B4
2,0 0010 1000 1111
2,1 0010 1000 1111
3,0 0010 1000 1111
3,1 1010 1000 1111
Recursing the algorithm on Sample_Set_2 (B2=0011):
1. Subtree starts as single node, S = Sample_Set_2 (determined
by B2=0011, so that ANDing with P2(3) gives correct counts).
2. Not all samples are in the same class (same B1-value),
3. So, use entropy-based measure, "information gain" as a
heuristic for selecting the attribute that will best separate
the samples into individual classes
NOTE: WE CAN GET THIS FAR USING ONLY P-TREES (NO DATABASE SCANS)
We don't have to rescan the training_set to form the leaf
subsample_sets. We can just use the P-tree sets for those samples
That solves the problem (see 1. above).
Revising from 4. onward then (and expressing SubSampleSets as
revised P-trees):
.--- B2=0000 - > C2:0011
|--- B2=0001 - > C2:0011
|--- B2=0010 - > C3:0111
|--- B2=0011 - > Sample_Set_2
|--- B2=0100 - > C2:0011
|--- B2=0101 - > C2:0011
|--- B2=0110 - > C2:0011
B2 --|--- B2=0111 - > C2:0011
|--- B2=1000 - > C2:0011
|--- B2=1001 - > C2:0011
|--- B2=1010 - > Sample_Set_4
|--- B2=1011 - > Sample_Set_5
|--- B2=1100 - > C2:0011
|--- B2=1101 - > C2:0011
|--- B2=1110 - > C2:0011
`--- B2=1111 - > C2:0011
For Sample_Set_2 (for B2=0011=3) (only 2 classes have count>0)
-------------++-------+-------+----------++---------+---------.
\ ||P3(4) |P3(5) |P3(8) ||P4(11) |P4(15) |
\ ||6 |2 | 8 ||11 |5 |
`-------.||0 2 0 4|0 2 0 0| 4 0 4 0 ||3 4 0 4 |1 0 4 0 |
\| 02 | 13 | ||1 |1 |
ci|P1(ci)^P2(3)=======+=======+==========++=========+=========|
3| 2 ||0 |0 | 2 ||1 |1 |
| 2 0 0 0 || | | 2 0 0 0 ||1 0 0 0 |1 0 0 0 |
| 13 || | | 13 ||3 |1 |
--+----------++-------+-------+----------++---------+---------|
7| 2 ||2 |0 | 0 ||2 |0 |
| 0 2 0 0 ||0 2 0 0| | ||0 2 0 0 | |
| 02 || 02 | | || 02 | |
--+----------++-------+-------+----------++---------+---------
EXPECTED INFO needed to classify the sample:
I = I(s1,s2) = -SUM(i=1,2)[ pi * log2(pi) ],
m = 2 s = 4 (Sample_Set_2 holds 4 samples)
si = 2,2 (rootcounts for the class labels, rc(P1(ci)^P2(3))'s)
pi = si/s = 1/2, 1/2
I = -(2/4*lg2(2/4) + 2/4*lg2(2/4))
= -( -.5 -.5 ) = 1
________________________________________________________
Take B3 = {a1,a2,a3} = {4,5,8} as the 1st candidate attribute.
Aj={t:t(B3)=aj}, where a1=0100, a2=0101, a3=1000,
sij is number of samples of class, Ci, in a subset, Aj.
sij=rc(P1(ci)^P2(3)^P3(aj)) where ci in {3,7} and aj in {4,5,8}
ENTROPY based on the partition into subsets by B3 is
E(B3)=SUM(j=1..v)[ (s1j+..+smj)*I(s1j..smj)/s ] where
I(s1j..smj)=-SUM(i=1..m)[pij*log2(pij)], pij=sij/|Aj|
The sij's are:
j=1 j=2 j=3
--- --- ---
0 0 2 <-- s1j
2 0 0 <-- s2j
--- --- ---
2 0 2 <- s1j+s2j
2 0 2 <- |Aj| (divisors)
0 undefined 1 <- p1j
1 undefined 0 <- p2j
(the undefined terms are dropped)
0 0 <- p1j*log2(p1j)
0 0 <- p2j*log2(p2j)
-- --- ---
0 0 <- I(s1j,s2j)=- sum of above
2 0 2 <- s1j+s2j
0 0 0 <- (s1j+s2j)*I(s1j,s2j)/4
0 <- E(B3) (sum of above)
1 <- I(s1,s2)
GAIN(B3) = 1 <- I(s1,s2) - E(B3)
Continuing with B4=A={a1..av} used to classify S into {A1..Av},
---------------------------------------------------------------
Take B4 = {a1,a2} = {11,15} as the 2nd candidate attribute.
Aj={t:t(B4)=aj}, where a1=1011, a2=1111
sij is number of samples of class, Ci, in a subset, Aj.
so sij = rc( P1(ci)^P2(3)^P4(aj) ), where ci is in {3,7}
and aj is in {11,15}.
The sij's are:
j=1 j=2
--- ---
1 1 <-- s1j
2 0 <-- s2j
--- ---
3 1 <- s1j+s2j
3 1 <- |Aj| (divisors)
.333 1 <- p1j
.667 0 <- p2j
-.528 0 <- p1j*log2(p1j)
-.390 0 <- p2j*log2(p2j)
-- ---
.918 0 <- I(s1j,s2j)=- sum of above
3 1 <- s1j+s2j
.689 0 <- (s1j+s2j)*I(s1j,s2j)/4
.689 <- E(B4) (sum of above)
1 <- I(s1,s2)
GAIN(B4) = .311 <- I(s1,s2) - E(B4)
GAIN(B3) = 1 <- I(s1,s2) - E(B3)
So B3 is the decision attribute and so forth.
Note that no database scan has been needed at all!
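For comparison, here is a compact ID3 sketch run on the same 16-tuple training set. Note that it scans the rows of S rather than ANDing Ptrees, so it only mirrors the attribute choices, not the scan-free technique; the tuple-based tree representation and the names `ATTRS`, `info`, `gain` and `id3` are assumptions for illustration.

```python
from collections import Counter
from math import log2

# Training set S as (B1, B2, B3, B4) in decimal; B1 is the class label.
S = [(3, 7, 8, 11), (3, 3, 8, 15), (7, 3, 4, 11), (7, 2, 5, 11),
     (3, 7, 8, 11), (3, 3, 8, 11), (7, 3, 4, 11), (7, 2, 5, 11),
     (2, 11, 8, 15), (2, 11, 8, 15), (10, 10, 4, 11), (15, 10, 4, 11),
     (2, 11, 8, 15), (10, 11, 8, 15), (15, 10, 4, 11), (15, 10, 4, 11)]
ATTRS = {"B2": 1, "B3": 2, "B4": 3}  # attribute name -> column index

def info(rows):
    """Expected information of the class distribution in rows."""
    counts = Counter(r[0] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def partition(rows, col):
    parts = {}
    for r in rows:
        parts.setdefault(r[col], []).append(r)
    return parts

def gain(rows, col):
    parts = partition(rows, col)
    return info(rows) - sum(len(p) / len(rows) * info(p) for p in parts.values())

def id3(rows, attrs):
    """Return a class label (leaf) or (attribute, {value: subtree})."""
    classes = Counter(r[0] for r in rows)
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]  # pure node, or plurality class
    best = max(attrs, key=lambda a: gain(rows, ATTRS[a]))
    rest = [a for a in attrs if a != best]
    return (best, {v: id3(p, rest)
                   for v, p in sorted(partition(rows, ATTRS[best]).items())})

tree = id3(S, ["B2", "B3", "B4"])
print(tree[0])     # root decision attribute
print(tree[1][3])  # subtree for B2=0011 (Sample_Set_2)
```

With these rows the root attribute comes out as B2 and the subtree for B2=0011 splits on B3, in line with the walk-through above. Unlike step 4, this sketch creates branches only for attribute values that occur in the data.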
Bayesian Classification
Bayesian classifiers are statistical classifiers
7.4.1 Bayes Theorem
Let X be a data sample whose class label is unknown.
Let H be a hypothesis (i.e., X belongs to class C).
P(H|X) is the posterior probability of H given X.
P(H) is the prior probability of H.
Bayes Theorem:
P(H|X) = P(X|H)P(H)/P(X)
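A small worked instance of Bayes Theorem, using counts read off the 16-pixel training set S from the decision tree example: H is "class B1=0011" (4 of 16 pixels), X is "B4=1011" (11 of 16 pixels), and 3 of the 4 class-0011 pixels have B4=1011. The choice of this particular H and X is only for illustration.

```python
from fractions import Fraction

p_h = Fraction(4, 16)          # P(H): prior probability of class 0011
p_x = Fraction(11, 16)         # P(X): probability that B4 = 1011
p_x_given_h = Fraction(3, 4)   # P(X|H): B4 = 1011 within class 0011

p_h_given_x = p_x_given_h * p_h / p_x  # Bayes Theorem
print(p_h_given_x)  # 3/11
```

The result agrees with a direct count: 3 of the 11 pixels with B4=1011 belong to class 0011.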
7.4.2 Naive Bayesian Classification
1. Each data sample is represented by feature vector, X=(x1..,xn)
depicting the measurements made on the sample from A1,..An, resp.
2. Given classes C1,...,Cm, the naive Bayesian Classifier predicts
that an unknown data sample X (with no class label) belongs to the
class Cj having the highest posterior probability conditioned on X
(called the maximum a posteriori hypothesis):
( P(Cj|X) > P(Ci|X) for all i ≠ j ).
P(Cj|X) = P(X|Cj)P(Cj)/P(X)
3. P(X) is constant for all classes so we maximize P(X|Cj)P(Cj).
If we assume equal likelihood of classes, maximize P(X|Cj);
else P(Ci) is estimated as si/s.
From the PC-cube we see that s is the overall tuple count
and si is the rootcount of DRollup[Bcube->C]i
(thus, it is rc(VPCtree[Ci]) assuming C=Bn = rc(PCn1* AND
... AND PCnm* where Ci is an m-bit string and there is
a * for each 0 bit in the string)
4. To reduce the computational complexity of calculating all
P(X|Cj)'s, the naive assumption of conditional independence of
values is often made (hence the name "Naive Bayesian"),
thus, P(X|Ci)=P(x1|Ci)*..*P(xn|Ci).
For categorical attributes, P(xk|Ci)=sixk/si where sixk= # of
training samples of class, Ci, having Ak-value xk
(PCn,Ci ^ PCk,xk, which is one AND program).
For continuous attributes, use Gaussian distribution to estimate
P(xk|Ci).
Once the P(xk|Ci)'s are estimated, the model is "trained".
Example:
Consider the training set, S, where B1 is the class label attribute.
S:
B1 B2 B3 B4
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011
                __C1___   __C2___   __C3___   __C4___   __C5___
                P1,0010   P1,0011   P1,0111   P1,1010   P1,1111
rootcount          3         4         4         2         3
quadrant rc's   0 0 3 0   4 0 0 0   0 4 0 0   0 0 1 1   0 0 0 3
                P2,0010   P2,0011   P2,0111   P2,1010   P2,1011
rootcount          2         4         2         4         4
quadrant rc's   0 2 0 0   2 2 0 0   2 0 0 0   0 0 0 4   0 0 4 0
                x2=0010   x2=0011   x2=0111   x2=1010   x2=1011
                   0         0         0         0         1      <-- s1x2/s1
                   0         .5        .5        0         0      <-- s2x2/s2
                   .5        .5        0         0         0      <-- s3x2/s3
                   0         0         0         .5        .5     <-- s4x2/s4
                   0         0         0         1         0      <-- s5x2/s5
                P3,0100   P3,0101   P3,1000
rootcount          6         2         8
quadrant rc's   0 2 0 4   0 2 0 0   4 0 4 0
                x3=0100   x3=0101   x3=1000
                   0         0         1      <-- s1x3/s1
                   0         0         1      <-- s2x3/s2
                   .5        .5        0      <-- s3x3/s3
                   .5        0         .5     <-- s4x3/s4
                   1         0         0      <-- s5x3/s5
                P4,1011   P4,1111
rootcount         11         5
quadrant rc's   3 4 0 4   1 0 4 0
                x4=1011   x4=1111
                   0         1      <-- s1x4/s1
                   .75       .25    <-- s2x4/s2
                   1         0      <-- s3x4/s3
                   .5        .5     <-- s4x4/s4
                   1         0      <-- s5x4/s5
5. In order to classify an unknown sample, X, P(X|Ci)P(Ci) is
evaluated for each i, then X is assigned to the class for which
it is maximum. ( Evaluate P(x1|Ci)*..*P(xn|Ci) * P(Ci) )
Take X = (x2, x3, x4) = (0011, 1000, 1011):
              sixk/si's
class        x2    x3    x4     si/s    P(X|Ci)*P(Ci)
--------    ----------------    ----    -------------
C1=0010      0     1     0      3/16         0
C2=0011     1/2    1    3/4     4/16        3/32
C3=0111     1/2    0     1      4/16         0
C4=1010      0    1/2   1/2     2/16         0
C5=1111      0     0     1      3/16         0
So X is classified as C2.
So we see that, once the conditional probabilities, sixk/si, are
derived from the P-trees, any new sample can be classified
instantly.
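The worked example can be reproduced with a short Python sketch. It counts tuples directly instead of taking P-tree rootcounts (an assumption made only to keep the sketch self-contained), and the names `train` and `classify` are illustrative, not from the notes. Exact fractions are used so the scores can be compared with the table.

```python
from collections import Counter
from fractions import Fraction

# Training set S; the first column (B1) is the class label.
S = [line.split() for line in """\
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011""".splitlines()]

def train(rows):
    """Return the counts si (per class) and sixk (per class, attribute, value)."""
    class_counts = Counter(r[0] for r in rows)
    value_counts = Counter()
    for r in rows:
        for k, v in enumerate(r[1:], start=2):
            value_counts[(r[0], k, v)] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    """Score each class by P(Ci) * PRODUCT(k)[ P(xk|Ci) ] and return the best one."""
    s = sum(class_counts.values())
    scores = {}
    for c, si in class_counts.items():
        score = Fraction(si, s)                              # P(Ci) = si/s
        for k, v in enumerate(x, start=2):
            score *= Fraction(value_counts[(c, k, v)], si)   # P(xk|Ci) = sixk/si
        scores[c] = score
    return max(scores, key=scores.get), scores

class_counts, value_counts = train(S)
label, scores = classify(("0011", "1000", "1011"), class_counts, value_counts)
print(label, scores[label])
```

Only C2 = 0011 gets a non-zero score, which matches the table.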
How effective are Naive Bayesian Classifiers?
- In theory they have low error rates in comparison to other
classifiers.
- In practice this is not always true, because the assumptions
(e.g., class conditional independence) may not be valid.
- Various empirical studies have found Naive Bayesian
Classifiers to be comparable to decision tree and neural
network classifiers in many domains.
- They also provide a theoretical justification for other
classifiers that do not explicitly use Bayes Theorem
(e.g., under certain assumptions it can be shown that NN and
curve-fitting algorithms (e.g., ID3) output the "maximum
a posteriori hypothesis", as does the Naive Bayesian Classifier).
7.4.3 Bayesian Belief Networks (to handle cases where the naive
assumption doesn't hold)
- The Naive Assumption of "class conditional independence"
(given the class label of a sample, the values of the
attributes are conditionally independent of one another)
allows use of the simplifying formula:
P(X|Ci)=P(x1|Ci)*..*P(xn|Ci). When the assumption is true, this
produces the most accurate classifier of all.
- In practice dependencies can exist between attributes
(variables).
- In spatial datasets, one approach would be to select out
attributes that are independent and then use Naive
Bayesian Classifiers. (e.g., select RIR and leave out
G since there is correlation between them)
- However, with PC-trees we have a way to calculate P(X|Ci)
directly.
It is the AND of the tuple PC-tree for X with the value
PC-tree for Ci (noting that X is a tuple in Rel[X]
not Rel - eliminating Coord and C)
- Bayesian Belief Networks specify the joint conditional
probabilities and allow class conditional independence
to be defined between subsets of attributes (variables),
namely those subsets that are conditionally
independent of one another.
- Note that the notion of functional dependence in
normalizing relations
is a specification of conditional dependence.
A Belief Network (or Bayesian Belief Network or Bayesian
Network or Probabilistic Network) is composed of two
components:
1. a directed acyclic graph (nodes=random variables, which may be
actual attributes or "hidden variables" such as a medical syndrome
in medical data; edges=probabilistic dependencies)
- each variable is conditionally independent of its
non-descendants, given its parents.
2. a Conditional Probability Table (CPT) for each variable,
Z, specifying P(Z|parents(Z)) for every combination of parent values.
7.4' Non-Naive Bayesian Classifier (New section, shortcut to
Bayesian Belief Net for spatial data with Ptrees).
We can use Bayesian Classification directly, without making the
simplifying Naive assumption that P(X|Ci)=P(x1|Ci)*..*P(xn|Ci), since we can
compute the actual P(X|Ci) directly (in fact it is a simpler
program than the above) as: TPC(X) ^ VPC(Ci).
We do not need Bayesian belief networks to estimate these numbers!
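A minimal sketch of the direct (non-naive) computation follows. It assumes that the rootcount of TPC(X) ^ VPC(Ci) equals the number of training tuples of class Ci whose non-class values match X exactly, and it simulates that count with a plain scan over the example training set; `direct_scores` is an illustrative name, not part of the notes.

```python
from fractions import Fraction

# Example training set S; the first column (B1) is the class label.
S = [line.split() for line in """\
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011""".splitlines()]

def direct_scores(x, rows):
    """P(X|Ci)*P(Ci) with P(X|Ci) counted directly, no independence assumption."""
    s = len(rows)
    scores = {}
    for c in sorted({r[0] for r in rows}):
        si = sum(1 for r in rows if r[0] == c)
        # Stand-in for rc(TPC(X) ^ VPC(Ci)): tuples of class c equal to x.
        both = sum(1 for r in rows if r[0] == c and tuple(r[1:]) == tuple(x))
        scores[c] = Fraction(both, si) * Fraction(si, s)
    return scores

scores = direct_scores(("0011", "1000", "1011"), S)
best = max(scores, key=scores.get)
print(best, scores[best])
```

For X = (0011, 1000, 1011) exactly one training tuple matches, and it belongs to class 0011, so the direct method agrees with the naive classifier here. When no tuple matches X exactly, every score is 0, which is where a near-neighbor approach is needed.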
Bayesian Classifiers
Classification by Backpropagation
- A Neural Network is a set of connected input/output units.
- Each connection has a weight.
- In the learning phase, the network adjusts the weights so that it learns to
predict the class of the input samples.
- Backpropagation is a particular Neural Network learning algorithm.
- It operates on a "multilayer feed-forward network".
Multi-layer Feed-Forward Neural Network
Input Hidden Output
layer layer layer
.----. .----. .----.
x1 | |----------| |----------| |- >
`----'-. .-`----'-. .-`----'
\ `-..-' / \ `-..-' /
.----..\-' `-/..----..\-' `-/..----.
x2 | |--\----/--| |--\----/--| |- >
`----'\ \ / /`----'\ \ / /`----'
. \ \/ / . \ \/ / .
`./\.' `./\.'
. /\/\ . /\/\ .
/ /\ \ / /\ \
. /.' `.\ . /.' `.\ .
.----./' `\.----./' `\.----.
xi | |----------| |----------| |- >
`----' wij `----' wjk `----'
Oj Ok
- Inputs correspond to attributes from training samples.
- Weighted outputs of the Input units are fed to the Hidden
units (many Hidden layers?).
- Weighted outputs of last Hidden layer's units are fed to the
Output units.
- Output units emit the network's prediction for the given
samples.
- Hidden and Output units are often referred to as "neurodes".
- An n-layer NN has n layers other than the Input layer
(i.e., counting the Hidden and Output layers).
- The NN is "feed-forward" since none of the weights cycle back to
an input unit or to an output unit of a previous layer.
Defining a Network Topology
- Specify the number of Input units
- Specify the number of Hidden layers
- Specify the number of Hidden units in each Hidden layer
- Specify the number of Output units
- Normalizing the input values for each attribute in the training
set speeds up training.
Backpropagation
- learns by iteratively
- processing a set of training samples and
- comparing the network's prediction for each sample with
the actual known class label.
- For each training sample, weights are modified to minimize the
mean-square error between the network's prediction and the
actual class.
- These modifications are made in a "backwards" direction, from the
Output layer, through each Hidden layer, to the Input layer.
- The weights will (usually) eventually converge, and the
learning process stops.
The Backpropagation Algorithm:
(1) Initialize all weights and biases in "network";
(2) while terminating condition is not satisfied {
(3)   for each training sample X in "samples" {
(4)     // Propagate the inputs forward:
(5)     for each hidden or output layer unit j {
(6)       Ij = SUM(i)[ wij*Oi ] + THETAj;
          // compute the net input of unit j wrt the previous layer, i
(7)       Oj = 1/(1+e^(-Ij)); }  // compute the output of each unit, j
(8)     // Backpropagate the errors:
(9)     for each unit j in the output layer
(10)      Errj = Oj(1-Oj)(Tj-Oj);  // compute the error
(11)    for each unit j in the hidden layers, from last to
        1st hidden layer
(12)      Errj = Oj(1-Oj)*SUM(k)[ Errk*wjk ];
          // compute the error with respect to the next higher layer, k
(13)    for each weight wij in "network" {
(14)      DELTAwij = (l)*Errj*Oi;  // weight increment
(15)      wij = wij + DELTAwij; }  // weight update
(16)    for each bias THETAj in "network" {
(17)      DELTA(THETAj) = (l)*Errj;  // bias increment
(18)      THETAj = THETAj + DELTA(THETAj); }  // bias update
(19) }}
(1) The weights in the network are initialized to small
random numbers (eg, -1 to 1 or?).
Each unit has a "bias", also initialized to a small random number.
For layer j, j=1..m (the Input layer is layer 0, so there are m+1 layers
in total), the weights on the inputs to layer j form a matrix, Wj, and
the biases form a vector, Zj. E.g., for layer 1:
| w11 w12... w1n1 |
| w21 w22... w2n1 |
| w31 w32... w3n1 |
| . | = W1
| . |
| . |
|wn01 wn02.. wn0n1|
| z(1)1 |
| z(1)2 |
| . | = Z1
| . |
| z(1)n1|
etc.
(4)-(7) Net input to Hidden/Output unit,
j: Ij=SUM(i)[wij*Oi]+zj where wij is the weight of the
connection from unit, i, in the previous layer to unit, j;
Oi is the output of unit i; zj is the bias of unit j
(it acts as a threshold, varying the activity of the unit).
Each unit takes its net input and applies an "activation function"
- it symbolizes the activation of the neuron
- the logistic or sigmoid fctn (or "squashing" fctn, since it maps
a large input domain into the interval (0,1)) is used: Given net
Input, Ij, to unit j, the output of unit j
is: O'j = 1/(1+e^-Ij)
- the logistic fctn is nonlinear and differentiable,
allowing the backprop algorithm to model classification
problems that are linearly inseparable
So the output, O'j, of unit-j, given
- output from previous-layer, unit-i of Oi,
- connection weight, wij,
- bias zj,
is obtained from the net inputs of the layer:
                     | w11       w12    ...  w1,nj      |
                     | w21       w22    ...  w2,nj      |
(O1 O2..On(j-1))  *  | w31       w32    ...  w3,nj      |  +  ( z1 z2 ... znj )  =  ( I1 ... Inj )
                     |  .                               |
                     |  .                               |
                     | wn(j-1),1        ...  wn(j-1),nj |
and at each layer,
                               1
Oj = f(Ij) = ----------------------------------
                    -(SUM(i)[wij*Oi] + zj)
              1 + e
We will write it using matrix notation as follows:
At layer j,
outputs from the previous layer are O(j-1)
weights are Wj
biases are Zj
inputs are Ij
outputs are Oj (after applying the activation fctn)
O(j-1)*Wj+Zj => Ij => Oj=f(Ij)
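The layer-by-layer rule O(j-1)*Wj+Zj => Ij => Oj=f(Ij) can be sketched in plain Python as below. The 3-2-1 network and all of its weights and biases are hypothetical values chosen only for illustration; `layer_forward` and `forward` are likewise illustrative names.

```python
from math import exp

def f(v):
    """Logistic (sigmoid) activation fctn."""
    return 1.0 / (1.0 + exp(-v))

def layer_forward(o_prev, w, z):
    """Oj = f(O(j-1)*Wj + Zj); w[i][j] is the weight from unit i to unit j."""
    net = [sum(o_prev[i] * w[i][j] for i in range(len(o_prev))) + z[j]
           for j in range(len(z))]
    return [f(v) for v in net]

def forward(x, layers):
    """Feed x through a list of (W, Z) pairs, one pair per non-input layer."""
    out = list(x)
    for w, z in layers:
        out = layer_forward(out, w, z)
    return out

# Hypothetical network: 3 Input units, 2 Hidden units, 1 Output unit.
W1 = [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]]
Z1 = [-0.4, 0.2]
W2 = [[-0.3], [-0.2]]
Z2 = [0.1]

y = forward([1, 0, 1], [(W1, Z1), (W2, Z2)])
print(y)
```

With these made-up weights the single output comes out a little below 0.5. Setting all weights and biases to 0 makes every unit output exactly 0.5, which is a quick sanity check on the activation fctn.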
(8)-(18) The error is propagated backwards once the output of
the Output layer is computed, to update the weights and biases.
For Output unit m, Errm=Om(1-Om)(Tm-Om), where Om is the "actual"
output and Tm is the "true" output based on the known class
label of the training sample.
Noting that for f(x)= 1/(1+e^-x), f'(x)= e^-x / (1+e^-x)^2
and f(x)*(1-f(x))= 1/(1+e^-x) * (1 - 1/(1+e^-x)) =
e^-x / (1+e^-x)^2,
we see that Om(1-Om) is the slope, f', of the logistic fctn at unit m,
so we are just using a straight line assumption as to
the input DELTA value since DELTA(x) = y' * DELTA(y),
where y'=Om(1-Om) and DELTA(y) = (Tm-Om)
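The identity f'(x) = f(x)*(1-f(x)) for the logistic fctn can be checked numerically. This is a small sketch; the sample points and the central-difference step `h` are arbitrary choices.

```python
from math import exp

def f(x):
    """Logistic fctn."""
    return 1.0 / (1.0 + exp(-x))

def f_prime(x):
    """Closed-form derivative of the logistic fctn."""
    return exp(-x) / (1.0 + exp(-x)) ** 2

h = 1e-6
max_gap_identity = 0.0
max_gap_numeric = 0.0
for x in (-3.0, -0.5, 0.0, 0.7, 2.0):
    # Gap between the closed form and f(x)*(1-f(x)).
    max_gap_identity = max(max_gap_identity, abs(f_prime(x) - f(x) * (1 - f(x))))
    # Gap between the closed form and a central finite difference.
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    max_gap_numeric = max(max_gap_numeric, abs(f_prime(x) - numeric))

print(max_gap_identity, max_gap_numeric)
```

Both gaps are at the level of floating-point noise, so the factor O(1-O) in the error formulas is just the slope of the activation fctn at the unit's current output.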
For the error of a unit j in a Hidden layer, use the weighted sum of the errors
of the units connected to j in the next layer:
Errj = Oj(1-Oj)*SUM(k)[Errk*wjk]
where wjk=weight of connection from unit-j to a unit-k in the
next higher layer and Errk is the error of unit-k.
Weights are updated: DELTAwij = (l) * Errj * Oi
and wij = wij + DELTAwij
where l=learning rate, a constant, typically in (0,1).
- Backpropagation learns using a method of gradient descent
to search for a set of weights that can model the given
classification problem so as to minimize the mean squared
distance between the network's class prediction and the
actual class label of samples.
The learning rate helps to avoid getting stuck at a local
minimum in decision space. If it is too low, learning is very slow;
if it is too high, thrashing between suboptimal solutions can occur.
A rule of thumb is to set the learning rate to 1/t,
where t=number of iterations through the training set so far.
Biases are updated: DELTA(zj) = (l) * Errj
and zj = zj + DELTA(zj)
Here we are updating the weights and biases after presentation of
each sample (case updating). Alternatively, weight and bias
updates (DELTAs) can be accumulated in variables so that
updating can be applied after the entire training set has been
presented (epoch updating).
(one iteration through the training set is an epoch)
In theory (mathematical) epoch updating is better, yet in
practice, case updating is more common since it tends to yield
more accurate results.
(2)-(3) Training stops when either
- all DELTAwij in the previous epoch were so small as to be
below some threshold or
- the % of samples misclassified in the previous epoch is
below some threshold or
- a pre-specified number of epochs has expired.
(in practice several hundred thousand epochs may be required.)
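Steps (4)-(18) for a single training sample (one "case update") can be sketched as follows for a network with one Hidden layer. The network shape, the starting weights and biases, the target and the learning rate l=0.9 are all hypothetical values chosen for illustration, and `case_update` is an illustrative name.

```python
from math import exp

def f(v):
    return 1.0 / (1.0 + exp(-v))

def forward(x, w_h, z_h, w_o, z_o):
    """Return (Hidden outputs, Output outputs) for input x."""
    hid = [f(sum(x[i] * w_h[i][j] for i in range(len(x))) + z_h[j])
           for j in range(len(z_h))]
    out = [f(sum(hid[j] * w_o[j][k] for j in range(len(hid))) + z_o[k])
           for k in range(len(z_o))]
    return hid, out

def case_update(x, t, w_h, z_h, w_o, z_o, l):
    """One case update, modifying weights and biases in place."""
    hid, out = forward(x, w_h, z_h, w_o, z_o)
    # Output layer: Errk = Ok(1-Ok)(Tk-Ok)
    err_o = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Hidden layer: Errj = Oj(1-Oj)*SUM(k)[Errk*wjk], using the old weights
    err_h = [hid[j] * (1 - hid[j]) *
             sum(err_o[k] * w_o[j][k] for k in range(len(out)))
             for j in range(len(hid))]
    # DELTAwij = l*Errj*Oi, where Oi is the output of the source unit i
    for j in range(len(hid)):
        for k in range(len(out)):
            w_o[j][k] += l * err_o[k] * hid[j]
    for i in range(len(x)):
        for j in range(len(hid)):
            w_h[i][j] += l * err_h[j] * x[i]
    # DELTA(zj) = l*Errj
    for k in range(len(out)):
        z_o[k] += l * err_o[k]
    for j in range(len(hid)):
        z_h[j] += l * err_h[j]
    return err_o

# Hypothetical 3-2-1 network and one training sample with target 1.
w_h = [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]]
z_h = [-0.4, 0.2]
w_o = [[-0.3], [-0.2]]
z_o = [0.1]
x, t = [1, 0, 1], [1]

before = forward(x, w_h, z_h, w_o, z_o)[1][0]
err = case_update(x, t, w_h, z_h, w_o, z_o, l=0.9)
after = forward(x, w_h, z_h, w_o, z_o)[1][0]
print(before, err[0], after)
```

After the update the output for this sample moves toward its target of 1. Epoch updating would instead accumulate the DELTAs over all samples and apply them once per pass through the training set.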
Input Hidden Output
.----. .----. .----.
x1 | x1 |----------| X1 |----------| y1 |- > y1
`----'-. .-`----'-. .-`----'
\ `-..-' / \ `-..-' /
.----..\-' `-/..----..\-' `-/..----.
x2 | x2 |--\----/--| X2 |--\----/--| y2 |- > y2
`----'\ \ / /`----'\ \ / /`----'
. \ \/ / . \ \/ / .
`./\.' `./\.'
. /\/\ . /\/\ .
/ /\ \ / /\ \
. /.' `.\ . /.' `.\ .
.----./' `\.----./' `\.----.
xI | xI |----------| XJ |----------| yK |- > yK
`----' wij zj`----' Wjk ZK`----'
(x1..xI)*|w11..w1J|+|z1|=>f=>(X1..XJ)*|W11..W1K|+|Z1|=>f=>(y1..yK)
| . . | |. | |. . | |. |
|wI1..wIJ| |zJ| |WJ1..WJK| |ZK|
**************************************************************
Other Classification Methods
k-nearest Neighbor Classifiers (based on learning by analogy)
- unknown samples are assigned to the most common class among
their k-nearest neighbors in n-space.
- instance based.
- lazy or "as you go" learner (by contrast to decision trees, where
the classifier is constructed before new samples are considered)
- With respect to spatial data in REL organization, if B1 is the
class label attribute, what should be meant by the k-nearest
ngbrs?
- Let's assume there is one REL dataset for learning and the new
samples are separate from it (e.g., for RGBY data, take the
point of view that we use last year's dataset with RGB and Y
to train and are interested in classifying this year's RGB data
to predict the Y).
A Spatial k-nearest ngbr algorithm
Assume we have basic Ptrees for the training set.
We find the k-nearest ngbrs to a new sample, x, and then
predict the class of x to be the majority class among those
k ngbrs.
So we will find the closest k (or more) training tuples, based on
a weighted Manhattan distance on the non-class attribute values
(e.g., if B1 is the Class label attribute,
wm_dis(x,y) = SUM(i=2..n)[wi*|yi-xi|], where 0 [...]).
Let Px be the AND of the value Ptrees for the non-class attribute
values of x (as in the example below).
1. If rc(Px) >= k, done.
(class label is the one that gives the max rootcount when its
Ptree is ANDed with Px - i.e., we compute rc(Px^Pci) for each
class label, ci, and assign the one that gives max rootcount.)
2. If rc(Px) < k, remove the lowest-order bit from the
highest-weight band value of x,
(we will call the resulting tuple, x also - since it is just
the original tuple x, with its Bi-value generalized one
level up the value concept hierarchy to a 7-bit value instead
of an 8-bit value).
Repeat 1 and 2 until rc(Px) >= k
(note, when we have removed the low order or 8th bit from all
of the non-class-label attributes of x, we proceed to removing
the 7th bit one attribute at a time, then the 6th bit and so
forth.)
(note, we can decide to remove several bits at a time so as to
reduce the complexity. We may get a ngbr set that has many more
than k ngbrs in it but that shouldn't be a problem. If for some
reason it seems important to get the smallest ngbr set that
qualifies (closest to k) rather than ordering the attributes by
"importance" we could calculate the ngbr set size for each
attribute during each "bit removal pass" and pick the one that
gives us the best ngbrset... Lots of variations are possible.)
(note, while calculating the rc's above it would make sense to
have an accumulator for the rc's of each attribute value
for several of the passes (8-bit, 7-bit, ... values).
This can be done with a single-scan parallel program (lots of
accumulators, however). This gives us maximum flexibility in
deciding the best ngbrset. We could also compute the
Px^Pci rootcounts during this single scan pass.)
(Note, in the event that we get through 1-bit values without the
ngbrset reaching size, k, (could that happen?
How? and if so, what could be done about it?) we could
resort to the traditional training set scan to classify that
particular new sample.)
Example:
Training Dataset (B1 is the class label attribute and k=5):
X-Y B1 B2 B3 B4
0,0 0011 0111 1000 1011
0,1 0011 0011 1000 1111
0,2 0111 0011 0100 1011
0,3 0111 0010 0101 1011
1,0 0011 0111 1000 1011
1,1 0011 0011 1000 1011
1,2 0111 0011 0100 1011
1,3 0111 0010 0101 1011
2,0 0010 1011 1000 1111
2,1 0010 1011 1000 1111
2,2 1010 1010 0100 1011
2,3 1111 1010 0100 1011
3,0 0010 1011 1000 1111
3,1 1010 1011 1000 1111
3,2 1111 1010 0100 1011
3,3 1111 1010 0100 1011
Consider the new sample: x = ---- 1011 1000 1111
The basic PC_trees in PQ-list form:
PQ11: 23 3
PQ12: 1 31 32 33
PQ13: pure
PQ14: 0 1 31 32 33
PQ21: 2 3
PQ22: 00 02
PQ23: 0 1 2 3
PQ24: 0 10 12 2
PQ31: 0 2
PQ32: 1 3
PQ33: null
PQ34: 11 13
PQ41: pure
PQ42: 01 2
PQ43: pure
PQ44: pure
Assume the weights order the bands from high to low as B2,B3,B4.
Again, the new sample is: x = ---- 1011 1000 1111
C = {0010 0011 0111 1010 1111} (class labels)
The needed PQ-seq's are:
PQ1,0010: 20 21 22
PQ1,0011: 0
PQ1,0111: 1
PQ1,1010: 23 30
PQ1,1111: 31 32 33
PQ2,1011: 2
PQ3,1000: 0 2
PQ4,1111: 00 02 2
1. If rc(Px) >= k done. (class is s.t. rc(Px^Pci) is max.)
Px: 2 rc(Px)=4 NOT >= k=5
2. If rc(Px) < k, remove the low-order bit from the next band value...
Take off the low-order bit from B2:
PQ2,101 : 2 3 (gives the same result for Px so do same with B3)
PQ3,100 : 0 2 (gives the same result)
PQ4,111 : 00 02 2 (gives the same result)
next loworder bit removal:
PQ2,10 : 2 3 (gives the same result for Px so do same with B3)
PQ3,10 : 0 2 (gives the same result)
PQ4,11 : 00 02 2 (gives the same result)
next loworder bit removal:
PQ2,1 : 2 3 (gives the same result for Px so do same with B3)
PQ3,1 : 0 2 (gives the same result)
PQ4,1 : pure
------------
PQx: 2 (gives the same result)
next loworder bit removal (B2 now has no bits left, so it matches every tuple):
PQ2 : pure
PQ3,1 : 0 2
PQ4,1 : pure
------------
PQx: 0 2 has rc = 8 >= 5.
Since rc(Px) >= k, the class is s.t. rc(Px^Pci) is max:
PQ1,0010: 20 21 22
PQx: 0 2
--------------
20 21 22 rc= 3
PQ1,0011: 0
PQx: 0 2
--------------
0 rc= 4
PQ1,0111: 1
PQx: 0 2
--------------
null rc= 0
PQ1,1010: 23 30
PQx: 0 2
--------------
23 rc= 1
PQ1,1111: 31 32 33
PQx: 0 2
--------------
null rc= 0
Thus, the class label for x is 0011
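The bit-removal procedure and the example above can be sketched in Python. The sketch counts matching tuples directly instead of ANDing Ptrees (an assumption made so the code is self-contained), and it removes one low-order bit at a time in the band order B2, B3, B4; `neighbours` and `classify` are illustrative names.

```python
from collections import Counter

# Training set in raster order (X-Y = 0,0 .. 3,3); first column is B1 (class).
S = [line.split() for line in """\
0011 0111 1000 1011
0011 0011 1000 1111
0111 0011 0100 1011
0111 0010 0101 1011
0011 0111 1000 1011
0011 0011 1000 1011
0111 0011 0100 1011
0111 0010 0101 1011
0010 1011 1000 1111
0010 1011 1000 1111
1010 1010 0100 1011
1111 1010 0100 1011
0010 1011 1000 1111
1010 1011 1000 1111
1111 1010 0100 1011
1111 1010 0100 1011""".splitlines()]

def neighbours(x, rows, removed):
    """Rows that agree with x after dropping removed[i] low-order bits of band i."""
    return [r for r in rows
            if all(int(v, 2) >> d == int(xv, 2) >> d
                   for v, xv, d in zip(r[1:], x, removed))]

def classify(x, rows, k, bits=4):
    """Widen the ngbrhd one bit at a time until it holds at least k tuples."""
    removed = [0] * len(x)
    turn = 0
    nbrs = neighbours(x, rows, removed)
    while len(nbrs) < k and min(removed) < bits:
        removed[turn % len(x)] += 1      # next band in weight order
        turn += 1
        nbrs = neighbours(x, rows, removed)
    votes = Counter(r[0] for r in nbrs)  # stands in for rc(Px^Pci) per class
    return votes.most_common(1)[0][0], len(nbrs), removed

label, size, removed = classify(("1011", "1000", "1111"), S, k=5)
print(label, size, removed)
```

The ngbr set stays at 4 tuples until all four bits of B2 have been removed, at which point it jumps to 8 tuples and the majority class is 0011, as in the example.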
*********Notes *********************************************
Problems?
1. Consider a sample that is positioned right near a quadrant
boundary with large numbers of ngbrs across it, so that
it has ngbrs which don't appear to be ngbrs in the Ptree.
(This may not be a problem, since we are dealing
with whole values. The real problem is 2.)
2. For a value like 0111, note that it is at the edge,
not the middle, of the intervals,
[0110,0111],
[0100,0111],
which are the ngbrhds used when removing the first 2
low-order bits (note that the same thing happens with 1111,
but it is inevitably at the edge of all ngbrhds,
while 0111 is not).
Better, let the 1st "ngbrhd" be [0110,1000] = [6,8] and the
2nd, [0100,1001] = [4,9].
Or even better, 1st: [0110,1000] = [6,8],
2nd: [0101,1001] = [5,9],
so that the ngbrhds stay centered on 0111:
[0111,0111] [7,7]
[0110,1000] [6,8]
[0101,1001] [5,9]
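The difference between the two kinds of ngbrhd can be sketched with two small helper functions. Both names, and the clipping of the balanced interval to the 4-bit range, are assumptions made for illustration.

```python
def prefix_interval(value, removed):
    """Interval of values sharing value's bits after dropping `removed` low-order bits."""
    lo = (value >> removed) << removed
    return lo, lo + (1 << removed) - 1

def balanced_interval(value, radius, bits=4):
    """Interval centered on value, clipped to the range of a `bits`-bit band."""
    return max(0, value - radius), min(2 ** bits - 1, value + radius)

v = 0b0111
prefix = [prefix_interval(v, d) for d in (0, 1, 2)]
balanced = [balanced_interval(v, r) for r in (0, 1, 2)]
print(prefix)    # 0111 sits at the upper edge of every prefix interval
print(balanced)  # 0111 sits in the middle of every balanced interval
```

For a value such as 1111 the balanced interval is clipped at 15, so it is still one-sided, as noted above.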
Question:
In removing a loworder bit, can it be accomplished by ORing? e.g.,
To get:
PQ2,101 : 2 3
can we just OR:
PQ2,1011: 2
PQ24': 11 13 3
OR---------------
11 13 2 3
where PQ24' is the comp of PQ24: 0 10 12 2
apparently not!
Note:
P2,101 = P2,1010 v P2,1011 = (P2,101 ^ P24') v P2,1011
= (P2,101 v P2,1011) ^ (P24' v P2,1011)
= P2,101 ^ (P24' v P2,1011)
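The ORing question and the identity in the Note can be checked on the B2 column of the example dataset. In this sketch each Ptree is replaced by a plain Python set of pixel positions (0..15 in raster order), which is an assumption made for illustration; `value_set` and `bit_set` are illustrative names.

```python
# B2 column of the training set, in raster order (X-Y = 0,0 .. 3,3).
B2 = ("0111 0011 0011 0010 0111 0011 0011 0010 "
      "1011 1011 1010 1010 1011 1011 1010 1010").split()
ALL = set(range(len(B2)))

def value_set(prefix):
    """Pixels whose B2 value starts with the bit prefix (stands in for P2,prefix)."""
    return {i for i, v in enumerate(B2) if v.startswith(prefix)}

def bit_set(position):
    """Pixels with a 1 in the given bit of B2, 1 = high-order (stands in for P2<position>)."""
    return {i for i, v in enumerate(B2) if v[position - 1] == "1"}

p_101, p_1010, p_1011 = value_set("101"), value_set("1010"), value_set("1011")
p24_comp = ALL - bit_set(4)                      # P24'

union_ok = p_101 == (p_1010 | p_1011)            # P2,101 = P2,1010 v P2,1011
identity_ok = p_101 == (p_101 & (p24_comp | p_1011))
extra = (p_1011 | p24_comp) - p_101              # what plain ORing with P24' adds
print(union_ok, identity_ok, sorted(extra))
```

The plain OR of P2,1011 with P24' picks up two extra pixels, (0,3) and (1,3), which are the quadrant paths 11 and 13 seen in the OR result above. So ORing alone is not enough; the result still has to be ANDed with P2,101.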
It's clear there is no way to construct, e.g., P21 from P2,11
and the basic, P22 or its comp, since P2,11 is 1 where both
P21 and P22 are 1. Knowing where P22 is 1 doesn't tell me
which of the pixels for which P22 is 0 have a 1 in P21.
That is to say, a 0 in P2,11 where P22 is also 0 tells me
nothing about P21 at those pixels (it could be 0 or 1).
Therefore we need to retain all info on a subcube as we go to
avoid further ANDing:
So, to answer the classification question (using our "nearest
ngbr" like approach) we need to have filled in a cube:
Consider, again, the new sample: x = ---- 1011 1000 1111 and
C = {0010 0011 0111 1010 1111} (class labels).
We need the cube bounded by all of 5 B1-values
(the entire B1 dimension) and
P2,
1011 [11,11]
101 1100 [10,12]
1001 1101 [ 9,13]
1 111 [ 8,14]
0111 1111 [ 7,15]
0110 [ 6,15]
0101 [ 5,15]
01 [ 4,15]
0011 [ 3,15]
0010 [ 2,15]
0001 [ 1,15]
Of these, the ones we see in the basic algorithm
(removal of loworder bits) are:
P2,
1011 [11,11]
101 [10,11] not seen above
10 [ 8,11] not seen above
1 [ 8,15] not seen above
If we also include those needed to balance the intervals:
P2,
1011 [11,11]
101 1100 [10,12] not seen above
10 111 [ 8,14] not seen above
1 [ 8,15] not seen above
How far out should the intervals go before we stop
(and consider the new sample an outlier - at which point
we take the majority class of the ngbr-set, if there is
one, else take the majority class of the sample space)?
- One thought: stop after Radius =
ROOF{SQRT(|S|) / ROOF[SQRT(|S|/k)]}
Rationale: If the samples are uniformly distributed with
duplicity=k, each duplicity group would be at an
intersection of grid lines with the above spacing.
- E.g., for |S|=16 and k=5: SQRT(|S|/k) = SQRT(16/5) = 1.79, roof is 2,
so R = 4 / 2 = 2
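The stopping radius is easy to compute directly. A minimal sketch, with `stop_radius` as an illustrative name and ROOF implemented as `math.ceil`:

```python
from math import ceil, sqrt

def stop_radius(n_samples, k):
    """Radius = ROOF{ SQRT(|S|) / ROOF[ SQRT(|S|/k) ] }."""
    return ceil(sqrt(n_samples) / ceil(sqrt(n_samples / k)))

r = stop_radius(16, 5)
print(r)
```

For the 16-tuple example with k=5 this gives a radius of 2, so the intervals for each band are widened at most two steps before the sample is treated as an outlier.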
P2,
1011 [11,11]
101 1100 [10,12]
10 111 [ 8,14]
P3,
1000 [ 8, 8]
0111 1001 [ 7, 9]
011 1010 [ 6,10]
P4,
1111 [15,15]
111 1111 [14,15]
11 1111 [13,15]
then once the ngbrset is found, AND with the following
to classify
P1,0010
0011
0111
1010
1111
Misc Classification