This material corresponds to chapter 26 of E-N.

Here are some additional references:

 INDSU DataMining Survey
 Rakesh Agrawal's DataMining Papers
 Stanford DataMining Group
 DataMining Research Opportunities and Challenges
 Analysis of DataMining Algorithms
 QUEST Data Mining

INTRODUCTION:

What is Data Mining

Data mining is the most general form of database querying
   (in which the query is not clearly or precisely defined).

   - Usually the goal is to discover "interesting" relationships where all we have is a notion of
     what we consider interesting rather than a clear question to ask of the database.


Data mining is an important form of decision support.

Decision support tools are tools used to help make good decisions.

Data mining allows precise queries or questions to be developed incrementally without having a
   clearly defined goal at the beginning.


DATA MINING is extracting previously unknown or unclear, and potentially useful information
(eg, rules, constraints, correlations, patterns, signatures and irregularities).

It is focused on automated methods for extracting patterns and/or models from a database.
   - the type of database is usually a data warehouse (where the patterns don't change).


Data mining and data warehousing are new important areas of database technology.


In business, a very successful data mining method is called "Market Basket Research"

   -also called Association Rules Mining (ARM)

   -popularized, for the most part, by large retailers such as Walmart - now widely used in business,

  -analyze the "customer transaction" file(s) for patterns or "associations".

  -help make business decisions (eg, decide inventory, decide pricing, arrange shelf space,
        choose sale items, design coupons, etc.)



An anecdotal example which is typically given is:

"Customers who buy SportsIllustrated and Diapers also buy Beer"

This example has become well-known as the kind of discovery that can happen through
data mining of cash register receipt data or so called "market basket research".

The association rule is "{SI, Diapers} => {Beer}"
  -means:  it appears to be the case that shopping carts or market baskets
   (customer transactions) that contain both SI and Diapers  will also contain beer



Interest measures (what makes a data mining result interesting?)

"SUPPORT" of the rule = % of baskets containing SI, Diapers and Beer.
"CONFIDENCE" is % baskets with SI and Diapers which also have Beer.


If the above rule has high support and confidence, it suggests Beer sales
will increase if beer is shelved close to Pampers or SI.


High support and confidence indicates consequential rules
     -rules that matter
     -Walmart-type datamining is a matter of searching through their daily
      cash register receipts database for unexpected rules of high support and confidence.



In precision agriculture, a field is subdivided into "pixels" or points (usually evenly).

   0   1   2   3   4   5   6   7   8   9  10  11  12 
 .---.---.---.---.---.---.---.---.---.---.---.---.---.
0|   |   |   |   |   |   |   |   |   |   |   |   |   |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
1|   |   |   |   |   |   |   |   |   |   |   |   |   |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
2|   |   |   |   |   |   |   |   |   |   |   |   |   |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
3|   |   |   |   |   |   |   |   |   |   |   |   |   |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
4|   |   |   |   |   |   |   |   |   |   |   |   |   |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
5|   |   |   |   |   |   |   |   |   |   |   |   |   |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
6|   |   |   |   |   |   |   |   |   |   |   |   |   |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
7|   |   |   |   |   |   |   |   |   |   |   |   |   |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
8|   |   |   |   |   |   |   |   |   |   |   |   |   |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
.|   |   |   |   |   |   |   |   |   |   |   |   |   |
.
.

These pixels are ordered in raster order:
0,0  0,1  ...  0,12   1,0  1,1  ...  1,12  . . .  8,0 8,1 ... 8,12 ...
first row             second row                  ninth row

The reflectance levels within given spectral ranges (e.g., Red, Green, Blue,...)
    are captured by a sensor and recorded in files called BANDs

RED-band
pix refl
0,0  24
0,1  26
0,2  49
0,3  68
0,4  93
0,5 119
.
.
.

The key for each band is the x,y coordinates.
  This attribute is usually omitted since the raster ordering is taken to be understood.
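
As a small illustration of why the key can be omitted, the sketch below converts
between a pixel's (row, col) coordinates and its position in raster order.  The
function names and the 13-column width (taken from the grid drawn above) are
assumptions made for illustration only.

```python
# Raster order lists pixels row by row, so position = row * ncols + col.
NCOLS = 13  # assumed width: columns 0..12, as in the grid above

def raster_position(row, col, ncols=NCOLS):
    """Return the 0-based position of pixel (row, col) in raster order."""
    return row * ncols + col

def pixel_coords(position, ncols=NCOLS):
    """Invert raster_position: recover (row, col) from a raster position."""
    return divmod(position, ncols)

print(raster_position(0, 12))  # last pixel of the first row
print(raster_position(1, 0))   # first pixel of the second row
print(pixel_coords(14))
```

Because the mapping is invertible, a band file only needs to store the reflectance
column; the x,y key can be recomputed from the line position.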


So a "BAND" is a single-attribute file holding either

 the relative reflectance levels (expressed as numbers in [0, 255])
 observed in a particular color range (or a non-visible range such as infrared), or

 an agricultural measurement (yield levels - e.g., bushels per acre for each pixel).


An association rule example:
"At points in a field where the midsummer

 Near-Infrared (NIR) reflectance is greater than 47 and
 Red reflectance is less than 32,

 the yield will be greater than 128 bu/acre"


The rule is   { NIR>47, R<32 }   =>   { Y>128 }

  - the set, { NIR>47, R<32 } is called the "antecedent" of the rule
  - the set  { Y>128 } is called the "consequent" of the rule


The "SUPPORT" of the rule = % (or ratio) of pixels with NIR>47 and R<32 and Y>128.

   - as a ratio, it can be expressed  |antecedent UNION consequent| / Total


The "CONFIDENCE" = % (or ratio) of pixels with NIR>47 and R<32 which also have Y>128.

   - as a ratio it can be expressed  |antecedent UNION consequent| / |antecedent|


If the support and confidence of this rule are high, that suggests to
the producer that nitrogen fertilizer should be applied in the locations
where NIR<=47 and/or R>=32, so as to maximize the yield in those areas
(get it up over 128 bu/acre).
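
The support and confidence of the rule { NIR>47, R<32 } => { Y>128 } can be
computed directly from per-pixel values.  The sketch below does so on a handful
of made-up pixels; the data and the helper names are hypothetical and only meant
to show the two ratios.

```python
# Hypothetical pixels: (NIR reflectance, Red reflectance, yield in bu/acre).
pixels = [
    (60, 20, 140),
    (55, 25, 135),
    (50, 30, 120),
    (40, 40, 100),
    (70, 10, 150),
    (30, 50,  90),
    (48, 31, 129),
    (45, 20, 130),
]

def antecedent(p):
    """True when the pixel satisfies NIR>47 and R<32."""
    nir, red, _ = p
    return nir > 47 and red < 32

def consequent(p):
    """True when the pixel satisfies Y>128."""
    return p[2] > 128

both = [p for p in pixels if antecedent(p) and consequent(p)]
ante = [p for p in pixels if antecedent(p)]

support = len(both) / len(pixels)      # pixels satisfying everything / Total
confidence = len(both) / len(ante)     # pixels satisfying everything / |antecedent|
print(f"support = {support:.3f}, confidence = {confidence:.3f}")
```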






DATA MINING (more generally):  Three main methods
            Association Rule Mining,
            Classification,
            Clustering

(Note that this topic overlaps with Artificial Intelligence.)


1. CLASSIFICATION ~= anti-graphing, ie, given a graph/table representation of a
relationship, find a closed form (approx?) functional representation of it.

    (the closed form of the classifier can be a
     decision table, an association rule, a functional dependency, etc.)


Recall in graphing, one has a closed form function,  y = f(x)

and one creates a table of pairs from it:      x    f(x) 
                                              ---   ---
and "graphs" those pairs:                      1     4
 4 -|  *                                       2     2
 3 -|                                          3     1
 2 -|     *                                    4     0
 1 -|        *     *                           5     1
 0 -`--+--+--+--*--+--              
       1  2  3  4  5 



In classification we have the table,   x   y  and we seek some closed form representation
                                      --- ---     or the characteristics of x that
                                       1   4      produce the various y values.  Of
                                       2   2      course the values may not be numeric.
                                       3   1
                                       4   0
                                       5   1


  - For continuous data, called "regression" - numeric values prediction (statistics)

  - For relations, we define PREDICTING attribute(s) and
                             GOAL (or class label) attribute(s).  

    Classification partitions the tuples of a relation according to their expected
    "goal" attribute value or class, by:

      1. defining a "training" or learning set of tuples (which have the goal attribute included),

      2. defining a "test" set of tuples (exclusive of training set, but which also have the goal attr),

      3. finding, for example, if-then classifier rules such that
           if the predicting attribute-value(s) satisfy the antecedent
           then the tuple's goal value is in the class predicted in the rule consequent with some high probability.


                    Predicting        Goal
                        /\             |
                     .-' `-.           |
                    /       \          v 
                     A1   A2           C
                    ______________________  _
                   |____|____|       |____|  \ 
                   |____|____| . . . |____|   \
                   |____|____|       |____|    \
                   |____|____|       |____|     } Training set of tuples
                   |____|____|       |____|    /
                   |____|____|       |____|   /
                   |____|____|       |____| _/
                   |____|____|       |____|  \
                   |____|____|       |____|   \
                   |____|____|       |____|    \
                   |____|____|       |____|     } Test set of tuples
                   |____|____|       |____|    /
                   |____|____|       |____|   /
                   |____|____|_______|____| _/



                     A1   A2           C
                    ______________________  _
                   |____|____|       |_?__|  \ 
                   |____|____| . . . |_?__|   \
                   |____|____|       |_?__|    \
                   |____|____|       |_?__|     } New set of tuples on which we
                   |____|____|       |_?__|    /  wish to predict the goal value
                   |____|____|       |_?__|   /   based on the predicting values.
                   |____|____|       |_?__|  /
                     .
                     .
                     .


  - The aim of classification is to discover (approximate?) relationships between predicting and goal
    attributes so the relationship can be used to predict the goal-attribute value or class
    of new tuples which have only the predicting attributes.





  - Two criteria to evaluate the quality of discovered rules

    * Error rate (1 - predictive accuracy) = (#_TestTuples_Misclassified)/(#_TestTuples)

    * Comprehensibility = eg, reciprocal of #_rules, #_conditions in each rule.
                                          (a measure of the simplicity of rules)
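
The error rate can be illustrated with a toy if-then classifier applied to a
test set.  Everything in this sketch (the tuples, the attribute meanings and
the rule "if A1 > 4 then high") is invented for illustration.

```python
# Hypothetical test tuples: (A1, A2, actual class C).
test_set = [
    (5, "x", "high"),
    (7, "y", "high"),
    (2, "x", "low"),
    (6, "x", "low"),
    (1, "y", "low"),
]

def classify(a1, a2):
    """A made-up if-then rule: if A1 > 4 then class is 'high', else 'low'."""
    return "high" if a1 > 4 else "low"

# Count the test tuples whose predicted class differs from the recorded class.
misclassified = sum(1 for a1, a2, c in test_set if classify(a1, a2) != c)
error_rate = misclassified / len(test_set)
print(f"misclassified = {misclassified}, error rate = {error_rate:.2f}")
```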





2. CLUSTERING   is classification but without a predefined goal attribute.

  - Invents classes: produces classification scheme by partitioning training tuples
       into classes according to some similarity criteria (eg, closeness under some metric)

       and then, possibly, attempts to determine a "characteristic attribute"
       that acts like a goal attribute in classifying partition sets.


  - Tuples with similar attribute values are clustered into the same class.

    Tuples with dissimilar attribute values are put in different classes.


We will look at some examples later.


 

3. ASSOCIATION RULE MINING  is a matter of looking for association rules in data such as
   the two we have already briefly introduced:
        {SI, Diapers} =>   {Beer}
      {NIR>47, R<32 } =>   { Y>128 }

and can be formally defined as follows:


I = {i1..im} are the literals or items       (eg, purchasable items at Walmart)

     x,y will be used to represent itemsets  (subsets of I   i.e., sets of items)



D = {t1..tn} are the transactions    (eg, customer going thru Walmart checkout)

             where each transaction, t = (tid, t-itemset), consists of a transaction-identifier (tid)
                                                           and a transaction itemset (t-itemset)

             e.g., the t-itemset of a Walmart customer transaction is the set of items in his/her cart.



A transaction, t, is said to CONTAIN an itemset, x,   iff x is a subset of t-itemset.

    (the items, x, are in the customer shopping cart;  e.g., {SI, Diapers, Beer} are in the cart)



An Association rule is an implication x => y,   where x and y are itemsets that don't intersect.



Each rule has two value measures, support and confidence, which can be defined:


Itemset, x, has SUPPORT, s, in trans set, D,   if s% (or ratio) of the transactions in D contain x

   The support of a rule, x=>y, is the support of the itemset, x UNION y.


The confidence of a rule,  x=>y, in D is c   if c% (or ratio) of the transactions in D that
    contain x also contain y, written as ratio or fraction:

       supp(xUy)/supp(x)      (since both denominators are |D|)


     Support    = frequencies of occurring patterns.

     Confidence = strength of implication.
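
These definitions translate almost word for word into code.  The sketch below
is one possible rendering; the function names are assumptions, and the four
transactions are the ones used in the worked Apriori example later in these notes.

```python
# Transactions as (tid, t-itemset) pairs.
D = [
    (100, {"A", "C", "D"}),
    (200, {"B", "C", "E"}),
    (300, {"A", "B", "C", "E"}),
    (400, {"B", "E"}),
]

def contains(t, x):
    """Transaction t = (tid, t-itemset) contains x iff x is a subset of t-itemset."""
    return set(x) <= t[1]

def supp(x, D):
    """Support of itemset x, as a ratio of the transactions in D."""
    return sum(1 for t in D if contains(t, x)) / len(D)

def conf(x, y, D):
    """Confidence of the rule x => y, computed as supp(x UNION y) / supp(x)."""
    return supp(set(x) | set(y), D) / supp(x, D)

print(supp({"B", "E"}, D))
print(conf({"B", "C"}, {"E"}, D))
```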
 


Association Rule Mining (ARM) procedures and Apriori Algorithm:
                                             -----------------

  The user specifies a minimum support level (msup) and a minimum confidence level (mconf).
  The task is to mine (or find) all association rules where support and confidence are at least at these thresholds.

  It can be broken into two steps:


1. Find all itemsets with high enough support (These will be the candidates for antecedent UNION consequent).
                                              (This is referred to as finding all large or frequent itemsets).


2. For each frequent itemset, derive all the rules supported by that set which have at least mconf confidence.

  That is:  given a frequent itemset, x, find all antecedent subsets, a, of x such that rule, a=>(x-a)
            has at least mconf confidence.



Overall performance of mining association rules is largely determined by 1.
        (the search for all frequent itemsets is usually the expensive part)

One can always use exhaustive search (look at each itemset, one at a time, and calculate its support).

Exhaustive search has exponential complexity (the complexity increases exponentially with the number of items in I)
Therefore, it is usually not an option (would take forever).
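
To get a feel for the growth, the following sketch enumerates every non-empty
itemset over m items and counts them.  The helper name is an assumption; the
count is 2^m - 1, so each additional item roughly doubles the work.

```python
from itertools import combinations

def all_itemsets(items):
    """Yield every non-empty subset of items (2^m - 1 of them for m items)."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

# Count the itemsets for a few small item universes.
for m in (5, 10, 15):
    count = sum(1 for _ in all_itemsets(range(m)))
    print(m, count)
```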

So we look for alternative methods that basically "prune" down the complexity in some way (e.g., Apriori below).



The Apriori Algorithm  for discovering all frequent itemsets;


Key pruning idea is: Any subset of a frequent itemset must also be frequent.

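WHY?  Every transaction that contains an itemset also contains each of its subsets,
      so the support of a subset is at least the support of the itemset.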


Start by finding all frequent 1-itemsets (itemsets with 1 item in them) by scanning D.

    Only items from frequent 1-itemsets need be included in candidate frequent 2-itemsets.

    Only items from frequent 2-itemsets need be included in candidate frequent 3-itemsets.
    Etc.



Thus we prune down the complexity of exhaustive search by eliminating large groups of itemsets at each step.

The algorithm terminates when there are no frequent k-itemsets found. 
 



APRIORI (details):   Lk = set of frequent k-itemsets;
                     Ck = set of candidate k-itemsets

   1st ITERATION: frequent 1-itemsets are found by scanning D.

   Kth ITERATION: assuming we have found L(k-1) ( the frequent (k-1)-itemsets ),
   create Ck (candidate k-itemsets) by applying the Apriori-gen procedure to L(k-1);
   then scan D to count the support of each candidate in Ck, keeping the frequent ones as Lk.


   Apriori-gen:  generate only those k-itemsets whose every (k-1)-itemset subset is frequent. 
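
A minimal sketch of the Apriori-gen idea follows.  It uses a naive pairwise join
(not the sorted-prefix join of the published algorithm) followed by the subset
prune described above; the function name and the data layout are assumptions.

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets in L_prev."""
    L_prev = {frozenset(x) for x in L_prev}
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            union = a | b
            if len(union) == len(a) + 1:      # a and b differ in exactly one item
                # Prune: keep the union only if every (k-1)-subset is frequent.
                if all(frozenset(s) in L_prev
                       for s in combinations(union, len(union) - 1)):
                    candidates.add(union)
    return candidates

# Frequent 2-itemsets from the worked example later in these notes.
L2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
print(apriori_gen(L2))
```

With this L2, {A,B,C} and {A,C,E} are joined but pruned because {A,B} and {A,E}
are not frequent, leaving {B,C,E} as the only candidate 3-itemset.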
 
 


GENERATING RULES supported by frequent itemsets:

For each frequent itemset, l, output all rules   a=>l-a   where a is subset of l
   such that  conf(a=>l-a) is at least mconf

    note conf(a=>l-a) = supp(l)/supp(a) .



A pruning technique for this phase is based on:

If b is a subset of a,  supp(b) >= supp(a), thus  conf(b=>l-b)  <=  conf(a=>l-a)

     Thus, if antecedent, a, does not produce a high-conf rule then none of its subsets will either.

Thus, for each frequent k-itemset, l, start with the largest antecedents ( (k-1)-item antecedents )
    next consider only those (k-2)-item antecedents for which every (k-1)-item superset
         antecedent produced a high-conf rule.

    (equivalently, start with 1-item consequents and consider only those 2-item consequents for which
     both 1-item subsets produced high-confidence rules, etc.)
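
The rule-generation step can be sketched as below.  For brevity this version
simply tries every non-empty proper subset of l as an antecedent (largest first)
and does NOT implement the pruning just described; the function name is an
assumption and the support counts are those of the worked example in these notes.

```python
from itertools import combinations

def rules_from_itemset(l, supp, mconf):
    """Return (antecedent, consequent, confidence) for every rule a => l-a
    whose confidence is at least mconf.  supp maps frozenset -> support."""
    l = frozenset(l)
    rules = []
    for size in range(len(l) - 1, 0, -1):     # largest antecedents first
        for a in combinations(sorted(l), size):
            a = frozenset(a)
            confidence = supp[l] / supp[a]    # conf(a => l-a) = supp(l)/supp(a)
            if confidence >= mconf:
                rules.append((a, l - a, confidence))
    return rules

# Support counts for {B,C,E} and its subsets.
supp = {
    frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}
for a, c, confidence in rules_from_itemset("BCE", supp, 0.8):
    print(sorted(a), "=>", sorted(c), round(confidence, 3))
```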


So to summarize the Apriori approach:

First determine all frequent itemsets: 
     start with 1-itemsets,
     then only consider unions of frequent 1-itemsets as candidate 2-itemsets, etc.

Next, for each frequent itemset found, search for the high-confidence rules it supports,
     by trying all 1-item consequents first (so that the antecedent is maximal size),
     then try 2-item consequents, but only those that are the union of 1-item consequents
          of high-confidence rules.  etc.




EXAMPLE 1:  Applying the Apriori Algorithm.

The item universe,         I = {A,B,C,D,E}
The transaction universe,  T =

 tid   t-itemsets
 ---   -----------------------------
 100   A             C      D 
 200          B      C             E 
 300   A      B      C             E 
 400          B                    E


Take:
msup=50%  (to be frequent, an itemset must be in at least 2 of the 4 transactions)

mconf=60% (to be a high-confidence rule, at least 60% of transactions which contain the antecedent
                                                                  must also contain the consequent).


The process of finding frequent itemsets:

Note:  As candidates for frequent 2-itemsets, we need not consider any 2-itemset
       with D in it since {D} is not a frequent 1-itemset.

       And for candidates for frequent 3-itemsets we need not consider any with
       {A,B} or {A,E} since they are not frequent 2-itemsets.



Cand_1-Isets        Cand_2-Isets          Cand_3-Isets             Cand_4-Isets is empty.
Iset Sup  Freq      Iset  Sup  Freq       Iset    Sup  Freq        Freq Iset generation ends. 
{A}    2  y         {A,B}   1             {B,C,E}   2  y
{B}    3  y         {A,C}   2  y  
{C}    3  y         {A,E}   1    
{D}    1            {B,C}   2  y  
{E}    3  y         {B,E}   3  y
                    {C,E}   2  y



Derive association rules.  

For frequent 3-itemsets, start with 1-item consequents:          High-Conf?
Rule1:  B and C => E,  confidence = 100%.  =Sup{B,C,E}/Sup{B,C}    y
Rule2:  B and E => C,  confidence = 66.7%. =Sup{B,C,E}/Sup{B,E}    y
Rule3:  C and E => B,  confidence = 100%.  =Sup{B,C,E}/Sup{C,E}    y

Form all 2-item consequents from high-conf 1-item consequents:
Rule4:  B => C and E,  confidence = 66.7%. =Sup{B,C,E}/Sup{B}      y
Rule5:  C => B and E,  confidence = 66.7%. =Sup{B,C,E}/Sup{C}      y
Rule6:  E => B and C,  confidence = 66.7%. =Sup{B,C,E}/Sup{E}      y


For each frequent 2-Isets, start with 1-item consequents:
For {A,C}
Rule7:  A => C,  confidence = 100%  = Sup{A,C}/Sup{A}               y
Rule8:  C => A,  confidence = 66.7% = Sup{A,C}/Sup{C}               y

For {B,C}
Rule9:  B => C,  confidence = 66.7% = Sup{B,C}/Sup{B}               y
Rule10: C => B,  confidence = 66.7% = Sup{B,C}/Sup{C}               y

For {B,E}
Rule11: B => E,  confidence = 100%  = Sup{B,E}/Sup{B}               y
Rule12: E => B,  confidence = 100%  = Sup{B,E}/Sup{E}               y

For {C,E}
Rule13: C => E,  confidence = 66.7% = Sup{C,E}/Sup{C}               y
Rule14: E => C,  confidence = 66.7% = Sup{C,E}/Sup{E}               y

All 14 rules are high-confidence.



EXAMPLE 2:
If mconf=80%, msup=50%:  We get the same frequent itemsets (since same msup).

Cand_1-Isets        Cand_2-Isets          Cand_3-Isets         Cand_4-Isets is empty.
Iset Sup  Freq      Iset  Sup  Freq       Iset    Sup  Freq    Freq Iset generation 
{A}    2  y         {A,B}   1             {B,C,E}   2  y       ends.
{B}    3  y         {A,C}   2  y  
{C}    3  y         {A,E}   1    
{D}    1            {B,C}   2  y  
{E}    3  y         {B,E}   3  y
                    {C,E}   2  y

Derive association rules.

For frequent 3-itemsets, start with 1-item consequents:          High-Conf?
Rule1:  B and C => E,  confidence = 100%.  =Sup{B,C,E}/Sup{B,C}    y
Rule2:  B and E => C,  confidence = 66.7%. =Sup{B,C,E}/Sup{B,E}     
Rule3:  C and E => B,  confidence = 100%.  =Sup{B,C,E}/Sup{C,E}    y

Form all 2-item consequents from high-conf 1-item consequents:
Rule5:  C => B and E,  confidence = 66.7%. =Sup{B,C,E}/Sup{C}       


For each frequent 2-Isets, start with 1-item consequents:
For {A,C}
Rule7:  A => C,  confidence = 100%  = Sup{A,C}/Sup{A}               y
Rule8:  C => A,  confidence = 66.7% = Sup{A,C}/Sup{C}                

For {B,C}
Rule9:  B => C,  confidence = 66.7% = Sup{B,C}/Sup{B}                
Rule10: C => B,  confidence = 66.7% = Sup{B,C}/Sup{C}                

For {B,E}
Rule11: B => E,  confidence = 100%  = Sup{B,E}/Sup{B}               y
Rule12: E => B,  confidence = 100%  = Sup{B,E}/Sup{E}               y

For {C,E}
Rule13: C => E,  confidence = 66.7% = Sup{C,E}/Sup{C}                
Rule14: E => C,  confidence = 66.7% = Sup{C,E}/Sup{E}                

Only Rules 1,3,7,11,12 are high-confidence.


EXAMPLE 3:    mconf=80%, msup=70%

 TID  Items 
 ---   -----------------------------
 100   A             C      D 
 200          B      C             E 
 300   A      B      C             E 
 400          B                    E

We get new frequent itemsets.

Cand_1-Isets        Cand_2-Isets          Cand_3-Isets is empty.
Iset Sup  Freq      Iset  Sup  Freq       Freq Iset generation
{A}    2            {B,C}   2             ends.
{B}    3  y         {B,E}   3  y  
{C}    3  y         {C,E}   2    
{D}    1                             
{E}    3  y                           

Derive association rules.

For frequent 2-itemsets, start with 1-item consequents:      High-Conf?
Rule1:  B       => E,  confidence = 100%.  =Sup{B,E}/Sup{B}        y
Rule2:  E       => B,  confidence = 100%.  =Sup{B,E}/Sup{E}        y

Rules 1,2 are high-confidence.


EXAMPLE 4:    mconf=80%, msup=80%

We get new frequent itemsets.

 TID  Items 
 ---   -----------------------------
 100   A             C      D 
 200          B      C             E 
 300   A      B      C             E 
 400          B                    E

Cand_1-Isets        Cand_2-Isets is empty.
Iset Sup  Freq      Freq Iset generation
{A}    2            ends.
{B}    3            
{C}    3           
{D}    1                             
{E}    3                              

Derive association rules.
There are no frequent itemsets.  Done.


These examples should demonstrate how much the pruning rules simplify
the cases with higher support and confidence.
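
The four examples can be checked mechanically.  The sketch below is a brute-force
miner (it enumerates every itemset rather than using Apriori pruning, which is
fine for five items); the names are assumptions.  Run on the four transactions
above, it should report 14, 5, 2 and 0 rules for the four threshold settings.

```python
from itertools import combinations

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
ITEMS = sorted(set().union(*D))

def count(x):
    """Number of transactions in D that contain itemset x."""
    return sum(1 for t in D if x <= t)

def mine(msup, mconf):
    """Brute-force miner: return the list of (antecedent, consequent) rules."""
    frequent = [frozenset(c)
                for k in range(1, len(ITEMS) + 1)
                for c in combinations(ITEMS, k)
                if count(frozenset(c)) / len(D) >= msup]
    rules = []
    for l in frequent:
        for size in range(1, len(l)):          # non-empty proper subsets of l
            for a in combinations(sorted(l), size):
                a = frozenset(a)
                if count(l) / count(a) >= mconf:
                    rules.append((a, l - a))
    return rules

for msup, mconf in [(0.5, 0.6), (0.5, 0.8), (0.7, 0.8), (0.8, 0.8)]:
    print(msup, mconf, len(mine(msup, mconf)))
```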
 


*****************************************************************
Quantitative Association Rule Mining

In order to do association rule mining on quantitative data, such as Remotely
Sensed Imagery data, using the methods of market basket research, we need to
do some mapping of concepts.

First let's review some features of Remotely Sensed Imagery (RSI).

Many RSI data sets are organized with a separate file for the values in each color
band (or more generally, each "spectral range").

For instance,  Thematic Mapper (TM) images (from the Landsat satellites) come as
7 separate files, one for each spectral range collected by the instruments:

File-1 (or "Band 1") is "visible blue color".  That is, each row of the Band-1 file
is a byte and indicates the amount of blue light (number of photons) collected by
a blue-light sensor for a particular small square on the surface of the earth (a pixel).
Pixels measure ~27 meters square.   TM images cover ~170 kilometers square and there are
~ 6500 rows of pixels and ~6500 columns of pixels, totaling ~ 40,000,000 pixels arranged
in raster order in File-1 (that is the 6500 readings along the first row are listed first,
followed by the 6500 pixels of the second row, etc.)

File-2 or Band-2 is "visible green color" with the same make up.  The 7 Bands are:

Band   Spectral range   
----   --------------  
1      blue                                \
2      green                                } "visible" bands
3      red                                 /
4      near infrared (or "red-infrared")  \
5      mid-infrared 1                      \
6      thermal infrared                     } "infrared" bands
7      mid-infrared 2                      /


Sometimes you will read that TM is composed of VIR reflectances (Visible and Infrared)


This organization into separate band files for each spectral range is called Band Sequential or BSQ.


Other RSI data (e.g., SPOT data from French Satellites) is organized as

Band-Interleaved-by-Row or BIR (more commonly called Band-Interleaved-by-Line or BIL), in which there is just one file in which the first row of the
  first band is followed by the first row of the second band,..., followed by the first row of the last band,
  followed by the second row of the first band, followed by the second row of the second band,...

Still other RSI data (e.g., some digital photography - tiff images) are organized as

Band-Interleaved-by-Pixel or BIP, in which there is just one file, the first pixel-value of the
  first band is followed by the first pixel-value of the second band,..., the first pixel-value of the last band,
  followed by the second pixel-value of the first band,...


Yet another organization at the extreme of this pattern would be Band-Interleaved-by-bit or BIb in which
  there is just one file, the first bit of the first pixel-value of the first band is followed by the first bit of
  the first pixel-value of the second band,..., the first bit of the first pixel-value of the last band,
  followed by the second bit of the first pixel-value of the first band,...

At the other end of this organization pattern is bit-SeQential or bSQ in which there is a separate file for
each bit of each band, B11,...,B18, B21,...,B28 ... Bn1,...,Bn8.  We will use bSQ organization later in this course.


We have the following spectrum of organizations:


         more interleaving-- >

bSQ      BSQ       BIL      BIP      BIb

       < -- more files


Clearly a simple algorithm can convert among these organizations.
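
As an illustration of such a conversion, the sketch below turns a tiny BSQ image
into BIP and into row-interleaved (BIL) order, and back from BIP to BSQ.  The
image values, dimensions and function names are made up for the example.

```python
# A tiny hypothetical image: 2 bands, 2 rows, 3 columns, stored BSQ
# (one list per band, each in raster order).
NCOLS = 3
bsq = [
    [11, 12, 13, 14, 15, 16],   # band 1
    [21, 22, 23, 24, 25, 26],   # band 2
]

def bsq_to_bip(bands):
    """Interleave by pixel: all band values of pixel 0, then of pixel 1, ..."""
    return [v for pixel in zip(*bands) for v in pixel]

def bsq_to_bil(bands, ncols):
    """Interleave by row: row 0 of each band, then row 1 of each band, ..."""
    nrows = len(bands[0]) // ncols
    out = []
    for r in range(nrows):
        for band in bands:
            out.extend(band[r * ncols:(r + 1) * ncols])
    return out

def bip_to_bsq(bip, nbands):
    """Undo bsq_to_bip: band b holds every nbands-th value starting at b."""
    return [bip[b::nbands] for b in range(nbands)]

print(bsq_to_bip(bsq))
print(bsq_to_bil(bsq, NCOLS))
```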

**********************************************************************************

The items:   I = { (b,v) : b = a band, v = a reflectance value }    ( v = number from [0,256) )



The transactions:  D = { t : t is a pixel },  where tid = (x,y), the pixel row,col and
                                              t-itemset = { (b,v) : b ranges over all bands and v is the reflectance
                                                                    at pixel, t, in band, b. }

     (if needed we will use ( b,v(t,b) ) for t-itemset elements)


We note right away that the sizes are very large in the ARM sense (e.g., for TM-7
satellite images (with yield map?), there are ~40,000,000 transactions, 8*256 = 2048 items
and 2^(2048) itemsets!)

  - We can reduce the number of transactions (pixels) by focusing on a particular
    small area (e.g., a field).

  - We can reduce the number of itemsets by noting that a pixel can have only one reflectance value from a given band.
    And therefore we can restrict our itemset consideration to those that have one interval of values from each band.
    In a given band there are 255 ways to pick the left endpoint of the interval and for left-endpoint, l, there are
    255-l ways to pick a right endpoint.  On the average there will be 127 ways to pick the right endpoint.
    Thus, there are really only (255*127)^8 or ~(2^8*2^7)^8 = 2^120 ~ 10^36, i.e., roughly
    1,000,000,000,000,000,000,000,000,000,000,000,000 itemsets to consider.

  - We can reduce the number of items by  partitioning the Bands into intervals and
    letting each interval correspond to a value.
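
The size estimates in the list above are easy to check with exact integer
arithmetic.  This sketch only reproduces the arithmetic quoted in the notes
(8 bands, 256 levels, 255*127 intervals per band); the variable names are assumptions.

```python
import math

bands, levels = 8, 256
items = bands * levels            # 8 * 256 (band, value) items
per_band = 255 * 127              # the notes' estimate of intervals per band
restricted = per_band ** bands    # one interval chosen from each of the 8 bands

print("items:", items)
print("2^2048 has about", round(items * math.log10(2)), "decimal digits")
print("log10 of (255*127)^8:", round(math.log10(restricted), 2))
print("log10 of 2^120:      ", round(120 * math.log10(2), 2))
```

Both restricted counts come out a little above 10^36, so "10^36" is an
order-of-magnitude figure rather than an exact value.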


Partitioning bands into intervals:

Equilength interval partitioning.

By truncating some of the right-most bits of the reflectance values (low order or
least significant bits) we can reduce the size of the itemset dramatically without
losing too much information (the low order bits show only slight differences).

As an example, we can truncate the right-most 6 bits, resulting in just 4 intervals,
   each of which we consider to be a "value" (e.g., identify each interval with its midpoint):

   [0,64), [64,128), [128,192), [192,256)   identified with values, 32, 96, 160, 224

Then there are only 10^8 = 100,000,000 itemsets if we allow 10 candidate intervals in each band
(e.g., the 4+3+2+1 = 10 runs of adjacent base intervals).

That's still a lot!
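
Truncating the low-order bits is a one-line operation.  The sketch below drops
the right-most 6 bits of a reflectance value in [0,256) and labels the resulting
interval with its midpoint, as in the example above; the function names are assumptions.

```python
SHIFT = 6                      # number of low-order bits dropped
WIDTH = 1 << SHIFT             # 64 reflectance values per interval

def interval_index(v):
    """Index (0..3) of the equilength interval containing reflectance v."""
    return v >> SHIFT

def interval_midpoint(v):
    """Midpoint that labels the interval containing v (32, 96, 160 or 224)."""
    return (interval_index(v) << SHIFT) + WIDTH // 2

for v in (0, 63, 64, 130, 200, 255):
    print(v, interval_index(v), interval_midpoint(v))
```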


Further pruning can be done by understanding what kinds of rules are probably of interest
to the user and focusing on those only.  For instance:

For a precision farmer, there is probably little interest in rules of the type, R>48 => G<134
A physicist might be interested in relationships among colors observed (both antecedent and
consequent are from visible bands), but the farmer is interested only in relationships
where the antecedent is from the color bands and the consequent is from the yield band (he or she
wants to know what observed color combinations predict high yield).
Therefore, for precision agriculture, we could restrict to those rules that have consequent from
the yield band (and then only the particular interval which indicates "high yield") and antecedent
from the others, so 10^7 = 10,000,000 itemsets to consider.

Slalom analogy:

Each transaction (pixel), t, is like a path down a ski hill,
each item is an interval in one band and therefore like a "gate" on the ski slope:

A transaction (pixel) "contains" an itemset, if it "goes thru" each gate
(has band-i reflectance in interval-i).

So if x is an itemset (set of "gates", one for each band),
s(x) is the proportion of paths passing thru the gates of x.

                  b1    b2    b3    b4    b5    b6    b7    b8 
                  |                 |    .---.        |     | 
      t---.       |                     / |   \       |     | 
           `---------------------------'  |    \      |     | 
                  |     |                 |      \    |    .----
                  |     |           |     |       \_______/ | 
 



Non-equi-length:
In some cases, it would be better to allow users to partition a band's range into
intervals of uneven length.  User knowledge can be applied in choosing the partition.
E.g., band bi can be partitioned into 3 intervals {[0,64), [64,128), [128,256)}
(if there aren't many values between 128 and 255).

Applying the user's domain knowledge can improve the accuracy of the association rules
and the efficiency of mining them.

Equi-depth partitioning (each partition has approx. the same number of pixels).
Can be done by setting the endpoints so that there are (approximately) the same
number of values in each interval (e.g., for two intervals, split at the median value), etc.
  Sometimes this leads to more reasonable rules.
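
One simple way to pick equi-depth endpoints is to sort the observed values and
cut at evenly spaced ranks.  The sketch below does that; it is only one possible
approach, and the function name and the sample reflectances are hypothetical.

```python
def equi_depth_endpoints(values, n_intervals):
    """Choose the interior endpoints so that each interval holds roughly the
    same number of the given values (a simple rank-based split)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_intervals] for i in range(1, n_intervals)]

# Hypothetical reflectances, skewed toward low values.
values = [3, 5, 8, 9, 12, 15, 20, 22, 30, 41, 90, 200]
print(equi_depth_endpoints(values, 3))
```

With these twelve values and three intervals, the cuts fall at 12 and 30, so
[0,12), [12,30) and [30,256) each hold four values even though the intervals
have very different lengths.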
 

Whether partitioning is equilength or not, it can be easily characterized as follows:
For each band, b, choose interval end-points, e0=0, e1, ..., e(n+1)=256,

 then the items are   ( b, [ej, e(j+1)) ), j=0,..n

(in the equilength case there would be a common length,  e(j+1) - ej = a constant).
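
Given the end-points for a band, mapping a reflectance value to its item is a
binary search.  The sketch below uses Python's bisect module; the function name
and the two example partitions are assumptions made for illustration.

```python
from bisect import bisect_right

def item_for(band, v, endpoints):
    """Map reflectance v in the given band to the item (band, (lo, hi)),
    where [lo, hi) is the interval containing v.
    endpoints is the sorted list e0=0, e1, ..., ending at 256."""
    j = bisect_right(endpoints, v) - 1     # largest j with endpoints[j] <= v
    return (band, (endpoints[j], endpoints[j + 1]))

equilength = [0, 64, 128, 192, 256]
uneven = [0, 64, 128, 256]                 # a user-chosen, non-equilength partition

print(item_for(3, 100, equilength))
print(item_for(3, 200, uneven))
```

The same function serves equilength, user-defined and equi-depth partitions,
since they differ only in the list of end-points.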


Generally, we are not so interested in rules that relate reflectances in certain bands
with reflectances in others.  A more interesting situation is to juxtapose RSI data
with, say crop yield data (or crop quality or...).
 
We do this by organizing our yield data identically with the RSI data, say as Band-8
(except of course that the values in band 8 represent the yield for that pixel
in bushels per acre or some other unit of measure, not reflectance).  In this situation
we are interested only in rules where the antecedent is from Bands 1-7 and the consequent
is from band 8.  That allows tremendous pruning of cases to consider.

The SMILEY software product has been programmed to do precisely this type of precision
agriculture data mining.  (see http://smiley.cs.ndsu.nodak.edu).





In ARM, there are other problems to consider.  For instance there are
MISLEADING RULES and REDUNDANT RULES

In the market basket case,
consider T={tea} and C={coffee}, with |D|=100 transactions.


MISLEADING RULES:

            coffee   NOTcoffee| total 
            .-----------------|----    Conf(T=>C)=|TUC|/|T|= 20/25=  .8 
       tea  |   20       5    |  25    Supp(C)=|C|/|D|=90/100     =  .9 
    NOTtea  |   70       5    |  75    
      ------|-----------------|----    So the rule T=>C is misleading: tea buyers buy
      total |   90      10    | 100    coffee less often (.8) than customers overall (.9).



REDUNDANT RULES:
             coffee  NOTcoffee| tot 
            .-----------------|----   Conf(T=>C)=20/22=  .9090 
       tea  |   20       2    |  22   Supp(C)=90/100  =  .9000 
    NOTtea  |   70       8    |  78                           
      ------|-----------------|----   Within .0090 of each other
      total |   90      10    | 100   so they are essentially redundant rules.
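
Both checks amount to comparing the confidence of T=>C with the support of C.
The following minimal Python sketch recomputes the numbers from the two tables;
the function name and argument names are assumptions made for the example.

```python
def rule_vs_baseline(n_both, n_antecedent, n_consequent, n_total):
    """Return (confidence of T=>C, support of C, confidence - support)."""
    conf = n_both / n_antecedent
    supp_c = n_consequent / n_total
    return conf, supp_c, conf - supp_c

# Counts taken from the two tea/coffee tables above.
misleading = rule_vs_baseline(20, 25, 90, 100)   # confidence below Supp(C)
redundant = rule_vs_baseline(20, 22, 90, 100)    # confidence about equal to Supp(C)
print(misleading)
print(redundant)
```

A clearly negative difference flags a misleading rule; a difference near zero
flags a rule that adds little beyond what Supp(C) already says.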



REFERENCES
1.[Quinlan 86] J.R.Quinlan. Induction of decision trees, Machine Learning,V1,81-106.1986.
2.[Quinlan 93] C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann,1993.
3.[Quinlan 96] Improved use of continuous attributes in C4.5. J of AI Res(JAIR)4,77-90,96.
4.R.Agrawal, R.Srikant "Fast Algs for Mining Assoc Rules", VLDB, Santiago, Chile, Sep 94.
5.R.Srikant, R.Agrawal "Mining Quantitative Assoc Rules in Large Rel Tables", SIGMOD 96.
6.R.Rastogi,et al. "Mining Optimized Assoc Rules for Categorical & Numeric Attr." ICDE 98.
7.R.Rastogi,et al. "Mining Optimized Support Rules for Numeric Attributes" ICDE, 1999
8.R.J. Miller and Y. Yang. "Association Rules over Interval Data", ACM SIGMOD 97.
9.T.Fukuda, et al. "Mining Optimized Assoc Rules for Numeric Attr," ACM PODS'96.
10. T.Fukuda,et al. "Data Mining Using 2D Opt Assoc Rules: Scheme, Algs, Vis" SIGMOD'96
11. J.Gehrke,et al. "BOAT--Optimistic Decision Tree Construction", ACM SIGMOD/PODS'99.
12. C. Silverstein, et al. "Beyond Market Baskets: Gen Assoc Rules to Dependence Rules",
     Data Mining and Knowledge Discovery,2,39-68(1998). 



The following appendixes are for your reading when you have time.  They are somewhat
repetitive but also provide more information for the interested reader.
You will not be tested on material that is in the appendixes but not in the notes proper.



APPENDIX A:

A paper on the subject:
Mining Association Rules from Remotely Sensed Data   by Jianning Dong

ABSTRACT

The explosive growth in data and databases has generated an urgent need
for new techniques and tools that can intelligently and automatically
transform the processed data into useful information and knowledge.
Data mining is such a technique that extracts nontrivial, implicit,
previously unknown, and potentially useful information (such as knowledge
rules, constraints, and regularities) from data in databases.
In this paper, we define a new type of data mining problem--mining
association rules from remotely sensed data--and its application in 
precision agriculture. In this application, 
association rules are statements of the form "in 90% of cases, if band
one's reflectance is between 0~15, band two's reflectance between 32~63,
and band three's reflectance between 128~255, then that field has a high
yield."  It is a quantitative data mining problem that is different
from the market basket data mining problem widely found in business applications.
Based on the characteristics of the remotely sensed data and the problem 
itself, we present a bit-oriented formal model and discuss the issues of
partitioning quantitative attributes into equal, uneven, and
discontinuous partitions.  We propose two new pruning techniques and compare
their performance with a base algorithm.  Finally, we implement a tool
in Java that allows users to interactively mine association rules from 
remotely sensed data. The tool has been successfully integrated with
SMILEY, a World Wide Web based satellite imagery analyzer and viewer.


1. INTRODUCTION
1.1 What is Data Mining
1.2 The Primary Tasks of Data Mining
1.3 Overview
2. BACKGROUND ON ASSOCIATION RULES
2.1 Definition of Association Rules
2.2 Dynamic Programming Terminology for Mining Algorithm
2.3 Base Algorithm
2.3.1 Apriori Algorithm
2.3.2 An Example of Applying Apriori Algorithm
2.4 Bit Vector Based DLG Algorithm
2.4.1 Association Graph Construction
2.4.2 Large Itemset Generation
3. MINING ASSOCIATION RULES FROM REMOTELY SENSED DATA
3.1 Problem Definition
3.2 Partitioning Quantitative Attributes
3.3 Uneven Depth and Discontinuous Intervals Partition
3.3.1 Uneven Depth Partition
3.3.2 Discontinuous Intervals Partition
3.4 Formal Model
3.5 Finding Large Itemsets from Imagery Data
3.5.1 New Pruning Techniques for Fast Data Mining
3.5.1.1 Technique One
3.5.1.2 Technique Two
3.6 Summary of the New Algorithm
3.7 An Example for Applying the New Algorithm
4. IMPLEMENTATION
4.1 SMILEY
4.2 Integration with SMILEY
4.2.1 New Feature Added into SMILEY
4.2.2 Functionalities of the New Tool
5. CONCLUSION
BIBLIOGRAPHY



1. INTRODUCTION 

1.1 What is Data Mining

In the last decade, we have seen an explosive growth in our capabilities to
generate  and collect data. Advances in scientific data collection (e.g.,
remote sensors or space satellites), the widespread introduction of bar
codes for almost all commercial products, and the computerization of many
businesses (e.g., credit card purchases) and government transactions
(e.g., tax returns) have generated a flood of data. Advances in data storage
technology, such as faster, higher capacity, and cheaper storage devices
(e.g., magnetic disks or CD-ROMs); better database management systems; and
data warehousing technology, have allowed us to transform this data deluge
into "mountains" of stored data.

A representative example is the NASA (National Aeronautics and Space 
Administration) Earth Observing System (EOS) of orbiting satellites and
other spaceborne instruments. Each satellite bears several sensors for
long-term global observations of the land surface, biosphere, solid Earth,
atmosphere, polar ice, and oceans. EOS is projected to generate on the
order of 50 gigabytes of remotely sensed image data per hour when 
operational in the late 1990s and early in the next century [WS91]. 

Clearly, such huge volumes of data overwhelm the traditional manual
methods of data analysis such as spreadsheets and ad hoc queries.
Many data access and reporting tools--relational database management
systems (RDBMSs), multidimensional analysis tools, ad hoc query and
reporting software, and statistical analysis packages--let users probe
these huge data stores. However, these tools do not enable users to find
patterns hidden in vast databases or to pinpoint the factors that
will help them make faster, more accurate decisions.
It is not realistic to expect human experts to analyze all this data
carefully.  A significant need therefore exists for a new
generation of techniques and tools that can intelligently and
automatically transform the processed data into useful information
and knowledge. Consequently, data mining has become a research area
of increasing importance.

Data mining, which is also referred to as knowledge discovery in
databases, is the process of nontrivial extraction of implicit,
previously unknown, and potentially useful information (such as
knowledge rules, constraints, and regularities) from data in databases 
[CHY96]. The general idea of discovering "knowledge" in large amounts
of data is both appealing and intuitive, but technically it is
challenging and difficult. For example, the discovered
information should not be obvious; the information extracted should
be simpler than the data itself, implying that there should be a
high-level language for expressing such information; the information
should be interesting; and so on [FUM96]. 

Data mining is an inter-disciplinary subject formed by the intersection
of many different areas. Researchers in knowledge base systems,
artificial intelligence, machine learning, knowledge acquisition,
statistics, spatial database, and data visualization have also shown
great interest in data mining. Since data mining poses many challenging 
research issues, direct applications of methods and techniques developed
in related studies of machine learning, statistics, and database systems
cannot solve these problems. It is necessary to perform dedicated studies
to invent new data mining methods or to develop integrated techniques
for efficient and effective data mining. In this sense, data mining
itself has formed an independent new field.  The database research
community has observed that data mining, together with data warehousing
and data repositories, is a new use of database technology; all three
are considered important areas in database research.

Due to its complexity, data mining technology has traditionally been
used in scientific and engineering settings since it originated in
university labs. Data mining is now growing common in business
environments, particularly in companies with large volumes of data,
communities of users who are not data analysis specialists, and
corporate data that is detailed and multifaceted, with data
relationships that are ad hoc and changeable, not predetermined or
even logical.

In the business world, the most successful application of data mining
is the "Market Basket" application. It is used to analyze transaction
databases and look for patterns among existing customer transactions.
Those patterns are used to help make business decisions, such as what
to put on sale, how to design coupons, how to place merchandise on
shelves in order to maximize the profit, etc. Another major business
use of data mining methods is the analysis and selection of stocks and
other financial instruments. Several successful applications have been
developed for analyzing and reporting data changes. These include
applications to supermarket sales data and health care databases.

A number of interesting and important scientific applications of data
mining have also been developed.  Example application areas in science
include astronomy, molecular biology, and global climate change
modeling. Furthermore, several emerging applications for information
providing services, such as on-line services and the World Wide Web 
(WWW), also call for various data mining techniques to better
understand user behavior, to improve the services provided, and to
increase the business opportunities.  

One should note that the manual search of data, search assisted by
queries to a database management system (DBMS), or humans visualizing
patterns in data are not referred to as data mining [FUM96].  The data
mining community has focused mainly on automated methods for
extracting patterns and/or models from data. The state-of-the-art 
technique in automated methods of data mining is still in a fairly
early stage of development.  There are no established criteria for
deciding which methods to use in which circumstances, and many of
the approaches are based on crude heuristic approximations to avoid
the expensive search required to find optimal, or even good, solutions.

1.2 The Primary Tasks of  Data Mining 

The primary goals of data mining in practice are prediction and
description.  Prediction involves using some variables or fields in
the database to predict unknown or future values of other variables
of interest. Description focuses on finding interpretable patterns
that describe the data.  The relative importance of prediction and
description for particular data mining applications can vary considerably.

The goals of prediction and description are achieved by using the
following primary data mining tasks [FL98].

(1) Association Rules.
The task of mining association rules in transaction or relational 
databases is to derive a set of strong association rules in the form of
A1 ^ ... ^ Am => B1 ^ ... ^ Bn, where Ai (for i=1,..,m) and Bj (for j=1,..,n)
are sets of attribute values, from the relevant data sets in a database.
An example of such an association rule is the statement that "70% of
transactions that purchase bread also purchase butter."  Support and 
confidence specified by the user are the major parameters determining
the quality of the discovered association rules.  A natural application
of association rules is to increase the sales of some item.  In this
very simple example, the rule suggests a way of increasing the sales
of butter in a supermarket: place butter on a shelf close to the
bread.  In very large real-world basket data, some interesting,
unexpected association rules can be discovered and used in a similar way.
We will discuss association rules in detail in the next chapter.

(2) Classification Rules.
In the classification task, each tuple belongs to a class in a
pre-defined set of classes.  The class of a tuple is indicated by the
value of a user-specified goal attribute.  Tuples consist of a set of
predicting attributes and a goal attribute. The latter is a categorical
(or discrete) attribute; that is, it can take on a value out of
a small set of discrete values, called classes or categories.  The
aim of the classification task is to discover some kind of relationship
between the predicting attributes and the goal attribute, so the 
discovered knowledge can be used to predict the class (value of goal
attribute) of a new, unknown-class tuple. The classification rule is
often represented in the form of "if-then." 

These rules are interpreted as follows: if the predicting attributes of
a tuple satisfy the conditions in the antecedent of the rule, then the
tuple has the class indicated in the consequent of the rule.  There
are two major criteria often used to evaluate the quality of the 
classification rules: measuring classification error rate and
comprehensibility of the discovered rules.

(3) Summarization Rules.
The aim of the summarization task is to produce a characteristic 
description of each class of tuples in the target data set.
This kind of description somehow summarizes the attribute values of
the tuples that belong to a given class (value of goal attribute).
That is, each class description can be regarded as a conjunction of some 
properties shared by all or most tuples belonging to the
corresponding class.  The characteristic rule is often represented in
the form of if-then. These rules are interpreted as follows:
if a tuple belongs to the class indicated in the antecedent of the rule,
then the tuple has all the properties mentioned in the consequent of
the rule.  It should be noted that, in summarization rules, the class
is specified in the antecedent ("if part") of the rule, while in
classification rules, the class is specified in the consequent
("then part") of the rule.

(4) Clustering Rules.
The clustering task produces a classification scheme that groups a set of 
data such that the intracluster similarity is maximized and the
intercluster similarity is minimized [HKK96].  It means that tuples
with similar attribute values are clustered into the same class.
Once the classes are invented, you can apply a classification
algorithm or a summarization algorithm to them in order to generate
classification or summarization rules for those classes.  The quality
of a produced classification scheme is important to the discovered
clustering rules.  An approach is to measure the extent to which
membership of a tuple in a given class reduces the uncertainty
about attribute values for that tuple.  In this sense, the clustering
task involves a kind of many-to-many prediction since any attribute 
can be used to determine the clusters and to predict the values of
other attributes.  This many-to-many prediction is in contrast with
the many-to-one prediction that is associated with the classification
task, where one attribute is treated as the goal attribute and all
other attributes are used to predict only the goal attribute value.

(5) Similarity.
The task of searching for similar patterns in a temporal or
spatial-temporal database is essential in many data mining operations
in order to discover and predict the risk, causality, and trend
associated with a specific pattern. Typical queries for this type of 
database include identifying companies with similar growth patterns,
products with similar selling patterns, stocks with similar price
movement, images with similar weather patterns, geological features,
environmental pollution, or astrophysical patterns. 

(6) Path Traversal Patterns.
In a distributed information providing
environment, documents or objects are usually linked together to
facilitate interactive access.  Understanding user access patterns
in such environments will not only help improve the system design
but also lead to better marketing decisions.  Capturing
user access patterns in such environments is referred to as mining
path traversal patterns.

A number of data mining systems developed to meet the requirements of
many different application domains have been proposed in the literature.
As a result, one can identify several different data mining tasks,
depending mainly on the application domain and on the interest of the user.
In general, each data mining task extracts a different kind of knowledge
from a database, so each task requires a different kind of data mining
algorithm, see [AIS93], [ASR94], and [CNF96].  These database-oriented
mining algorithms can be classified into two categories: concept
generalization-based discovery and discovery at the primitive concept
levels.  The former relies on the generalization of concepts (attribute 
values) stored in databases.  The latter discovers strong regularities
(rules) from the database without concept generalization.  Association
rules are important in the latter approach. 

1.3 Overview

This work defines a new type of data mining problem--mining association
rules from remotely sensed data--and its application in precision
agriculture. The rules are mined from a set of remotely sensed data.
Each band of an image scene is mapped to an attribute in a relational
database, and the domain of each attribute is from 0 to 255.
A collection of bands composes a relational database.
The task of mining association rules from remotely sensed data is to
derive strong association rules between intervals from different bands. 

In precision agriculture applications, each image scene is either
a collection of spectral bands representing the reflectance level
of each pixel in that image or an image map recording a particular
agriculture phenomenon such as yield. The mined association rules
are statements of the form "in 90% of cases, if band one's reflectance
is between 0~15, band two's reflectance between 32~63, and band three's
reflectance between 128~255, then that field has a high yield."

In the remaining chapters of the paper, we discuss how to solve
the above problem, covering problem definition, algorithm issues,
performance improvement, and implementation.  The paper is organized
as follows.  In Chapter 2, we discuss details of association rules and
existing algorithms for mining association rules. These algorithms are 
the basis for solving our problem.  However, they are not directly
suitable because they do not handle quantitative attributes.
In Chapter 3, we first describe our problem informally.  Then, we present
a bit-oriented formal model. For a quantitative attribute, a common
practice is to divide it into several consecutive intervals.  Due to
the unique characteristics of remotely sensed data and the problem itself,
we discuss several different ways to divide quantitative attributes into
equal, uneven, and discontinuous intervals that will increase the
algorithm efficiency and knowledge accuracy. Since the performance of 
an algorithm is very important, we propose two new pruning techniques for
generating candidate itemsets that are particularly designed for our
problem, and we compare their performance with a base algorithm.  An example
is given at the end of Chapter 3 to illustrate the steps of the new
algorithm. Chapter 4 discusses the implementation of the new tool and
its integration with SMILEY, a WWW-based imagery analyzer and viewer. 
Chapter 5 concludes the work.



2. BACKGROUND ON ASSOCIATION RULES

The task of discovering association rules was introduced by Agrawal,
Imielinski, and Swami in 1993.  In its original form, this task was
defined for a special kind of data, often called basket data, where
a tuple consisted of a set of binary attributes called items.  Each
tuple corresponds to a customer transaction, where a given item has
a value of true or false, depending on whether or not the
corresponding customer bought the item in that transaction.  This kind
of data is usually collected through bar-code technology; the typical 
example is a supermarket scanner. 

An association rule is an expression X IMPLIES Y, where X and Y are
sets of items.  The intuitive meaning of such a rule is that
transactions of the database which contain X tend to contain Y.
An example of such a rule might be that 70% of customers who purchase
bread also purchase butter.
		
2.1 Definition of Association Rules 

Association rules can be formally defined as follows.

Let I = {i1,i2,..,im} be a set of literals, called items.
Let D = {t1,t2,..,tn} be a set of transactions, where each transaction,
t, is a set of items such that  t is a subset of I.  Note that the 
quantities of items in a transaction are not considered.
Each transaction is associated with an identifier, called TID.
Given an itemset, X, a subset of I, a transaction t contains X if,
and only if, X is a subset of t.  The itemset X has support, s,
in the transaction set D if s% of transactions in D contain X;
we denote s = support (X).  An association rule is an implication
of the form X => Y, where X, Y are subsets of I,
and X INTERSECT Y = EMPTY.  Each rule has two measures of value:
support and confidence.  The support of the rule X => Y is
support (X UNION Y).  The confidence, c, of the rule X => Y in
the transaction set D means c% of transactions in D that contain
X also contain Y, which can be written as the ratio

support (X UNION Y) / support (X). 
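
As a small illustration of these definitions, the following Python sketch
computes support and confidence directly by counting transactions.  It uses the
four transactions that appear later in Table 2.2; the function names are
assumptions made for the example.

```python
# The four transactions of Table 2.2, keyed by TID.
D = {
    100: {"A", "C", "D"},
    200: {"B", "C", "E"},
    300: {"A", "B", "C", "E"},
    400: {"B", "E"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions.values() if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """support(X UNION Y) / support(X) for the rule X => Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

print(support({"B", "E"}, D))             # B and E occur together in 3 of 4
print(confidence({"B", "C"}, {"E"}, D))   # every transaction with B, C has E
```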

Support indicates the frequencies of the occurring patterns,
and confidence denotes the strength of implication in the rule.
Given a user specified minimum support (called minsup) and
minimum confidence (called minconf), the problem of mining
association rules is to find all the association rules whose
support and confidence are at least the user defined minsup
and minconf.  It can be decomposed into two subproblems: 

(1) Find the large itemsets.
Find all itemsets whose support is at least the predetermined
minimum support. These itemsets are called large
itemsets -- sometimes called frequent itemsets. 

(2) Generate the rules.
For each large itemset, derive all rules that have at least the
predetermined minimum confidence as follows: for a large itemset X UNION Y,
where X, Y are nonempty subsets of I, and X INTERSECT Y = EMPTY,
if support (X UNION Y) / support (X) greater-equal the minimum-confidence,
then the rule X => Y is derived.

The overall performance of mining association rules is determined
by the first step.  After the large itemsets are identified, the
corresponding association rules can be derived in a straightforward manner.

2.2  Dynamic Programming Terminology for Mining Algorithm

Dynamic programming is an optimization procedure that is
particularly applicable to problems requiring a sequence of
interrelated decisions. Each decision transforms the current
situation into a new situation. A sequence of decisions, which
in turn yields a sequence of situations, that maximizes (or
minimizes) some measures of a value is sought.  The value of a
sequence of decisions is generally equal to the sum of the values of
the individual decisions and situations in the sequence.  What is
common to all dynamic programming procedures is that a given "whole
problem" can be solved if the values of the best solutions of certain
subproblems can be determined (the principle of optimality) [DRE96].

In mining association rules, the problem of finding the large
itemsets fits the above general description.  The goal is to
find the large k-itemsets; this problem can be solved if the large
(k-1)-itemsets are found. One can use the large (k-1)-itemsets to generate
the candidate k-itemsets.  The optimal policy is to keep each candidate
k-itemset whose support is at least the user defined minimum support.
Solving the problem of finding large (k-1)-itemsets relies in turn on the
solution for large (k-2)-itemsets, and so on.  The candidate 1-itemsets
are the boundary case, which can be found directly in the transaction database.

2.3  Base Algorithm

2.3.1 Apriori Algorithm

Various algorithms have been proposed to discover the large itemsets,
see [AIS93], [ASR94].  The Apriori algorithm is one of the most
popular algorithms in the mining of association rules in a
centralized database. The main idea of Apriori is outlined in the 
following  sections [FUM96].  See all the notations in Table 2.1.



Table 2.1.  Notation for mining algorithm

k-itemset     An itemset having k items.
   Lk         Set of large k-itemsets.
   Ck         Set of candidate k-itemsets.


(1) The large itemsets are computed through iterations.

In each iteration, the database is scanned one time, and all large
itemsets of the same size are computed. The large itemsets are
computed in the ascending order of their sizes.

In the first iteration, the size-1 large itemsets are computed by
scanning the database once.  Subsequently, in the kth iteration
(k > 1), a set of candidate sets, Ck, is created by applying the
candidate set generating function Apriori-gen on Lk-1, where Lk-1 is 
the set of all large (k-1)-itemsets found in iteration k-1.
Apriori-gen generates only those k-itemsets whose every (k-1)-itemset
subset is in Lk-1. The support counts of the candidate itemsets in
Ck are then computed by scanning the database once, and the size-k
large itemsets are extracted from the candidates.

Apriori candidate generation.

The Apriori-gen function takes as an argument Lk-1 , the set of all
large (k-1)-itemsets.  It returns a superset of the set of all
large k-itemsets.  First, in the join phase, Lk-1 is joined with
itself, the join condition being that the lexicographically ordered
first k-2 items are the same, and that the attributes of the last two 
items are different.  Second, in the subset pruning phase, all
itemsets from the join result which have some (k-1)-subset that is
not in Lk-1 are deleted. The Apriori algorithm is shown in Figure 2.1.
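
The join and prune phases of Apriori-gen can be sketched on their own as
follows.  This is a minimal Python illustration, assuming each itemset is stored
as a lexicographically sorted tuple and that the join keeps pairs whose last
items are in increasing order (a common way to make sure the last two items
differ and each candidate is produced once).

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Build candidate k-itemsets from the set of large (k-1)-itemsets.

    large_prev: iterable of sorted tuples, all of the same length k-1.
    """
    prev = set(large_prev)
    candidates = set()
    for p in prev:
        for q in prev:
            # Join phase: first k-2 items equal, last items in increasing order.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune phase: drop c if any (k-1)-subset is not large.
                if all(s in prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates

# Large 2-itemsets of the example in Section 2.3.2.
L2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
print(apriori_gen(L2))
```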


L1 = {large 1-itemsets};
For (k = 2; Lk-1 not= EMPTY; k++) do begin
      Ck = apriori-gen (Lk-1);              // New candidates
      For all transactions t in D do begin
            Ct = subset (Ck, t);            // Candidates contained in t
            For all candidates c in Ct do
                  c.count++;
      End
      Lk = {c in Ck | c.count greater-equal minsup}
End
Answer = UNION(k) Lk;

Figure 2.1. Apriori algorithm.
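
The loop of Figure 2.1 can be written as a short runnable sketch.  The Python
below is an illustration rather than an optimized implementation: the subset
function is replaced by a plain containment test against every candidate,
itemsets are sorted tuples, and minsup is given as a transaction count.  The
function name and data layout are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset (sorted tuple): support count} for all large itemsets."""
    # First pass: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(large)

    while large:
        prev = set(large)
        # apriori-gen: join on the first k-2 items, then prune by subsets.
        candidates = set()
        for p in prev:
            for q in prev:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    if all(s in prev for s in combinations(c, len(c) - 1)):
                        candidates.add(c)
        # One scan of the database to count the candidates.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(large)
    return answer

# Transactions of Table 2.2 with a minimum support count of 2.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(D, 2).items()):
    print(itemset, count)
```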

(2) Generating rules.

For every large itemset, l, we output all rules  a IMPLIES (l-a),
where a is a nonempty proper subset of l, such that the ratio
support (l) / support (a) is at least minconf.  The support of any subset
a* of a must be at least as great as the support of a.  Therefore, the
confidence of a* IMPLIES (l-a*) cannot be more than the confidence of
a IMPLIES (l-a).  Hence, if a did not yield a rule involving all the items
in l with a as the antecedent, neither will a*.  Stated in terms of
consequents: for a rule (l-c) IMPLIES c to hold, all rules of the form
(l-c*) IMPLIES c* must also hold, where c* is a nonempty subset of c.

From a large itemset, l, the algorithm first generates all rules with
one item in the consequent.  The algorithm then uses the consequents
of these rules to generate all possible consequents with two items
that can appear in a rule generated from l, etc.  The rule generation
algorithm is shown in Figure 2.2.


For all large k-itemsets lk, k greater-equal 2, do begin
      H1 = {consequents of rules from lk with one item in the consequent};
      Call ap-genrules (lk, H1);
End

Procedure ap-genrules (lk: large k-itemset, Hm: set of m-item consequents)
      If (k greater m+1) then begin
            Hm+1 = apriori-gen (Hm);
            For all hm+1 in Hm+1 do begin
                  conf = support (lk) / support (lk - hm+1);
                  If (conf greater-equal minconf) then
                        Output the rule (lk - hm+1) implies hm+1
                              with confidence = conf and support = support (lk);
                  Else
                        Delete hm+1 from Hm+1;
            End
            Call ap-genrules (lk, Hm+1);
      End

Figure 2.2.  Rule generation algorithm.
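
A runnable sketch of this rule generation follows.  It is an illustration in
Python under stated assumptions: the recursion of ap-genrules is replaced by an
equivalent loop, itemsets are sorted tuples, and the support counts of all large
itemsets are supplied as a dictionary (here copied from the example of
Section 2.3.2).  The names gen_rules and SUPPORT are hypothetical.

```python
from itertools import combinations

def apriori_gen(prev):
    """Candidate (m+1)-itemsets from a set of sorted m-tuples (join + prune)."""
    out = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                if all(s in prev for s in combinations(c, len(c) - 1)):
                    out.add(c)
    return out

def gen_rules(support, minconf):
    """Return a list of (antecedent, consequent, confidence) triples."""
    rules = []
    for l, sup_l in support.items():
        if len(l) < 2:
            continue
        # H1: one-item consequents whose rules reach minconf.
        h = set()
        for item in l:
            antecedent = tuple(x for x in l if x != item)
            conf = sup_l / support[antecedent]
            if conf >= minconf:
                rules.append((antecedent, (item,), conf))
                h.add((item,))
        m = 1
        # Grow consequents one item at a time while k > m + 1.
        while h and len(l) > m + 1:
            h_next = set()
            for c in apriori_gen(h):
                antecedent = tuple(x for x in l if x not in c)
                conf = sup_l / support[antecedent]
                if conf >= minconf:
                    rules.append((antecedent, c, conf))
                    h_next.add(c)
            h, m = h_next, m + 1
    return rules

# Support counts of the large itemsets in the example of Section 2.3.2.
SUPPORT = {("A",): 2, ("B",): 3, ("C",): 3, ("E",): 3,
           ("A", "C"): 2, ("B", "C"): 2, ("B", "E"): 3, ("C", "E"): 2,
           ("B", "C", "E"): 2}
for antecedent, consequent, conf in gen_rules(SUPPORT, 0.6):
    print(antecedent, "=>", consequent, round(conf, 3))
```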


2.3.2 An Example of Applying Apriori Algorithm

Consider the database in Table 2.2.
       
 Table 2.2.   Sample transaction database

     TID   Items
     ---   ----------------------
     100   A      C      D
     200   B      C      E
     300   A      B      C     E
     400   B      E


Let  minimum-support = 50% and minimum-confidence = 60%. Since there are 
four records in the table, an itemset must appear in at least
2 transactions (4 x 50% = 2) to be large.


Figure 2.3 shows the process of finding large itemsets.


Database_D          Candidate_1-itemset        Large_1-itemset
TID   Items         Itemset    Support_Count   Itemset   Support_Count
100   A C D         {A}           2            {A}          2
200   B C E    -->  {B}           3            {B}          3
300   A B C E       {C}           3            {C}          3
400   B E           {D}           1            {E}          3
                    {E}           3



Candidate_2-itemset  Candidate_2-itemset        Large_2-itemset
Itemset              Itemset   Support_Count    Itemset   Support_Count
{A, B}               {A, B}        1            {A, C}       2      
{A, C}               {A, C}        2            {B, C}       2
{A, E}       -->     {A, E}        1            {B, E}       3
{B, C}               {B, C}        2            {C, E}       2
{B, E}               {B, E}        3
{C, E}               {C, E}        2


Candidate_3-itemset  Candidate_3-itemset        Large_3-itemset
Itemset              Itemset   Support_Count    Itemset   Support_Count
{B, C, E}    -->     {B, C, E}     2            {B, C, E}    2


Figure 2.3. The process of finding large itemsets.
 

(1) Large 1-itemset generation. Scan the database and count the
support for every item; the large 1-itemsets are {A}, {B}, {C}, and {E}.

(2) Large k-itemset generation. The mining algorithm has two phases,
candidate generation and pruning.  We show the result step by step.
Candidate 2-itemset: {{A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}}.

Count each candidate 2-itemset support, then prune the 2-itemset whose
support is lower than minsup.  We have: 

Large 2-itemset :    { {A,C}, {B, C}, {B, E}, {C, E} }.
Candidate 3-itemset: { {B, C, E} }.
Large 3-itemset:     { {B, C, E} }.

Since there is only one large 3-itemset, no candidate 4-itemset can be
generated, and large k-itemset generation terminates.

(3) Derive association rules.  We have large 3-itemset {{B, C, E}}
where s = 50%.  Recall that the predetermined minconf = 60%. We get:

B and C implies E,   with support = 50%  and confidence = 100%.
B and E implies C,   with support = 50%  and confidence = 66.7%.
C and E implies B,   with support = 50%  and confidence = 100%.
B implies C and E,   with support = 50%  and confidence = 66.7%.
C implies B and E,   with support = 50%  and confidence = 66.7%.
E implies B and C,   with support = 50%  and confidence = 66.7%.


2.4 Bit Vector Based DLG Algorithm

Researchers have focused on improving the
performance of association rule mining algorithms. Some published papers
introduce new algorithms and give experimental results to show
the improvement [YC96], [PCY97].

In this section, we introduce one of them, from which we
borrow the idea of using bit operations.

[YC96] proposed a Direct Large itemset Generation (DLG) algorithm
for efficient large itemset generation.  DLG only needs to scan
the database once. Empirical evaluations show that DLG outperforms
other algorithms that need to make multiple passes over the database.
There are three phases in the DLG algorithm: 

(1) Large 1-itemset generation phase,
    which generates large 1-itemsets and records information. 

(2) Graph construction phase,
    which constructs an association graph to indicate the associations
    between large items. In this phase, large 2-itemsets can be generated. 

(3) Large itemsets generation phase,
    which generates large k-itemsets (k>2) based on the
    constructed association graph.


2.4.1 Association Graph Construction

In the first phase, algorithm DLG scans the database once to count
the support and build a bit vector for each item.  The length of
each bit vector is the number of transactions in the database.
The bit vector associated with item i is denoted as BVi.
The number of 1s in BVi is equal to the number of transactions
which support  item i, that is, the support for  item i.

For example, consider the database in Figure 2.4.  Assume that
the minimum support is 50%, that is, 2 transactions.
           
TID          Itemset                  Bit Vector
100          A  C   D                 BVa = (1 0 1 0)
200          B   C  E                 BVb = (0 1 1 1)
300          A   B  C  E              BVc = (1 1 1 0)
400          B   E                    BVe = (0 1 1 1)

Figure 2.4. A database of transactions and corresponding bit vector.


Property 1.  The support for the itemset {i1, i2, . . , ik}
is the number of 1s in BVi1 ^ BVi2 ^ . .   ^ BVik, where the
notation ^ is a logical AND operation.
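
Property 1 can be illustrated with a minimal sketch that stores each bit
vector as a Python integer. The helper names bit_vector and support_count
are hypothetical; the transactions are those of Figure 2.4.

```python
# Transactions of Figure 2.4, in TID order 100, 200, 300, 400.
DB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def bit_vector(item):
    """Pack one bit per transaction into an int; the leftmost bit is TID 100."""
    bv = 0
    for t in DB:
        bv = (bv << 1) | (item in t)
    return bv

BV = {item: bit_vector(item) for item in "ABCDE"}

def support_count(itemset):
    """Property 1: AND the bit vectors of the items and count the 1s."""
    acc = (1 << len(DB)) - 1          # start from the all-ones vector
    for item in itemset:
        acc &= BV[item]
    return bin(acc).count("1")

print(format(BV["A"], "04b"))          # 1010
print(support_count({"B", "C", "E"}))  # 2
```

Once the bit vectors are built, no further database scan is needed to count
the support of any itemset.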

In the graph construction phase, DLG constructs an association graph
to indicate the association between items. For the association graph,
if the number of 1s in BVi ^ BVj (i < j) is no less than the
minimum support, a directed edge from item i to item j is constructed.
Also, itemset {i, j} is a large 2-itemset.  The association graph for
the above example is shown in Figure 2.5, and the large 2-itemsets
are {A, C}, {B, C}, {B, E}, and {C, E}.

            A  (1010)
               |
               |
               |
B (0111) ----------> E (0111)
        \      |     ^
         \     |    /
          v    v   /
           C (1110)

Figure 2.5. Association graph and bit vector for each large item
            in Figure 2.4.



2.4.2 Large Itemset Generation

In the large itemset generation phase, the DLG algorithm
generates large k-itemsets Lk (k > 2).  For each large k-itemset
in Lk (k >= 2), the last item of the k-itemset is used to extend the
itemset into (k+1)-itemsets.  Suppose {i1, i2, . . , ik} is a
large k-itemset.  If there is a directed edge from item ik to item u,
then the itemset {i1, i2, . . , ik} is extended into the (k+1)-itemset
{i1, i2, . . , ik, u}.  The itemset {i1, i2, . . , ik, u} is a
large (k+1)-itemset if the number of 1s in BVi1 ^ BVi2 ^..^ BVik ^ BVu
is no less than the minimum support. If no large k-itemsets can be
generated, the DLG algorithm terminates.

For example, consider the database in Figure 2.4 and its association
graph in Figure 2.5.
In the second phase, the large 2-itemsets L2 = {{A,C},{B,C},{B,E},{C,E}}.
For large 2-itemset {B, C}, there is a directed edge from the
last item C of the itemset {B, C} to item E. The number of 1s in 
BVb ^  BVc ^  BVe (i.e., (0110)) is 2.  Hence {B, C, E} is a large
3-itemset. The DLG algorithm terminates because no large 4-itemsets
can be generated.
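
The graph construction and large itemset generation phases can be sketched
as follows. This is a simplified reading of DLG under stated assumptions,
not the implementation of [YC96]: bit vectors are Python integers, items
are ordered alphabetically, and the names edges, large and all_large are
illustrative.

```python
from itertools import combinations

# Bit vectors of the large items in Figure 2.4 (leftmost bit = TID 100).
BV = {"A": 0b1010, "B": 0b0111, "C": 0b1110, "E": 0b0111}
MIN_COUNT = 2   # minimum support of 50% of 4 transactions

def ones(x):
    return bin(x).count("1")

# Graph construction: edge i -> j (i < j) whenever BVi ^ BVj has enough 1s.
edges = {i: [] for i in BV}
large = {}      # maps an itemset (tuple in item order) to its ANDed bit vector
for i, j in combinations(sorted(BV), 2):
    both = BV[i] & BV[j]
    if ones(both) >= MIN_COUNT:
        edges[i].append(j)
        large[(i, j)] = both

# Large itemset generation: extend each large k-itemset along the edges
# leaving its last item, until no new large itemset appears.
all_large = dict(large)
while large:
    extended = {}
    for itemset, bv in large.items():
        for u in edges[itemset[-1]]:
            both = bv & BV[u]
            if ones(both) >= MIN_COUNT:
                extended[itemset + (u,)] = both
    all_large.update(extended)
    large = extended

for itemset, bv in all_large.items():
    print(itemset, format(bv, "04b"), ones(bv))
```

Keeping the ANDed bit vector of each large k-itemset means that extending
it by one item costs a single AND operation.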


3. MINING ASSOCIATION RULES FROM REMOTELY SENSED DATA

3.1 Problem Definition

A large store of land-process data, collected from the LANDSAT
series of Earth Observing Satellites (part of EOS), is available from
the USGS (United States Geological Survey) EROS (Earth Resources
Observation System) Data Center. One of the goals is to analyze these
data and derive some useful rules for guiding farm managers.
The algorithm we described in Chapter 2 is for mining association
rules from market basket data. In precision agriculture applications,
the attributes are spectral bands.  The values in each band represent
a reflectance level of pixels.
The domain of each spectral band has values ranging from 0 to 255.
The band value is usually represented as an 8-bit binary number,
and a combination of several bands is stored as an image file.
This problem is different from mining basket data.
Generally speaking, attributes can be classified into
two types: quantitative and categorical.  Boolean attributes can be
considered a special case of categorical attributes.
Now, the problem is to mine association rules from quantitative
attributes. Data mining from large, remotely sensed data refers
to mining association rules that describe the relationships or
patterns among some or all of these quantitative attributes.

An example of an association rule for precision agriculture might
be "combinations of intervals of reflectance from certain bands
will imply certain agricultural phenomena like high yield." 

An example is described as follows.  We have an image scene and
a yield map. Both are digital images of the same field.
The image scene contains red, green, and blue bands.
It represents the reflectance levels of each pixel of the scene.
The yield map is visualized as an RGB image.
The actual yield values are recorded in an 8-bit gray scale (one
band).  Now, there are five attributes in the database.
They are pixel-number, red (band 1), green (band 2), blue (band 3),
and yield (band 4). The problem of mining association rules 
from this remotely sensed data is to discover the associations
between band 1, band 2, band 3, and band 4.
These association rules will help farm managers understand which
combinations of spectral band values are associated with high crop yield.

Note that each band is a quantitative attribute.
We must therefore first decide how to handle quantitative
attributes in the data mining process.


3.2 Partitioning Quantitative Attributes

The Apriori and DLG algorithms can only handle categorical
basket data.  When quantitative attributes are in the database,
we cannot directly apply the techniques for mining association rules
in a transaction database (basket database).  In a transaction
database, a specific transaction includes a set of items,
which means that some items appear in this transaction.
We can use a boolean value of TRUE or FALSE to express the
relationship between an item and a transaction.
For quantitative attributes, say one bag of bread versus
ten bags of bread, a boolean value cannot tell the difference
between the two values.

Few papers discuss mining quantitative association rules
in relational databases.  Researchers at the IBM Almaden Research
Center [SA96] focus on mining quantitative association rules and
propose a straightforward method. They simply
partition the attribute values into intervals;
the intervals are mapped to consecutive integers, such that
the order of the intervals is preserved.
This approach treats a database record as a set of
(attribute, integer value) pairs, without loss of generality.
For example, we partition the attribute band into four intervals.
We can use equal depth to separate the values. Then, we have
four attributes in the database; they are {band, 0..63},
{band, 64..127}, {band, 128..191}, and {band, 192..255}.
The "item" {band, interval 1} would be "1" if the band
had a value within interval 1 in the original tuple
and "0" otherwise. Figure 3.1 shows the equal depth partition.

Pixel band  band,0..63   band,64..127   band,128..191   band,192..255
 1    80    0            1              0               0
 2    120   0            1              0               0
 3    180   0            0              1               0
 4    220   0            0              0               1
 5    30    1            0              0               0

Figure 3.1.   An example of equal depth partitioning.
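
The mapping in Figure 3.1 can be reproduced with a small sketch. The
function name equal_depth_items and its parameters are assumptions made
for illustration; the band domain is taken to be 0..255.

```python
def equal_depth_items(value, n_intervals=4, domain=256):
    """Return the 0/1 item row for one band value, one flag per interval."""
    depth = domain // n_intervals              # 64 when n_intervals is 4
    return [int(value // depth == k) for k in range(n_intervals)]

for pixel, band in enumerate([80, 120, 180, 220, 30], start=1):
    print(pixel, band, equal_depth_items(band))
```

Each value sets exactly one flag, so one quantitative attribute becomes
n_intervals boolean items.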


There are two problems with equal depth partitioning.
If the number of intervals for a quantitative attribute is large,
the support for any single interval can be low.
Hence, some rules involving this attribute may not be found
because they lack minimum support. If the number of intervals
for a quantitative attribute is small, some information is
lost, and some rules may fail to reach minimum confidence.

In real world applications such as precision agriculture,
algorithms that only provide equal depth partitioning may not
be sufficient. In this paper, we discuss several ways to divide
quantitative attributes into uneven depth and discontinuous intervals.
The general idea is to allow users to interact with the mining
engine during the whole process of data mining.
For equal depth partitioning, we further consider the fact that
remotely sensed data are in binary form.  Therefore, a partition
based on powers of two can facilitate the execution of the algorithm.
(Details are found in Section 3.4.)

 
3.3 Uneven Depth and Discontinuous Intervals Partition

For quantitative attributes in a relational database, we separate
the attribute domain into several intervals.
Separating intervals for quantitative attributes is a difficult task.
One feature of our approach is that it allows user interaction, so
users can apply their domain knowledge throughout the
data mining process.  Doing this will increase the accuracy of the
association rules and the efficiency of the mining algorithm.

In the precision agriculture application, our data source is the
image scene.  The schema of  the image file is shown in Figure 3.2. 
  

Image file
  __________________________________________
 |Pixel | Band 1 | Band 2 | . . . | Band n  |
 |______|________|________|_______|_________|

Figure 3.2. Image file schema.



3.3.1  Uneven Depth Partition

The attribute pixel takes consecutive positive integer values
starting from 1. Values in each band represent the reflectance
level of pixels. The domain of each spectral band has values
ranging from 0 to 255. It is wise to allow users to
separate the domain into intervals of uneven depth depending on their
domain knowledge. This approach is almost the same as the previous one.
For the attribute band, assume users know that few values
fall between 128 and 255. It is then better to separate the domain,
for example, into [0,31], [32,63], [64,127], and [128,255].
Then, we can convert the attribute band into four attributes;
they are: band,0..31, band,32..63, band,64..127, and
band,128..255. The item band,interval-1 would be "1" if the band
had a value within interval 1 in the original tuple and "0" otherwise.
Figure 3.3 shows an example of uneven depth partitioning.

Pixel band  band,0..31   band,32..63   band,64..127   band,128..255
1     80   0             0             1              0
2     120  0             0             1              0
3     180  0             0             0              1
4     220  0             0             0              1
5     30   1             0             0              0

Figure 3.3.   An example of uneven depth partitioning.


Compare Figures 3.1 and 3.3. You can see that, for different partitions,
the same pixel will map into different intervals.
This partitioning will directly affect the quantity of the
association rules mined later.



3.3.2  Discontinuous Intervals Partition

In precision agriculture, each band represents the reflectance level
of pixels.  Some attributes reflect certain application-specific
data, e.g., various crop-related phenomena such as soil types,
yield, and quality. Some experts suggest that the green band has the
closest relation with high quality. The image may have
red, yellow, blue, green, black, purple, and orange bands.
In this case, we are only concerned with the green band and
the quality band.  Another typical example in precision agriculture
is the Normalized Difference Vegetation Index (NDVI).
NDVI produces a simple spectral Vegetation Index that separates
green vegetation from its background soil brightness using
Landsat MSS digital data.  In this situation, we are concerned with
the green band within 128~191 and the blue band within 0~63.
We should allow users to select any intervals of any bands they are
interested in and any output they want.
There is no need to keep the original band order.
Figure 3.4 gives an example of NDVI discontinuous intervals partitioning.

Pix band1  band2  band3    band1,128..191   band2,0..63   band3,128..255
   (green) (blue) (yield)  --------------   -----------   -------------- 
1   110     50     240      0                1             1
2   180     63     130      1                1             1
3   230     70     200      0                0             1
4   220     50     180      0                1             1
5   130     60     220      1                1             1

Figure 3.4.  An example of discontinuous partitioning.
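
User-selected intervals, whether uneven or discontinuous, can be handled
by one small routine. The sketch below reproduces the item columns of
Figure 3.4; the names SELECTED, PIXELS and to_items are hypothetical, and
interval bounds are assumed to be inclusive.

```python
# User-selected (band, low, high) items, as in Figure 3.4; bounds are inclusive.
SELECTED = [("band1", 128, 191), ("band2", 0, 63), ("band3", 128, 255)]

PIXELS = [
    {"band1": 110, "band2": 50, "band3": 240},
    {"band1": 180, "band2": 63, "band3": 130},
    {"band1": 230, "band2": 70, "band3": 200},
    {"band1": 220, "band2": 50, "band3": 180},
    {"band1": 130, "band2": 60, "band3": 220},
]

def to_items(pixel, selected=SELECTED):
    """Return one 0/1 flag per selected interval; a value outside every
    selected interval of its band simply produces no item."""
    return [int(lo <= pixel[band] <= hi) for band, lo, hi in selected]

for number, pixel in enumerate(PIXELS, start=1):
    print(number, to_items(pixel))
```

Because the intervals are listed explicitly, they need not cover the whole
domain or follow the original band order.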



3.4 Formal Model

Because remotely sensed data are binary in nature, we give a bit-oriented
model for the problem of mining association rules from remotely sensed
data.  We follow this model when we implement the algorithm.	

Given a database of imagery pixels, it is desirable to discover
the important associations among intervals from different bands
such that the presence of intervals from some bands will imply
the presence of intervals from other bands. 

We refer to "intervals in bands" as "items",
I = {[b* 0*, b* 1*]i | i = 1,..,m};
i is the band number, and m is the total number of bands.
We use b1b2b3b4b5b6b7b8 to represent binary data.
The range of 8-bit binary is from 0000 0000 to 1111 1111
(i.e., 0 to 255).  We define b* = b1b2..bd (first-list),
where d is called the diameter.  We have interval depth = 2 POWER (8-d).
For example, if d = 3, the interval depth = 2 POWER (8-3) = 32,
the number of intervals = 2 POWER d = 8, and the intervals are
defined as [0, 31], [32, 63], [64, 95], [96,127], [128,159],
[160, 191], [192, 223], and [224, 255].
We define 0* = 0..0 and 1* = 1..1, the strings that fill the remaining
8-d bits.  When d = 3, we have the corresponding 0* = 00000
and 1* = 11111. Figure 3.5 shows different expressions of item.
(d = 3; there are 8 items in Figure 3.5.)

b*     [b* 0*, b* 1*] in binary   Decimal expression
000    00000000, 00011111         [0, 31]
001    00100000, 00111111         [32, 63]
010    01000000, 01011111         [64, 95]
011    01100000, 01111111         [96, 127]
100    10000000, 10011111         [128, 159]
101    10100000, 10111111         [160, 191]
110    11000000, 11011111         [192, 223]
111    11100000, 11111111         [224, 255]

Figure 3.5.   Different notations of item.
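
The rows of Figure 3.5 follow mechanically from the diameter d. A minimal
sketch, in which the function name intervals and the parameter bits are
assumptions for illustration:

```python
def intervals(d, bits=8):
    """List (prefix, low, high) for every d-bit prefix b* of a `bits`-bit value."""
    shift = bits - d                      # low-order bits filled with 0s or 1s
    depth = 1 << shift                    # interval depth = 2 POWER (bits - d)
    return [(format(p, f"0{d}b"), p << shift, (p << shift) + depth - 1)
            for p in range(1 << d)]

for prefix, low, high in intervals(3):
    print(prefix, format(low, "08b"), format(high, "08b"), [low, high])
```

The low end of each interval is the prefix followed by 0s, and the high
end is the prefix followed by 1s.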
	

We refer to the collection of all pixels as the
transaction set, T = {(t(b*,i)) | t(b*,i) in {0, 1}} =
{ {(b*1, i1),..,(b*m, im)} | t(b*k,ik) = 1 }.
Each transaction has a unique number called a pixel number.
An association rule is an implication of the form X implies Y,
where X = {collection of reflectance intervals},
Y = {output interval}, X and Y are subsets of I, and X INTERSECT Y = empty.

Clearly, with user defined diameter d, we can partition the
band into equal depth intervals.  Further, we can use bit-wise
ANDing to count the occurrence of the pattern b1b2..bd
(first-list) instead of testing to see if a value falls within
an integer range. Doing bit-wise ANDing will improve the
efficiency of the database scan, which is an important 
implementation issue.
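
The bit-wise test can be sketched as a mask-and-compare on the leading d
bits. The helper names in_interval and count_pattern are hypothetical, and
values are assumed to be 8-bit integers.

```python
def in_interval(value, prefix, d, bits=8):
    """True when the first d bits of `value` equal `prefix` (given as an int)."""
    shift = bits - d
    mask = ((1 << d) - 1) << shift        # d leading 1s, e.g. 11100000 for d = 3
    return (value & mask) == (prefix << shift)

def count_pattern(values, prefix, d):
    """Support count of the item whose first-list b1b2..bd is `prefix`."""
    return sum(in_interval(v, prefix, d) for v in values)

band = [80, 120, 180, 220, 30]
print(count_pattern(band, 0b01, 2))       # pixels whose value lies in [64, 127]
```

A single AND and one comparison replace the two comparisons of an integer
range test.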


3.5  Finding Large Itemsets from Imagery Data

In data mining, it is essential to collect a sufficient amount
of data so that we can derive meaningful conclusions from them.
As a result, the amount of these data tends to be huge.
There are 40 million pixels in one image scene.
Efficiency is an important issue for a mining algorithm. 

Finding large itemsets can be done iteratively in the
sense that the large itemsets discovered in one iteration are
used as the basis to generate the candidate itemsets for the next
iteration. For example, in the Apriori algorithm, the kth iteration
generates the large k-itemsets. In the next iteration, a heuristic
is used to extend large k-itemsets into candidate (k+1)-itemsets.
The heuristic used to construct the candidate itemsets
is crucial to performance. Clearly, in order to be efficient,
the heuristic should only generate candidates with a high likelihood
of being large itemsets because, for each candidate, we need to
count its appearances in the database. The larger the set of candidate
itemsets, the higher the processing cost of discovering the large
itemsets. A performance study shows that the set of candidate itemsets
generated during an early iteration is generally orders of
magnitude larger than the set of large itemsets that it really
contains. Therefore, the initial candidate set generation,
especially for the large 2-itemsets, is the key to improving
the performance of data mining [PCY97].


3.5.1 New Pruning Techniques for Fast Data Mining

We propose two effective pruning techniques for candidate itemset
generation to progressively improve the efficiency of the base algorithm.


3.5.1.1  Technique One

Observation 1. A pixel value cannot belong to two different
intervals from the same band.

Observation 2. The combination of k intervals (k > 1) from the
same band has support zero.

From Observation 2, we know that, for any candidate k-itemset
(1 < k < n) generated from the same band, the support for this
candidate must be zero. It is impossible for these candidate
k-itemsets to become large k-itemsets later.  The new pruning
technique proposed in this paper is that the algorithm does not need
to generate the candidate k-itemsets among those "items" which
are in the same band with different intervals.

We compare the Apriori algorithm and our new algorithm in generating
candidate 2-itemsets.  The notation is shown in Table 3.1.



Table 3.1.  Notation for comparison
     Ck  Candidate k-itemsets
     Lk  Large k-itemsets
  | Ck | Number of itemsets in candidate k-itemsets
  | Lk | Number of itemsets in large k-itemsets
      *  An operation for concatenation
     Rj  Number of intervals in band j
     

(1) Apriori algorithm.   According to the fact that any subset
of a large itemset must also have minimum support, Apriori uses
L1 * L1 to generate a candidate set of itemsets C2. 

          |C2|apriori  = |L1| ( |L1| - 1 ) / 2 
                            
                       = COMBO{ |L1|, 2 }

(combinations of |L1| things taken 2 at a time).



(2) New algorithm.   Based on Observation 2, we will not choose the
intervals from the same band to generate candidate 2-itemset.
Assume |L1| = R1 + R2 + . . + Rn.

|C2|new = R1(R2+R3+..+Rn) + R2(R3+R4+..+Rn) + .. + R(n-2)(R(n-1)+Rn) + R(n-1)Rn

        = SUM(j=1,n-1)( COMBO{Rj,1} * COMBO{SUM(k=j+1,n)(Rk),1} )


The number of candidate 2-itemsets generated by the new algorithm
is much less than by Apriori.  

|C2|prune1 = |C2|apriori - |C2|new 

           = SUM(j=1,n)(COMBO{Rj,2})
 

Note that when n and Rj are large, |C2|prune1 becomes an extremely
large number.  For example, if the imagery data have 8 bands
and each band has 16 intervals, the number of pruned candidate
2-itemsets is  (8 x 16 x (16-1) / 2) = 960. This sharply reduces the
processing cost.  The new algorithm employs effective pruning
techniques to progressively reduce the number of candidate
2-itemsets, thus relieving the performance bottleneck of
the whole process.
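
The three counts can be checked numerically. The sketch below assumes that
every interval of every band is a large 1-itemset, so that |L1| is the sum
of the Rj; the function names are illustrative only.

```python
from itertools import combinations
from math import comb

def c2_apriori(r):
    """|C2|apriori = COMBO{|L1|, 2}, with |L1| = R1 + .. + Rn."""
    return comb(sum(r), 2)

def c2_new(r):
    """Enumerate pairs of large items and keep those from different bands."""
    items = [(band, k) for band, rj in enumerate(r) for k in range(rj)]
    return sum(a[0] != b[0] for a, b in combinations(items, 2))

def c2_prune1(r):
    """|C2|prune1 = SUM(j=1,n)(COMBO{Rj,2})."""
    return sum(comb(rj, 2) for rj in r)

r = [16] * 8                     # 8 bands, 16 large intervals each
print(c2_apriori(r), c2_new(r), c2_prune1(r))   # 8128 7168 960
```

Since c2_new counts by enumeration rather than by formula, the identity
|C2|apriori - |C2|new = |C2|prune1 is verified independently.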



3.5.1.2   Technique Two

During the process of data mining, allowing user interaction with
the mining engine and utilizing users' prior knowledge will help
to speed up the mining algorithms by restricting the search space.

In precision agriculture applications, farm managers like
to know which "combinations of intervals of reflectance from certain
bands imply certain agricultural phenomena like high yield."
This means farm managers are only interested in those rules
in which reflectance bands appear in the antecedent and other bands
appear in the consequent. The mining task is to find the
association rules X implies Y, where  X = {collection of
reflectance intervals}, Y = {output interval}, and X INTERSECT Y = empty.

The user's knowledge gives us the heuristic that, if the large
itemsets we mine do not contain Y, the rules derived later
will be uninteresting to the user. In this paper, we propose
another pruning technique for candidate itemset generation based
on this heuristic.

Assume there are N bands in the imagery data; 1,..,R are indexes
of reflectance bands, and R+1,..,N are indexes of output bands.
The association rules we mine should have the form:

band1 ^..^ bandR IMPLIES band(R+1) ^..^ bandN.

(1) Consider only one band, "bandN," in the output.
The association rule is of the form band1^..^band(N-1) IMPLIES bandN.

For candidate 2-itemsets, since we know that band N should be
in every candidate itemset, the new algorithm only generates those
candidate itemsets in which one item is a band N interval and the other
is a band k (k = 1,..,N-1) interval. Assume the candidate 2-itemsets are

{  band-N,range,  band-k,range },   (k=1,..,N-1). 

Then, the number of candidate 2-itemsets is

       |C2|new = COMBO{RN,1} * COMBO{SUM(j=1,N-1)(Rj),1}

Because we are not interested in itemsets that do not
contain band N, we prune those candidate itemsets in
which no interval is chosen from band N. The number
of pruned candidate 2-itemsets is

|C2|prune2 = SUM(j=1,N-2)( COMBO{Rj,1} * COMBO{SUM(k=j+1,N-1)(Rk),1} )


Applying the pruning technique described in 3.5.1.1, the
number of pruned candidate 2-itemsets is
        
       |C2|prune1 = SUM(j=1,N)COMBO{Rj,2}


The total number of pruned candidate 2-itemsets is

 |C2|prune = |C2|prune1 + |C2|prune2 
   
           = COMBO{SUM(j=1,N-1)(Rj), 2}  +  COMBO{RN, 2}


The remaining steps are the same as in the Apriori algorithm.



(2) Consider two bands in the output. The association rule is of
    the form     

       band1 ^..^ band(N-2)  IMPLIES  band(N-1) ^ bandN. 


For candidate 2-itemsets, the new algorithm only generates
those candidate itemsets in which one item is in either
band(N-1) intervals or band N intervals. Assume the
candidate 2-itemsets are

{ band-(N-1),range,  band-k,range }    (k=1,..,N-2)  or

{ band-N,range,  band-k,range }        (k=1,..,N-1).


The number of candidate 2-itemsets  is

|C2|new = COMBO{R(N-1), 1} * COMBO{SUM(j=1,N-2)(Rj), 1} +
          COMBO{RN, 1} * COMBO{SUM(j=1,N-1)(Rj), 1}

We prune those candidate itemsets in which the interval is
chosen from neither band (N-1) nor band N.
The number of pruned candidate 2-itemsets is

|C2|prune2 = SUM(j=1,N-3)( COMBO{Rj,1}*COMBO{SUM(k=j+1,N-2)(Rk),1} )


Apply the new pruning technique described in 3.5.1.1;
the number of pruned candidate 2-itemsets is

|C2|prune1 = SUM(j=1,N)(COMBO{Rj,2})


The total  number of pruned candidate 2-itemsets is
            
|C2|prune = |C2|prune1 + |C2|prune2 

    = COMBO{SUM(j=1,N-2)(Rj),2} + COMBO{R(N-1),2} + COMBO{RN,2}


The remaining steps are the same as the Apriori algorithm.



(3) In the general case, the association rule is of the form

band1 ^..^ bandM IMPLIES band(M+1) ^..^ bandN. 

This means there are (N-M) bands in the output.
We only generate those candidate itemsets in which at least
one item is in band j intervals (j = M+1,..,N).
Assume the candidate 2-itemsets are

{ band-k,range,  band-j,range }  where k=1,..,M,  j=M+1,..,N
or
{ band-m,range,  band-n,range }  where m=M+1,..,N-1; n=m+1,..,N.


The number of candidate 2-itemsets is

|C2|new = SUM(j=M+1,N)(COMBO{Rj,1}) * COMBO{SUM(k=1,M)(Rk),1} +

          SUM(j=M+1,N-1)( COMBO{Rj,1} * COMBO{SUM(k=j+1,N)(Rk),1} )


We prune those candidate itemsets in which no interval
is chosen from band j (j = M+1,..,N). The number of pruned candidate
2-itemsets is

|C2|prune2 = SUM(j=1,M-1)( COMBO{Rj,1} * COMBO{SUM(k=j+1,M)(Rk),1} )


Apply the new pruning technique described in 3.5.1.1; the number of
pruned candidate 2-itemsets is 

|C2|prune1 = SUM(j=1,N)(COMBO{Rj,2})


The total number of pruned candidate 2-itemsets is

|C2|prune = |C2|prune1 + |C2|prune2 

          = COMBO{SUM(k=1,M)(Rk),2} + SUM(j=M+1,N)( COMBO{Rj,2} )


The remaining steps are the same as the Apriori algorithm.

From a mathematical point of view, the larger Rj and N are, that is,
the larger the number of intervals and bands, the more
unnecessary candidate 2-itemsets the new algorithm prunes,
which greatly improves the performance of the whole mining process.
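
The general-case total can be checked by brute-force enumeration. In the
sketch below, r lists the number of large intervals per band and m is the
number of input (reflectance) bands; both names, and the sample values,
are assumptions chosen for illustration.

```python
from itertools import combinations
from math import comb

def candidates(r, m):
    """Candidate 2-itemsets kept by both pruning techniques.

    r[j] is the number of large intervals in band j+1; the first m bands are
    reflectance (input) bands and the remaining ones are output bands.
    """
    items = [(band, k) for band, rj in enumerate(r) for k in range(rj)]
    return [(a, b) for a, b in combinations(items, 2)
            if a[0] != b[0]                    # technique one: different bands
            and (a[0] >= m or b[0] >= m)]      # technique two: an output band

def pruned_total(r, m):
    """|C2|prune = COMBO{SUM(k=1,M)(Rk),2} + SUM(j=M+1,N)(COMBO{Rj,2})."""
    return comb(sum(r[:m]), 2) + sum(comb(rj, 2) for rj in r[m:])

r, m = [3, 4, 2, 5], 2           # two input bands, two output bands
kept = len(candidates(r, m))
print(kept, pruned_total(r, m), comb(sum(r), 2))   # 59 32 91
```

The kept and pruned counts add up to the Apriori count, which agrees with
the closed form for |C2|prune.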



3.6 Summary of the New Algorithm

The following summarizes the phases of the new algorithm, which adds
the new pruning techniques to the base algorithm.

Phase1.  Choose one of the partition methods (equal depth, uneven depth,
or discontinuous partition) to determine the intervals.

Phase2.  From the large 1-itemsets, apply the new pruning techniques
(technique one and technique two) to generate candidate 2-itemsets.

Phase3.  Apply the remaining steps of the Apriori algorithm.



3.7  An Example for Applying the New Algorithm

Consider the imagery data in Table 3.2. Assume this table contains four
attributes   (bands) and has five tuples (pixels).


Table 3.2.  An example database for the new algorithm

Pixel band1 band2 band3 band4
1     40    140   200   240
2     50    130   210   250
3     45    135   210   190
4    100    180    50   100
5    110    170    40   120

From the domain knowledge, we know that band 1, band 2, and band 3 refer
to the reflectance data, and band 4 refers to yield data.
The association rules the user would like to mine are of the form:
band1 ^ band2 ^ band3 IMPLIES band4.  That is, band1, band2,
and band3 are input data, and band4 is output data.

Following are the three phases for applying the new algorithm.

Phase1.  Assume the user selects equal depth partitioning. Diameter two
is used for band 1 and band 4;  diameter three is used for band 2 and band 3.
The number of intervals equals 2 POWER d. (See the notation for each
band interval and diameter in Tables 3.3 and 3.4.)


Table 3.3.  Diameter two for band 1 and band 4
       
         [0,63] [64,127] [128,191] [192,255]
 band1    b11    b12      b13       b14
 band4    b41    b42      b43       b44


Table 3.4.  Diameter three for band 2 and band 3

      [0,31] [32,63] [64,95] [96,127] [128,159] [160,191] [192,223] [224,255]
band2  b21    b22     b23     b24      b25       b26       b27       b28
band3  b31    b32     b33     b34      b35       b36       b37       b38


After selecting a partition method, we map each value of Table 3.2 into
intervals.  Figure 3.6 shows the result.


Pix b11 b12 b13 b14 b21.. b25 b26.. b28 b31 b32.. b37 b38 b41 b42 b43 b44 
1    1   0   0   0  0      1   0     0   0   0     1   0   0   0   0   1
2    1   0   0   0  0      1   0     0   0   0     1   0   0   0   0   1
3    1   0   0   0  0      1   0     0   0   0     1   0   0   0   1   0
4    0   1   0   0  0      0   1     0   0   1     0   0   0   1   0   0
5    0   1   0   0  0      0   1     0   0   1     0   0   0   1   0   0

Figure 3.6.   An example of partitioning the value into intervals.

Phase2.

We use the Apriori algorithm as the base mining algorithm and apply our new
pruning techniques for candidate 2-itemset generation.  We assume that
minsup = 40% and minconf = 60%. 

(1) Candidate 1-itemset.
All the intervals we separated belong to candidate 1-itemset. They 
are {b11, b12, b13, b14, b21, b22, b23, b24, b25, b26, b27, b28,
b31, b32, b33, b34, b35, b36, b37, b38, b41, b42, b43, b44}.

(2) Large 1-itemset. Since minsup = 40%, an itemset must be supported
by at least two of the five transactions.  We scan the database, count
the support for each candidate 1-itemset, and prune those whose support is
less than two.  We get the large 1-itemset {b11(3), b12(2), b25(3),
b26(2), b32(2), b37(3), b42(2), b44(2)}. (The number in parentheses is the
support count.)
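
Phase 1 and the large 1-itemset step can be sketched together. The sketch
assumes the item naming b<band><interval> of Tables 3.3 and 3.4 and finds
each interval by shifting away the low-order 8-d bits; the names DIAMETER,
items and L1 are illustrative.

```python
from collections import Counter

# Table 3.2: one row per pixel, columns band1..band4.
PIXELS = [(40, 140, 200, 240), (50, 130, 210, 250), (45, 135, 210, 190),
          (100, 180, 50, 100), (110, 170, 40, 120)]
DIAMETER = (2, 3, 3, 2)          # diameters chosen in Phase 1 for band1..band4
MIN_COUNT = 2                    # minsup = 40% of 5 pixels

def items(pixel):
    """Name each value's interval as b<band><interval>, e.g. 140 in band 2 -> b25."""
    return {f"b{band}{(value >> (8 - d)) + 1}"
            for band, (value, d) in enumerate(zip(pixel, DIAMETER), start=1)}

TRANSACTIONS = [items(p) for p in PIXELS]
counts = Counter(item for t in TRANSACTIONS for item in t)
L1 = {item: n for item, n in sorted(counts.items()) if n >= MIN_COUNT}
print(L1)
```

Only b43, which occurs in a single pixel, is dropped from the items that
actually occur in the data.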

(3) Candidate 2-itemsets.  Applying the new algorithm, we generate the
candidate 2-itemsets 

|C2|new = 2 x (2 + 2 + 2) = 12. 

They are {{b42, b11}, {b42, b12}, {b42, b25}, {b42, b26}, {b42, b32},
{b42, b37}, {b44, b11}, {b44, b12}, {b44, b25}, {b44, b26}, {b44, b32},
{b44, b37}}. 


According to the formula we derived,
applying pruning technique one results in

|C2|prune1 = 1 + 1 +1 + 1 = 4. 


Applying pruning technique two results in 

|C2|prune2 = 2 x (2 + 2) + 2 x 2 = 12.


 The total pruned number of candidate 2-itemsets is 

|C2|prune1  +  |C2|prune2 = 4 + 12 = 16. 


Applying the Apriori algorithm,
the number of candidate 2-itemsets is   

|C2|apriori = (8 x 7) / 2 = 28.


The percentage of pruned candidates is 16 / 28 = 57%.
The execution efficiency of the mining process is improved.
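
The counts in this step can be reproduced by enumerating the pairs of
large 1-itemsets directly. The sketch relies on the item naming convention
(the second character of an item name is its band number); the variable
names are illustrative.

```python
from itertools import combinations

L1 = ["b11", "b12", "b25", "b26", "b32", "b37", "b42", "b44"]
OUTPUT_BAND = "4"                      # band 4 (yield) is the output band

def band(item):
    return item[1]                     # "b25" -> "2"

all_pairs = list(combinations(L1, 2))  # what Apriori would generate
same_band = [p for p in all_pairs if band(p[0]) == band(p[1])]
no_output = [p for p in all_pairs if p not in same_band
             and OUTPUT_BAND not in (band(p[0]), band(p[1]))]
kept = [p for p in all_pairs if p not in same_band and p not in no_output]

pruned = len(same_band) + len(no_output)
print(len(all_pairs), len(same_band), len(no_output), len(kept))  # 28 4 12 12
print(f"pruned: {pruned / len(all_pairs):.0%}")                   # pruned: 57%
```

Here same_band corresponds to technique one and no_output to technique
two; the two pruned sets are disjoint by construction.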



Phase3. Remaining steps are the same as the Apriori algorithm.

(4) Large 2-itemsets.
Scan the database, and count the support for each candidate 2-itemset.

This results in the large 2-itemsets:
{{b42, b12}(2), {b42, b26}(2), {b42, b32}(2), {b44, b11}(2),
{b44, b25}(2), {b44, b37}(2)}.

(The number in parentheses is the support count.)


(5) Candidate 3-itemsets.
Applying the Apriori algorithm results in:  {{b42, b12, b26}, 
{b42, b12, b32}, {b42, b26, b32}, {b44, b11, b25}, {b44, b11, b37},
{b44, b25, b37}}.


(6) Large 3-itemsets. Scan the database, and count the support for
each candidate 3-itemset. This results in the large 3-itemsets:
{{b42, b12, b26}(2), {b42, b12, b32}(2), {b42, b26, b32}(2),
{b44, b11, b25}(2), {b44, b11, b37}(2), {b44, b25, b37}(2)}.

(The number in parentheses is the support count.)


(7) Candidate 4-itemsets. Applying the Apriori algorithm results in:
{{b42, b12, b26, b32}, { b44, b11, b25, b37}}.


(8) Large 4-itemsets. Scan the database, and count the support for
each candidate 4-itemset. This results in the large 4-itemsets:
{{b42, b12, b26, b32}(2), { b44, b11, b25, b37}(2)}.


(9) Candidate 5-itemset is empty. The large itemset generation
algorithm terminates.


(10) Derive association rules. Applying domain knowledge, we are interested
in those rules in which band 1, band 2, and  band 3 appear in the
antecedent and band 4 in the consequent.

We derive the following association rules.
(b12 ^ b26 ^ b32 occurs in 2 pixels, both of which contain b42;
b11 ^ b25 ^ b37 occurs in 3 pixels, 2 of which contain b44.)

b12 ^ b26 ^ b32 IMPLIES b42 ,  with support = 40% and confidence = 100%.

b11 ^ b25 ^ b37 IMPLIES b44 ,  with support = 40% and confidence = 66.7%.
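
Support and confidence for these two rules can be recomputed directly from
the item rows of Figure 3.6. This is a minimal sketch; the helper names
count and rule are hypothetical.

```python
# Items per pixel, read off Figure 3.6.
TRANSACTIONS = [
    {"b11", "b25", "b37", "b44"},
    {"b11", "b25", "b37", "b44"},
    {"b11", "b25", "b37", "b43"},
    {"b12", "b26", "b32", "b42"},
    {"b12", "b26", "b32", "b42"},
]

def count(itemset):
    """Number of pixels whose items include every item of `itemset`."""
    return sum(itemset <= t for t in TRANSACTIONS)

def rule(antecedent, consequent):
    """Return (support, confidence) of antecedent IMPLIES consequent."""
    both = count(antecedent | consequent)
    return both / len(TRANSACTIONS), both / count(antecedent)

for x, y in [({"b12", "b26", "b32"}, {"b42"}), ({"b11", "b25", "b37"}, {"b44"})]:
    s, c = rule(x, y)
    print(sorted(x), "IMPLIES", sorted(y), f"support = {s:.0%}, confidence = {c:.1%}")
```

Confidence is the support count of the whole itemset divided by the
support count of the antecedent alone.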



In precision agriculture, users' domain knowledge helps us eliminate some
unnecessary steps in deriving rules. For example, we need not compute the
confidence for rules such as

b11 implies b44 ^ b25 ^ b37,
b25 implies b44 ^ b11 ^ b37,
b37 implies b44 ^ b11 ^ b25, 
b11 ^ b25 implies b44 ^ b37,
b11 ^ b37 implies b44 ^ b25,  and
b25 ^ b37 implies b44 ^ b11. 



4. IMPLEMENTATION
	
A tool that implements the algorithm has been designed using
object-oriented technology. The programming language is Java. The major
consideration for the implementation is efficiency. As we discussed in
previous chapters, bit-wise operations are used throughout the implementation
to achieve maximum efficiency. A bit-oriented data structure similar to
a relational table was implemented, along with a join operation on it. The
candidate set generation and rule generation functions implement the new
pruning techniques. Finally, the tool was integrated with the SMILEY system.



4.1 SMILEY

Remotely sensed data will be used by potentially millions of users,
provided there is a good way to access and manipulate that data.
The common way to manipulate and analyze EOSDIS (Earth Observing
System Data and Information System) data is through a GIS
(Geographical Information System).  Usually, this method of
GIS distribution (both the software and the data) is costly and slow.
Furthermore, the cost of owning EOSDIS data is high.  These problems
may limit the application and use of remotely sensed data. 

SMILEY stands for Signature Miner & Interface Language for
Earth-systems, Yet-another.  SMILEY is a powerful on-line satellite
imagery analyzer and viewer.  SMILEY provides a general interface
from the World Wide Web using Internet technology to access, display,
and manipulate various sets of satellite data.  Users can eventually
use any computer platform from anywhere in the world to analyze
imagery data.  SMILEY uses the client/server model to fully
utilize the processing power of client machines for very quick
response time.  Users can do data mining based on pixel values or
apply band functions.  They can also use many predefined filters
or even define their own filter function to process the imagery data. 



4.2  Integration with SMILEY


4.2.1 New Feature Added into SMILEY

SMILEY is a WWW-based remotely sensed imagery data mining system.
It has many useful image processing functions, including band mining
and pixel mining. The new feature we implemented and added to SMILEY
is the association rules mining algorithm.  It has been integrated
with SMILEY so that users can mine the association rules from
remotely sensed data.



4.2.2  Functionalities of the New Tool

The tool is displayed in the lower part of the SMILEY screen.
It has two panels.  The left panel allows the user to
select parameters. It has a set of band choices and a set of level
choices.  The yield band is treated separately as a special band.
Level has four choices from 1 to 4.  It has the same meaning as a diameter.
Support and confidence are input through a specially designed
graphic object, an integer slider. The right panel of the lower
screen displays output.

The tool has two functions.  One is to display the data distribution
for each attribute. The other is association rule mining.

When a user selects bands and levels, the tool automatically displays
the data distribution of that band at the selected level in bar
chart format.  The bar chart tells the user, out of 250 * 250 pixels,
how many pixels fall into each range.  Figure 4.1 shows an example
for the yield value distribution.  The purpose of this function is to
give the user a feel for how the values are distributed so that he
or she can determine a reasonable value for support and a reasonable
depth for each band.
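
The counting behind such a bar chart can be sketched as follows. This is
only an illustration: it assumes 8-bit values (0 through 255) divided into
equal-width ranges, and the parameter num_ranges is hypothetical, since the
text does not say how a level choice maps to the number of ranges.

```python
def value_distribution(pixel_values, num_ranges, max_value=256):
    """Return the number of pixels falling into each equal-width range."""
    counts = [0] * num_ranges
    for v in pixel_values:
        # Integer arithmetic maps v in [0, max_value) to a range index.
        counts[v * num_ranges // max_value] += 1
    return counts


# Hypothetical toy band with 8 pixels instead of 250 * 250.
values = [0, 10, 63, 64, 100, 128, 200, 255]
counts = value_distribution(values, num_ranges=4)

# Text version of the bar chart: one row per range.
for i, c in enumerate(counts):
    low, high = i * 64, (i + 1) * 64
    print(f"[{low:3d},{high:3d}) {'#' * c} {c}")
```

A range with very few pixels cannot reach a high support threshold, which is
why looking at the distribution first helps in choosing support.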

The second function is association rules mining.  There are two ways
to display the result.  One uses plain English to describe the
rules; the other uses visualization.  Figure 4.2 gives an example
of displayed rules.  Clicking the "Association Rules" button displays
all rules that meet the user-defined confidence and support.
To visualize a rule, the user first selects it by clicking on it in
the rule list in the upper part of the display area and then clicks
the "Rule Visualization" button.  A visualized map is then displayed
in the left corner (see Figure 4.2).  The map uses at most four colors:
red, green, blue, and yellow.  For a rule of the form X => Y,
red means the pixel satisfies X but not Y.  Green means the
pixel satisfies Y but not X.  Blue means the pixel satisfies
neither X nor Y.  Yellow means the pixel satisfies both X and Y.
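
The color assignment can be written down directly. In the sketch below, the
function names, the pixel representation (a dict per pixel), and the example
rule are assumptions made for illustration only.

```python
def rule_color(satisfies_x, satisfies_y):
    """Map the truth values of X and Y for one pixel to a display color."""
    if satisfies_x and satisfies_y:
        return "yellow"  # both X and Y
    if satisfies_x:
        return "red"     # X but not Y
    if satisfies_y:
        return "green"   # Y but not X
    return "blue"        # neither X nor Y


def visualize_rule(pixels, antecedent, consequent):
    """Return one color per pixel for the rule antecedent => consequent."""
    return [rule_color(antecedent(p), consequent(p)) for p in pixels]


# Hypothetical rule: band3 >= 32  =>  yield >= 140.
pixels = [
    {"band3": 40, "yield": 150},
    {"band3": 40, "yield": 100},
    {"band3": 10, "yield": 150},
    {"band3": 10, "yield": 100},
]
colors = visualize_rule(
    pixels,
    antecedent=lambda p: p["band3"] >= 32,
    consequent=lambda p: p["yield"] >= 140,
)
print(colors)
```

For a rule with high confidence, the map should show little red relative to
yellow.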

Using this tool, users can interactively go through the whole
data mining process by tuning the number of bands, levels of each
band, support, and confidence. Interactivity is particularly useful
here because the mined rules may not meet users' expectations.
		


Figure 4.1.  An example of yield band value distribution.



Figure 4.2.  An example of displaying the association rules.




5. CONCLUSION

We defined a new data mining problem--mining association rules from
remotely sensed data--and its application in precision agriculture.
The problem differs from the standard task in that quantitative
attributes must be handled.  We discussed several different ways to
partition quantitative attributes: into equal depth, uneven depth,
and discontinuous intervals.  Equal depth partitioning is
straightforward and easy to implement.  On the other hand, uneven
and discontinuous partitions can help to limit the search space and
improve the accuracy of the mined rules.

Since the efficiency of a mining algorithm is a very important
issue of data mining, we proposed two simple and effective pruning
techniques for candidate 2-itemset generation.  Performance analysis
was done through experiments. Experimental results show that by 
exploiting the nature of the problem and the characteristics of
remotely sensed data, we can prune a significant number of
unnecessary candidate itemsets during the very early phase 
of the mining process.  We presented an algorithm that applies the
new pruning techniques to the base algorithm. We also implemented
a tool in Java.  This tool has been successfully integrated with
SMILEY, a WWW-based imagery viewer and analyzer.


BIBLIOGRAPHY

[AIS93]
R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules
Between Sets of Items in Large Databases. Proc. ACM-SIGMOD
International Conference, Washington, DC, May 1993.

[AS94]
R. Agrawal and R. Srikant, Fast Algorithms for Mining Association 
Rules. Proc. International Conference on Very Large Databases, Santiago, 
Chile, September 1994.

[CHY96]
Ming-Syan Chen, Jiawei Han, and Philip S. Yu, Data Mining: An
Overview from a Database Perspective. IEEE Transactions on Knowledge
and Data Engineering, pp. 866-881, Vol. 8, No. 6, December 1996.

[CNF96]
D. W. Cheung, U. T. Ng, A. W. Fu, and Y. J. Fu, Efficient Mining of
Association Rules in Distributed Databases. IEEE Transactions on
Knowledge and Data Engineering, pp. 911-922, Vol. 8, No. 6, December 1996.

[DRE96]
Stuart E. Dreyfus, The Art and Theory of Dynamic Programming.
Academic Press, Inc., San Diego, CA, 1996.

[FL98]
Alex A. Freitas and Simon H. Lavington, Mining Very Large Databases
with Parallel Processing. Kluwer Academic Publishers,
Dordrecht, the Netherlands, 1998.

[FPS96]
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors,
Advances in Knowledge Discovery and Data Mining. MIT Press,
Cambridge, MA, 1996.

[HKK96]
E.H. Han, G. Karypis, and Vipin Kumar, Clustering Based on  
Association Rule Hypergraphs. Proc. 1997 SIGMOD Workshop on 
Research Issues on Data Mining and Knowledge Discovery, Tucson, AZ, 
May 1997.

[PCY97]
J. S. Park, M. S. Chen, and P. S. Yu, Using a Hash-Based Method with
Transaction Trimming for Mining Association Rules. IEEE Transactions
on Knowledge and Data Engineering, pp. 813-824, Vol. 9, No. 5,
September/October 1997.

[SA96]
R. Srikant and R. Agrawal, Mining Quantitative Association Rules in 
Large Relational Tables. Proc. ACM-SIGMOD International Conference, 
Montreal, Quebec, Canada, June 1996.

[WS91]
J. Way and E. A. Smith, The Evolution of Synthetic Aperture Radar
Systems and Their Progression to the EOS SAR. IEEE Transactions on
Geoscience and Remote Sensing, pp. 962-985, Vol. 29, No. 6, 1991.

[YC96]
Show-Jane Yen and Arbee L.P. Chen, An Efficient Approach to
Discovering Knowledge from Large Databases. Proc. Fourth International
Conference on PDIS, Miami Beach, FL, 1996.

*******************************

APPENDIX B:

Another view of precision ag data mining:

Here is a simple RSI datamining problem to solve:
Given an interval in the yield values,    y in [a,d)

data mine for intervals in the reflectance bands, i.e.,
b1 in [a1,d1), b2 in [a2,d2), . . .,  b7 in [a7,d7) such that if a pixel has
reflectances in each of these intervals, then there is high confidence that
it will have yield in [a,d).

There are at least two measures of confidence to use, a symmetric confidence
ratio, scr, and an asymmetric confidence ratio, acr .  First some notation.

Let A=(a1,a2,...,a7);
    D=(d1,d2,...,d7) and
[A,D)=( [a1,d1),...,[a7,d7) )

Let XAD be the set of pixels in question, namely satisfying:
    XAD = { p | bi(p) in [ai,di); i=1,..,7 } or we could write it as
                XAD = { P | B(P) in [A,D) } where caps indicate arrays;   and
    Yad = { p |  y(p) in [a,d) }  or  Yad = { P | y(P) in [a,d) }.

Then we are looking for rules,   XAD => Yad,   which have high confidence.

Note that any interval can be made one-sided or universal by setting its
a to 0 and/or its d to 256 (reflectance values are 8-bit, 0 through 255).
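
A minimal sketch of how the two pixel sets could be built is given below.
The function names, the dict-based pixel table, and the use of only two
bands instead of seven are assumptions made to keep the example short.

```python
def in_intervals(values, lows, highs):
    """True if every value lies in its half-open interval [low, high)."""
    return all(lo <= v < hi for v, lo, hi in zip(values, lows, highs))


def build_sets(pixels, A, D, a, d):
    """Return (XAD, Yad) as sets of pixel ids.

    pixels maps a pixel id to (band values, yield value).
    """
    XAD = {p for p, (bands, _) in pixels.items() if in_intervals(bands, A, D)}
    Yad = {p for p, (_, y) in pixels.items() if a <= y < d}
    return XAD, Yad


# Hypothetical toy table: pixel id -> ((band1, band2), yield).
pixels = {
    0: ((6, 210), 120),
    1: ((2, 209), 121),
    2: ((3, 213), 145),
    3: ((11, 29), 154),
}

# band1 in [0,8); band2 in [200,256), i.e. one-sided since d2 = 256.
XAD, Yad = build_sets(pixels, A=(0, 200), D=(8, 256), a=140, d=200)
print(XAD, Yad)
```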

The symmetric confidence ratio is defined:
   scr(XAD=>Yad) = CARD{XAD INTERSECT Yad} / CARD{XAD UNION Yad}  which will
   be a large ratio (close to one) if most pixels satisfying XAD also satisfy Yad
   and also most pixels satisfying Yad also satisfy XAD (so that XAD is a
   signature for Yad, in the sense that XAD is a necessary and sufficient
   condition for high yield, for instance).

        .-XAD--------------------.
	| .----Yad---------------|.
	| |                      ||
	| |                      ||
	| |                      ||
	| |                      ||
	| |                      ||
	| |                      ||
	| |                      ||
	`-|----------------------'|
	  `-----------------------'

The asymmetric confidence ratio is defined:
   acr(XAD=>Yad) = CARD{XAD INTERSECT Yad} / CARD{XAD}  which will
   be a large ratio (close to one) if most pixels satisfying XAD also satisfy Yad,
   even if many pixels satisfying Yad do not satisfy XAD (so that XAD implies Yad
   but Yad does not imply XAD; that is, XAD is not the only interval implying Yad).
   This would be a situation where we have discovered reflectances that give
   high yield, but there may be other reflectances that also give high
   yield, so XAD is a sufficient condition for high yield but not a necessary
   one.

        .-Yad--------------------.
	|                    .XAD-.
	|                    |   ||
	|                    `---|'
	|                        | 
	|                        | 
	|                        | 
	|                        | 
	|                        | 
	`------------------------' 
	                           
We may be interested in another asymmetric confidence ratio, which I will call
the "reverse asymmetric confidence ratio", or racr, as follows:
   racr(XAD=>Yad) = CARD{XAD INTERSECT Yad} / CARD{Yad} which would
   tell us, for instance, that XAD is a necessary condition for high yield but
   not sufficient.


        .-XAD--------------------.
	|                    .Yad-.
	|                    |   ||
	|                    `---|'
	|                        | 
	|                        | 
	|                        | 
	|                        | 
	|                        | 
	`------------------------' 
	                           

Just for completeness, I note that each of these confidence ratio definitions
can be given in inverted form (in which case we would be looking for low ratios
rather than high), iscr, iacr and iracr, as follows:
  iscr(XAD=>Yad) = CARD{XAD SYMMETRIC-DIFFERENCE Yad} / CARD{XAD UNION Yad}
  iacr(XAD=>Yad) = CARD{XAD MINUS Yad} / CARD{XAD}
  iracr(XAD=>Yad) = CARD{Yad MINUS XAD} / CARD{Yad}

Each inverted ratio is the complement (not the reciprocal) of the standard
one, because each numerator counts exactly the pixels of the denominator
set that are missing from XAD INTERSECT Yad:
    scr = 1 - iscr
    acr = 1 - iacr
   racr = 1 - iracr
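
A quick check on small sets helps settle how the inverted ratios relate to
the standard ones. The sketch below implements the six definitions with
Python sets; the two example sets are made up for illustration. In this
example each inverted ratio comes out as one minus the standard ratio.

```python
def scr(X, Y):
    return len(X & Y) / len(X | Y)


def acr(X, Y):
    return len(X & Y) / len(X)


def racr(X, Y):
    return len(X & Y) / len(Y)


def iscr(X, Y):
    return len(X ^ Y) / len(X | Y)  # ^ is the symmetric difference


def iacr(X, Y):
    return len(X - Y) / len(X)


def iracr(X, Y):
    return len(Y - X) / len(Y)


# Hypothetical pixel id sets: 4 pixels in XAD, 6 in Yad, 2 in both.
XAD = {1, 2, 3, 4}
Yad = {3, 4, 5, 6, 7, 8}

for name, std, inv in [("scr", scr, iscr), ("acr", acr, iacr),
                       ("racr", racr, iracr)]:
    s, i = std(XAD, Yad), inv(XAD, Yad)
    print(f"{name}={s:.3f}  i{name}={i:.3f}  sum={s + i:.3f}")
```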



So we have to combine (join) a relation with the scheme:

Pixel  Band-one    Band-two    ...  Band-seven
-----  ---------   ---------        ---------
0      0000 0110   1101 0010        1010 0101
1      0000 0010   1101 0001        1010 0111
2      0000 0011   1101 0101        1001 0101
  .
  .
  .
40M    0000 1011   0001 1101        0101 0001

with Agricultural Practices data such as yield
and quality (eg, protein level) into one "large relation":


Pixel  Band-one    Band-two    ...  Band-seven   Yield   Quality
-----  ---------   ---------        ---------    -----   -------
0      0000 0110   1101 0010        1010 0101      120        13
1      0000 0010   1101 0001        1010 0111      121        14
2      0000 0011   1101 0101        1001 0101      121        12
  .
  .
  .
40M    0000 1011   0001 1101        0101 0001      154        18


and then do the mining described in the paper, looking for association
rules with a given level of confidence and support of the type:

"If   1000 0000 < band4 < 1000 1000    and
      0010 0000 < band3 < 1111 1111

then  Yield > 140 bushels per acre and Quality > 13%"
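
The join and the evaluation of one such rule can be sketched as below. The
table contents, the function names, and the dict-of-dicts representation are
assumptions for illustration; a real data set would have millions of pixels
and all seven bands.

```python
def join_on_pixel(bands, practices):
    """Join two tables keyed by pixel id into one "large relation"."""
    common = bands.keys() & practices.keys()
    return {p: {**bands[p], **practices[p]} for p in common}


def rule_support_confidence(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_x = sum(1 for r in rows if antecedent(r))
    n_xy = sum(1 for r in rows if antecedent(r) and consequent(r))
    support = n_xy / len(rows)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical toy tables with 4 pixels.
bands = {
    0: {"band3": 0b01000000, "band4": 0b10000100},
    1: {"band3": 0b00110000, "band4": 0b10000010},
    2: {"band3": 0b00010000, "band4": 0b10000001},
    3: {"band3": 0b01100000, "band4": 0b01000000},
}
practices = {
    0: {"yield": 150, "quality": 14},
    1: {"yield": 120, "quality": 15},
    2: {"yield": 160, "quality": 16},
    3: {"yield": 155, "quality": 18},
}

relation = join_on_pixel(bands, practices)
support, confidence = rule_support_confidence(
    list(relation.values()),
    antecedent=lambda r: (0b10000000 < r["band4"] < 0b10001000
                          and 0b00100000 < r["band3"] < 0b11111111),
    consequent=lambda r: r["yield"] > 140 and r["quality"] > 13,
)
print(support, confidence)
```

The confidence computed here corresponds to the asymmetric confidence ratio
acr defined above.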




perrizo@plains.nodak.edu