Cluster Analysis



What is Cluster Analysis?

- The process of grouping a set of physical or abstract objects
  into classes of similar objects.


What are some typical applications of clustering?

- Business: Help marketers discover distinct groups in their customer bases
  and characterize those customer groups.
- Biology: Derive plant and animal taxonomies; categorize genes with similar
  functionality; gain insight into structures inherent in populations.
- Land use: Identify areas of similar land use in an earth observation database.
- Insurance: Identify groups of houses in a city according to house type, value,
  and geographic location; identify groups of policy holders by average claim cost.
- WWW: Classify documents on the web for information discovery.
- Data Mining: Stand-alone tool to gain insight into the distribution of data
  and to observe the characteristics of each cluster.

- Data clustering includes contributions from data mining, statistics, machine
  learning, spatial databases, biology, and marketing.
- As a branch of statistics, cluster analysis has been studied extensively:
  - focused mainly on distance-based cluster analysis,
  - tools are built into S-Plus, SPSS, SAS.
- In machine learning, cluster analysis is an example of unsupervised learning:
  - it does not rely on predefined classes or class-labeled training examples,
  - it is a form of learning by observation, rather than learning by example.
- In conceptual clustering, a group of objects forms a class only if it is
  describable by a concept.
  - This differs from conventional clustering, which measures similarity based on distance.
  - Conceptual clustering consists of two components:
    (1) it discovers the appropriate classes,
    (2) it forms descriptions for each class, as in classification.
  - The guideline of striving for high intraclass similarity and low interclass
    similarity still applies.
- In data mining, cluster analysis has focused on:
  - finding methods for efficient and effective cluster analysis in large databases,
  - scalability of clustering methods,
  - effectiveness of methods for clustering complex shapes and types of data,
  - high-dimensional clustering techniques,
  - methods for clustering mixed numerical and categorical data in large DBs.
- The following are typical requirements of clustering in data mining:
  - Scalability (can larger datasets be clustered in reasonable time?)
  - Ability to deal with different types of attributes
    (binary, categorical (nominal), ordinal, mixtures)
  - Discovery of clusters with arbitrary shape
    (Euclidean or Manhattan distance tends to produce spherical clusters with
    similar size and density; what about arbitrary shapes? "Spatial clustering"
    deals with shaped clusters)
  - Minimal requirements for domain knowledge to determine input parameters
    (parameters such as # of clusters may be hard to determine a priori;
    the clustering algorithm should be robust and insensitive with respect to its inputs)
  - Ability to deal with noisy data (a "noise" point is also called an "outlier")
    (insensitivity to outliers, missing data, unknowns, erroneous data, ...)
  - Insensitivity to order of input records
  - High dimensionality
    (the human eye is not good at judging cluster quality in more than 3 dimensions)
  - Constraint-based clustering
    (there may be side constraints as well as a "distance";
    "spatial clustering" deals with side conditions also)
  - Interpretability and usability
    (users expect interpretable, comprehensible, and usable results)
- Study of clustering methods proceeds as follows:
  - present a general categorization of clustering methods,
  - study each method in detail, including methods based on partitioning,
    hierarchical, density-based, grid-based, and model-based approaches,
  - examine high-dimensionality and do outlier analysis.

8.2 Types of Data That Occur in Cluster Analysis (and how to preprocess them)

- Suppose the dataset to be clustered contains n objects, which may represent
  persons, houses, documents, countries, pixels, genes...
- Clustering algorithms typically operate on either a "data matrix" or a
  "dissimilarity matrix".
- Data Matrix (or object-by-variable structure, or "two mode"):
  - represents n objects (e.g., persons) by p variables (measurements or
    attributes such as height, weight, gender, race...). The structure is in the
    form of a relational table or n-by-p matrix (n objects, p variables).
  - in our spatial DM the objects are pixels and the attributes are bands.

      x11  x12  ...  x1p
      x21  x22  ...  x2p
       :    :         :
      xn1  xn2  ...  xnp

  - This is the relational or table view of the data, R(K,A1,...,Ap), where K is
    a key (id) attribute that identifies objects uniquely and each Ai is a
    column in the matrix.
  - in spatial DM this is just the REL organization, in which each row is a
    tuple corresponding to a particular pixel.
- Dissimilarity Matrix (or object-by-object structure, or "one mode"):
  - Stores a collection of proximities available for all pairs of the n objects.
    Often represented by an n-by-n table:

      0
      d(2,1)  0
        :       :
      d(n,1)  d(n,2)  ...  0

  - where d(i,j) is the measured difference or dissimilarity between objects i and j.
  - d(i,j) is a non-negative number that is close to 0 when objects i and j are
    similar or "near".
  - in our precision ag example, a distance measure might be: the distance
    between two tuples, t and t', is
      |2*t.Y + t.SM + t.N - (2*t'.Y + t'.SM + t'.N)|
    This was used, essentially, by Kaushik Das in his thesis work. We will look
    at his clustering software later (based on SOMs and NNs).
  - Many clustering algorithms operate on a dissimilarity matrix, but a data
    matrix can be transformed into a dissimilarity matrix.
  - in the spatial setting, this can be a prohibitively large matrix:
    for a TM scene, it is ~(40,000,000)^2 / 2, or 800,000,000,000,000 cells
    (800 trillion!)

8.2.1 Interval-scaled Variables (continuous, on a linear scale: weight, height, lat, lon, ...)

- Units can affect clustering results (inches versus meters):
  smaller units lead to larger ranges for that variable and therefore a larger
  clustering effect for that variable.
- To avoid unit effects, data should be standardized, i.e., converted to
  "unitless" measurements:
  (1) Calculate the mean absolute deviation of a variable (attribute) f:
        sf = (|x1f-mf| + |x2f-mf| + ... + |xnf-mf|) / n
      where mf = mean of f = (x1f + ... + xnf) / n
  (2) Calculate the standardized measurement, or z-score:
        zif = (xif - mf) / sf
      where sf is the mean absolute deviation (the distance from the mean is not squared)
  (3) median absolute deviation...
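
Steps (1) and (2) above can be sketched in Python as follows. This is a minimal illustration, not part of the original overheads: the function name `standardize` and the sample heights are assumptions chosen for the example. It shows that the same measurements expressed in inches and in meters give (up to rounding) the same z-scores.

```python
def standardize(values):
    """Return z-scores of one variable f, scaled by its mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mf: mean of f
    s_f = sum(abs(x - m_f) for x in values) / n    # sf: mean absolute deviation
    if s_f == 0:
        # All values are equal, so every standardized value is 0.
        return [0.0] * n
    return [(x - m_f) / s_f for x in values]       # zif = (xif - mf) / sf


# Hypothetical sample: three heights in inches, then the same heights in meters.
heights_in = [60.0, 66.0, 72.0]
heights_m = [h * 0.0254 for h in heights_in]

print(standardize(heights_in))
print(standardize(heights_m))
```

Because both the numerator and sf scale with the unit, the unit cancels in the ratio, which is why the standardized values no longer depend on inches versus meters.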
- Once standardized, similarity/dissimilarity is calculated based on "distance":
  (1) Euclidean:  d(i,j) = SQRT(|xi1-xj1|^2 + ... + |xip-xjp|^2)
  (2) Manhattan:  d(i,j) = |xi1-xj1| + ... + |xip-xjp|
      Both are reflexive ( d(i,i)=0 ), symmetric ( d(i,j)=d(j,i) ) and
      subtransitive ( satisfy the triangle inequality d(i,j) <= d(i,h)+d(h,j) )
  (3) Minkowski (generalization of both; q=1 is Manhattan, q=2 is Euclidean):
        d(i,j) = (|xi1-xj1|^q + ... + |xip-xjp|^q)^(1/q)
  (4) weighted Minkowski:
        d(i,j) = (w1*|xi1-xj1|^q + ... + wp*|xip-xjp|^q)^(1/q)

A Categorization of Major Clustering Methods

Partitioning Methods

- Given a DB of n objects (tuples), a partitioning method constructs k
  partitions of the data, where each partition represents a cluster and k<=n
  - i.e., classify into k groups that together satisfy:
    (1) each group must contain >=1 object,
    (2) each object must belong to exactly 1 group
    (partitions are mutually exclusive and collectively exhaustive)
    (can be relaxed to a fuzzy partition)
- Given k (# of partitions to construct), create an initial partition, then use
  an iterative relocation technique that attempts to improve the partitioning
  by moving objects between groups.
- The general criterion for a good partitioning is that same-cluster objects are
  "close" and different-cluster objects are "far apart".
- To achieve global optimality would require exhaustive enumeration of all
  possibilities.
- Heuristics:
  (1) k-means algorithm, where each cluster is represented by the mean value of
      its objects
  (2) k-medoids algorithm, where each cluster is represented by an object near
      the center (center = 1st moment - minimizes the sum of the distances from
      it to its cluster mates. center = 2nd moment, etc.)
- Works well for finding spherical clusters in small to medium sized datasets.

Hierarchical Methods

- Agglomerative (bottom-up) (starts with each object in its own cluster)
- Divisive (top-down) (starts with all objects in one cluster)

  Agglomerative  step0      step1      step2      step3      step4
  (AGNES)        -----+----------+----------+----------+----------+-->

                 a--------.
                           ab-----------------------------.
                 b--------'                                 abcde
                 c-------------------------------cde------'
                 d-------------------.          /
                                      de-------'
                 e-------------------'

  Divisive       step4      step3      step2      step1      step0
  (DIANA)        <----+----------+----------+----------+----------+---

- AGNES (AGglomerative NESting) places each object in its own cluster initially.
  Two clusters are merged at each iteration according to some criterion, usually
  minimum cluster distance (see the options for distance between 2 clusters below).
- DIANA (DIvisive ANAlysis): all objects form one cluster initially. Clusters
  are split according to some principle, usually maximum pairwise cluster distance.
- In either, the user can specify the desired number of clusters as a
  termination condition.
- Four widely used cluster distances are (where |p-q| is the distance between
  objects p and q, mi is the mean of Ci, and ni is the number of objects in Ci):
  1. Minimum distance:  Dmin(Ci,Cj)  = min(p in Ci, q in Cj) |p-q|
  2. Maximum distance:  Dmax(Ci,Cj)  = max(p in Ci, q in Cj) |p-q|
  3. Mean distance:     Dmean(Ci,Cj) = |mi - mj|
  4. Average distance:  Davg(Ci,Cj)  = 1/(ni*nj) SUM(p in Ci) SUM(q in Cj) |p-q|
- Hierarchical methods suffer from the fact that once a split or merge is done,
  it cannot be undone (result in error?).
- Improvements:
  (1) perform careful analysis of object linkages at each hierarchical
      clustering step (CURE and Chameleon)
  (2) integrate hierarchical agglomeration and iterative relocation by first
      using a hierarchical agglomerative algorithm and then refining the result
      using iterative relocation (BIRCH)

Density-based Methods (non-distance based)

- continue growing the given cluster as long as the density (# of objects or
  data points in the neighborhood) exceeds some threshold (DBSCAN, OPTICS)

Grid-based Methods (all clustering operations are performed on a grid structure)

- fast processing, independent of the # of data objects and dependent only on
  the # of cells in each dimension of the quantized space
  (STING, CLIQUE, WaveCluster)

Model-based Methods (hypothesize a model for each of the clusters, and find the
best fit of the data to the given model)
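
The four inter-cluster distances listed under Hierarchical Methods can be sketched in Python as follows. This is only an illustration under stated assumptions: the object distance |p-q| is taken to be the (optionally weighted) Minkowski distance with parameter q, and the function names and the two sample clusters are hypothetical, not from the overheads.

```python
def minkowski(x, y, q=2, w=None):
    """Weighted Minkowski distance; q=1 is Manhattan, q=2 is Euclidean."""
    if w is None:
        w = [1.0] * len(x)
    return sum(wi * abs(a - b) ** q for wi, a, b in zip(w, x, y)) ** (1.0 / q)


def mean_point(cluster):
    """Component-wise mean (mi) of the objects in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]


def d_min(ci, cj, q=2):
    return min(minkowski(p, r, q) for p in ci for r in cj)


def d_max(ci, cj, q=2):
    return max(minkowski(p, r, q) for p in ci for r in cj)


def d_mean(ci, cj, q=2):
    return minkowski(mean_point(ci), mean_point(cj), q)


def d_avg(ci, cj, q=2):
    total = sum(minkowski(p, r, q) for p in ci for r in cj)
    return total / (len(ci) * len(cj))


# Hypothetical clusters of 2-dimensional objects.
ci = [(0, 0), (0, 2)]
cj = [(3, 0), (3, 4)]
for name, fn in [("Dmin", d_min), ("Dmax", d_max), ("Dmean", d_mean), ("Davg", d_avg)]:
    print(name, fn(ci, cj))
```

An AGNES-style merge step would pick the pair of clusters that minimizes one of these functions; which function is used determines the shape of the clusters that emerge.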
Should Ptrees lend themselves to grid-based clustering methods (since the
recursive quadrantization is a gridding of the space)? Or is the gridding
usually on attributes other than the key attribute? However, if we grid
(quadrantize) on the other attributes, the resulting structure should serve the
grid approach to clustering well.

- Construct a Ptree of the Pcube?

  [Figure: a 4 x 4 x 4 cube with axes Band1, Band2, and Band3, each axis divided
  into cells 0=00, 1=01, 2=10, 3=11; arrows 0->1, 2->3, 4->5, etc. are drawn
  through the cells.]

  Gives a Ptree with fanout=8 (focus on the rootcounts of each tree only).

  Root
    children (8):
      P(0,0,0) P(0,0,1) P(0,1,0) P(0,1,1) P(1,0,0) P(1,0,1) P(1,1,0) P(1,1,1)
    each child again has 8 children; e.g., the children of P(1,0,0) are:
      P(10,00,00) P(10,00,01) P(10,01,00) P(10,01,01)
      P(11,00,00) P(11,00,01) P(11,01,00) P(11,01,01)

We certainly can look for grid-based clusters in this tree, but it is LARGE in general.
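
The root counts of the fanout-8 tree described above can be sketched in Python as follows. This is a rough illustration only: the helper names and the five sample pixels are hypothetical, each band is assumed to hold a 2-bit value, and a node P(b1,b2,b3) at a given depth is assumed to count the pixels whose band values start with those bits.

```python
from collections import Counter


def octant_key(pixel, depth, bits):
    """Top `depth` bits of each band value, e.g. (0b10, 0b00, 0b01)."""
    shift = bits - depth
    return tuple(value >> shift for value in pixel)


def root_counts(pixels, depth, bits=2):
    """Count pixels per node at the given depth of the fanout-8 tree."""
    return Counter(octant_key(p, depth, bits) for p in pixels)


def label(key, depth):
    """Format a node key the way the tree labels its nodes, e.g. P(10,00,01)."""
    return "P(" + ",".join(format(v, "0{}b".format(depth)) for v in key) + ")"


# Hypothetical (Band1, Band2, Band3) values, 2 bits per band.
pixels = [(0, 0, 0), (1, 0, 0), (2, 0, 1), (3, 1, 1), (2, 0, 0)]

for depth in (1, 2):
    for key, count in sorted(root_counts(pixels, depth).items()):
        print(label(key, depth), count)
```

Only nonempty nodes appear in the counter, which hints at why a sparse representation matters: the full tree has 8^depth nodes per level regardless of how many are occupied.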
If we are interested only in "dense clusters" we could place a 1 in a node only
if the octant has more than, e.g., twice its share (i.e., at depth 1: more than
1/4 of the total count), etc.
- then we have a Boolean tree which should identify clusters.
- how about a 1-bit iff the octant has more than its share? (compression not as good?)

Partitioning Methods (more detail)

- Given a database with n objects (tuples) and k = # of clusters to form, a
  partitioning algorithm organizes the objects into k partitions, where each
  partition represents a cluster.
- The clusters are formed to optimize an objective partitioning criterion, often
  called a "similarity function" (e.g., distance), so that objects within a
  cluster are similar and objects of different clusters are dissimilar in terms
  of the database attributes.

Classical Partitioning Methods: k-means and k-medoids

These are the most well-known and commonly used partitioning methods.

Algorithm: (k-means: based on the mean value of the objects in the cluster)
  Input:  The number of clusters, k, and a database containing n objects.
  Output: A set of k clusters that minimizes the squared-error criterion.
  Method:
    (1) arbitrarily choose k objects as the initial cluster centers;
    (2) repeat
    (3)   (re)assign each object to the cluster to which the object is most
          similar, based on the mean value of the objects in the cluster
          ( using E = SUM(i=1..k)[ SUM(p in Ci)[ |p-mi|^2 ] ], where mi = mean of Ci );
    (4)   update the cluster means, i.e., calculate the mean value of the
          objects for each cluster;
    (5) until no change;

P-trees might be used in applying k-means-CPM as follows.
Data:
  X-Y   B1    B2    B3    B4
  0,0   0011  0111  1000  1011
  0,1   0011  0011  1000  1111
  0,2   0111  0011  0100  1011
  0,3   0111  0010  0101  1011
  1,0   0011  0111  1000  1011
  1,1   0011  0011  1000  1011
  1,2   0111  0011  0100  1011
  1,3   0111  0010  0101  1011
  2,0   0010  1011  1000  1111
  2,1   0010  1011  1000  1111
  2,2   1010  1010  0100  1011
  2,3   1111  1010  0100  1011
  3,0   0010  1011  1000  1111
  3,1   1010  1011  1000  1111
  3,2   1111  1010  0100  1011
  3,3   1111  1010  0100  1011

If we consider only 4-bit values and the corresponding P-trees, the root counts
are:

  P1,0000 P1,0100 P1,1000 P1,1100 P1,0010 P1,0110 P1,1010 P1,1110
     0       0       0       0       3       0       2       0
  P1,0001 P1,0101 P1,1001 P1,1101 P1,0011 P1,0111 P1,1011 P1,1111
     0       0       0       0       4       4       0       3

  P2,0000 P2,0100 P2,1000 P2,1100 P2,0010 P2,0110 P2,1010 P2,1110
     0       0       0       0       2       0       4       0
  P2,0001 P2,0101 P2,1001 P2,1101 P2,0011 P2,0111 P2,1011 P2,1111
     0       0       0       0       4       2       4       0

  P3,0000 P3,0100 P3,1000 P3,1100 P3,0010 P3,0110 P3,1010 P3,1110
     0       6       8       0       0       0       0       0
  P3,0001 P3,0101 P3,1001 P3,1101 P3,0011 P3,0111 P3,1011 P3,1111
     0       2       0       0       0       0       0       0

  P4,0000 P4,0100 P4,1000 P4,1100 P4,0010 P4,0110 P4,1010 P4,1110
     0       0       0       0       0       0       0       0
  P4,0001 P4,0101 P4,1001 P4,1101 P4,0011 P4,0111 P4,1011 P4,1111
     0       0       0       0       0       0      11       5

We first isolate only the nonzero trees. Each tree is written as
  root count | four quadrant counts | one leaf code per mixed quadrant
(the leaf codes are expanded into explicit bit patterns further below):

  P1,0010    3 | 0 0 3 0 | 3
  P1,1010    2 | 0 0 1 1 | 3 0
  P1,0011    4 | 4 0 0 0 |
  P1,0111    4 | 0 4 0 0 |
  P1,1111    3 | 0 0 0 3 | 0

  P2,0010    2 | 0 2 0 0 | 4
  P2,1010    4 | 0 0 0 4 |
  P2,0011    4 | 2 2 0 0 | 4 1
  P2,0111    2 | 2 0 0 0 | 1
  P2,1011    4 | 0 0 4 0 |

  P3,0100    6 | 0 2 0 4 | 1
  P3,1000    8 | 4 0 4 0 |
  P3,0101    2 | 0 2 0 0 | 4

  P4,1011   11 | 3 4 0 4 | 1
  P4,1111    5 | 1 0 4 0 | 1

Then AND these to form 5*5*3*2=150 "tuple P-trees" (some may be zero trees):

  P-0010,0010,0100,1011 is a zero-tree
  (every P-0010,0010,xxxx,yyyy is a zero-tree)
  (every P-0010,bbbb,xxxx,yyyy is a zero-tree except bbbb=1011, the last one)
  (every P-0010,bbbb,xxxx,yyyy is a zero-tree except bbbb=1011, xxxx=1000, yyyy=1111)

  P-0010,1011,1000,1111    3 | 0 0 3 0 | 3

Moving to P-aaaa,bbbb,xxxx,yyyy where aaaa=1010, the only combinations needed:

  P1,1010    2 | 0 0 1 1 | 3 0
  P2,1010    4 | 0 0 0 4 |
  P2,1011    4 | 0 0 4 0 |
  P3,0100    6 | 0 2 0 4 | 1
  P3,1000    8 | 4 0 4 0 |
  P4,1011   11 | 3 4 0 4 | 1
  P4,1111    5 | 1 0 4 0 | 1

  P-1010,1010,0100,1011    1 | 0 0 0 1 | 0
  P-1010,1010,0100,1111 is a zero-tree.
  P-1010,1010,1000,1011 is a zero-tree.
  P-1010,1010,1000,1111 is a zero-tree.
  P-1010,1011,0100,1011 is a zero-tree.
  P-1010,1011,0100,1111 is a zero-tree.
  P-1010,1011,1000,1011 is a zero-tree.
  P-1010,1011,1000,1111    1 | 0 0 1 0 | 3

Moving to P-aaaa,bbbb,xxxx,yyyy where aaaa=0011, the only combinations needed:

  P1,0011    4 | 4 0 0 0 |
  P2,0011    4 | 2 2 0 0 | 4 1
  P2,0111    2 | 2 0 0 0 | 1
  P3,1000    8 | 4 0 4 0 |
  P4,1011   11 | 3 4 0 4 | 1
  P4,1111    5 | 1 0 4 0 | 1

  P-0011,0011,1000,1011    1 | 1 0 0 0 | 3
  P-0011,0011,1000,1111    1 | 1 0 0 0 | 1
  P-0011,0111,1000,1011    2 | 2 0 0 0 | 1
  P-0011,0111,1000,1111 is a zero-tree

Moving to P-aaaa,bbbb,xxxx,yyyy where aaaa=0111, the only combinations needed:

  P1,0111    4 | 0 4 0 0 |
  P2,0010    2 | 0 2 0 0 | 4
  P2,0011    4 | 2 2 0 0 | 4 1
  P3,0100    6 | 0 2 0 4 | 1
  P3,0101    2 | 0 2 0 0 | 4
  P4,1011   11 | 3 4 0 4 | 1

  P-0111,0010,0100,1011 is a zero-tree
  P-0111,0010,0101,1011    2 | 0 2 0 0 | 4
  P-0111,0011,0100,1011    2 | 0 2 0 0 | 1
  P-0111,0011,0101,1011 is a zero-tree

Moving to P-aaaa,bbbb,xxxx,yyyy where aaaa=1111, the only combinations needed:

  P1,1111    3 | 0 0 0 3 | 0
  P2,1010    4 | 0 0 0 4 |
  P3,0100    6 | 0 2 0 4 | 1
  P4,1011   11 | 3 4 0 4 | 1

  P-1111,1010,0100,1011    3 | 0 0 0 3 | 0

So we get the following tuples:

  P-0010,1011,1000,1111    3 | 0 0 3 0 | 3
  P-1010,1010,0100,1011    1 | 0 0 0 1 | 0
  P-1010,1011,1000,1111    1 | 0 0 1 0 | 3
  P-0011,0011,1000,1011    1 | 1 0 0 0 | 3
  P-0011,0011,1000,1111    1 | 1 0 0 0 | 1
  P-0011,0111,1000,1011    2 | 2 0 0 0 | 1
  P-0111,0010,0101,1011    2 | 0 2 0 0 | 4
  P-0111,0011,0100,1011    2 | 0 2 0 0 | 1
  P-1111,1010,0100,1011    3 | 0 0 0 3 | 0

Laying out the trees entirely in terms of counts (the leaf level is now written
as the explicit bit pattern of the mixed quadrant):

  P-0010,1011,1000,1111    3 | 0 0 3 0 | 1110
  P-1010,1010,0100,1011    1 | 0 0 0 1 | 1000
  P-1010,1011,1000,1111    1 | 0 0 1 0 | 0001
  P-0011,0011,1000,1011    1 | 1 0 0 0 | 0001
  P-0011,0011,1000,1111    1 | 1 0 0 0 | 0100
  P-0011,0111,1000,1011    2 | 2 0 0 0 | 1010
  P-0111,0010,0101,1011    2 | 0 2 0 0 | 0101
  P-0111,0011,0100,1011    2 | 0 2 0 0 | 1010
  P-1111,1010,0100,1011    3 | 0 0 0 3 | 0111

Partitioning based (heavily) on pixel distance for similarity (that is, weighted
heavily on the first attribute, x-y), we would proceed as follows (assuming,
e.g., we want 3 clusters, k=3):

(1) Ignore the counts and treat each level as a switch (yes/no).
    (Note that no counts are lost. When trees get sparse, this is a viable
    alternative data structure which would be much more space efficient.)

    PMTs:
      P-0010,1011,1000,1111    1 | 0 0 1 0 | 1110
      P-1010,1010,0100,1011    1 | 0 0 0 1 | 1000
      P-1010,1011,1000,1111    1 | 0 0 1 0 | 0001
      P-0011,0011,1000,1011    1 | 1 0 0 0 | 0001
      P-0011,0011,1000,1111    1 | 1 0 0 0 | 0100
      P-0011,0111,1000,1011    1 | 1 0 0 0 | 1010
      P-0111,0010,0101,1011    1 | 0 1 0 0 | 0101
      P-0111,0011,0100,1011    1 | 0 1 0 0 | 1010
      P-1111,1010,0100,1011    1 | 0 0 0 1 | 0111

    Initially, pick the first k (recall, k=3) tuples with non-intersecting L1
    switches as the initial clusters (and as the means of those clusters):

      P-0010,1011,1000,1111    1 | 0 0 1 0 | 1110
      P-1010,1010,0100,1011    1 | 0 0 0 1 | 1000
      P-0011,0011,1000,1011    1 | 1 0 0 0 | 0001

(2) repeat
(3) assign objects based on similarity, d(i,j) = #_L1_intersects
    (breaking ties with raster ordering priority):

      P-0010,1011,1000,1111    1 | 0 0 1 0 | 1110
      P-1010,1011,1000,1111    1 | 0 0 1 0 | 0001
      P-0111,0010,0101,1011    1 | 0 1 0 0 | 0101
      P-0111,0011,0100,1011    1 | 0 1 0 0 | 1010

      P-1010,1010,0100,1011    1 | 0 0 0 1 | 1000
      P-1111,1010,0100,1011    1 | 0 0 0 1 | 0111

      P-0011,0011,1000,1011    1 | 1 0 0 0 | 0001
      P-0011,0011,1000,1111    1 | 1 0 0 0 | 0100
      P-0011,0111,1000,1011    1 | 1 0 0 0 | 1010

How does this compare to the final decision tree partitioning from chapter 7?
(remembering that there are 5 distinct classes in this partition)

  B2 --.--- B2=0000 -> C2:0011
       |--- B2=0001 -> C2:0011
       |--- B2=0010 -> C3:0111
       |--- B2=0011 -> split on B3 (subtree below)
       |--- B2=0100 -> C2:0011
       |--- B2=0101 -> C2:0011
       |--- B2=0110 -> C2:0011
       |--- B2=0111 -> C2:0011
       |--- B2=1000 -> C2:0011
       |--- B2=1001 -> C2:0011
       |--- B2=1010 -> C5:1111
       |--- B2=1011 -> C1:0010
       |--- B2=1100 -> C2:0011
       |--- B2=1101 -> C2:0011
       |--- B2=1110 -> C2:0011
       `--- B2=1111 -> C2:0011

  Subtree for B2=0011:
       .-> B3:0000 -> C2:0011
       |-> B3:0001 -> C2:0011
       |-> B3:0010 -> C2:0011
       |-> B3:0011 -> C2:0011
       |-> B3:0100 -> C4:0111*
       |-> B3:0101 -> C2:0011
       |-> B3:0110 -> C2:0011
       |-> B3:0111 -> C2:0011
       |-> B3:1000 -> C2:0011
       |-> B3:1001 -> C2:0011
       |-> B3:1010 -> C2:0011
       |-> B3:1011 -> C2:0011
       |-> B3:1100 -> C2:0011
       |-> B3:1101 -> C2:0011
       |-> B3:1110 -> C2:0011
       `-> B3:1111 -> C2:0011

  P-0010,1011,1000,1111    1 | 0 0 1 0 | 1110   in C1
  P-1010,1011,1000,1111    1 | 0 0 1 0 | 0001   in C1
  P-0111,0010,0101,1011    1 | 0 1 0 0 | 0101   in C3
  P-0111,0011,0100,1011    1 | 0 1 0 0 | 1010   in C4
  P-1010,1010,0100,1011    1 | 0 0 0 1 | 1000   \ both are in C5
  P-1111,1010,0100,1011    1 | 0 0 0 1 | 0111   /
  P-0011,0011,1000,1011    1 | 1 0 0 0 | 0001   \
  P-0011,0011,1000,1111    1 | 1 0 0 0 | 0100    > all are in C2, which is the predominant class
  P-0011,0111,1000,1011    1 | 1 0 0 0 | 1010   /

(4) update means
    (the mean of the switch for partition-1 is 0 1 1 0; for the others, the
    switches all equal the mean, so there is no change)

No change, so the algorithm terminates.

  P-0010,1011,1000,1111    1 | 0 0 1 0 | 1110
  P-1010,1011,1000,1111    1 | 0 0 1 0 | 0001
  P-0111,0010,0101,1011    1 | 0 1 0 0 | 0101
  P-0111,0011,0100,1011    1 | 0 1 0 0 | 1010

  P-1010,1010,0100,1011    1 | 0 0 0 1 | 1000
  P-1111,1010,0100,1011    1 | 0 0 0 1 | 0111

  P-0011,0011,1000,1011    1 | 1 0 0 0 | 0001
  P-0011,0011,1000,1111    1 | 1 0 0 0 | 0100
  P-0011,0111,1000,1011    1 | 1 0 0 0 | 1010
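
For comparison with the P-tree walk-through, the plain k-means loop (steps (1) to (5) of the algorithm given earlier) can be sketched in Python as follows. This is a minimal sketch under assumptions that are not in the overheads: objects are numeric tuples, similarity is squared Euclidean distance to the cluster mean, the "arbitrary" initial centers are simply the first k objects, and the sample points are hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Return (assignment, centers) after iterating until no assignment changes."""
    # Step (1): assumption for this sketch, take the first k objects as centers.
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):                       # step (2): repeat
        # Step (3): assign each object to the cluster with the nearest mean.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:            # step (5): until no change
            break
        assignment = new_assignment
        # Step (4): update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                             # keep old center if cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


def squared_error(points, assignment, centers):
    """E = SUM over clusters of SUM over members of |p - mi|^2."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centers[c]))
        for p, c in zip(points, assignment)
    )


# Hypothetical 2-dimensional objects forming two obvious groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
assignment, centers = kmeans(points, k=2)
print(assignment, centers, squared_error(points, assignment, centers))
```

The result depends on the initial centers, which is one reason the overheads describe k-means as a heuristic rather than a globally optimal method.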