MATHEMATICAL CONCEPTS
Some of the mathematical concepts needed for this course are collected here.
The following diagram attempts to show the IS-A relationships among some of these concepts.
Cartesian Product
|
|
Relation
/ \
/ \
/ \
RelationExtension RelationIntension
/ \ / \
/ \ / \
/ \ Schema FunctionalDependency
/ \
/ \
Similarity BinaryRelation N-aryRelation
¦ \ / ¦ \ \
¦ \ / ¦ \ \
¦ Distance / ¦ \ \_____________________
¦ / ¦ \ \
¦ / ¦ \ \
¦ ReflexiveBinaryRelation ¦ 1-ManyRelationship Many-ManyRelation
¦ / \ ¦ \
¦ / \__ ¦ _ \
\ / ¦ \ \_____
\ / ¦ \ \
.-EquivalenceRelation---. ¦ .PartialOrderRelation-. \
| | ¦ | ¦ |______ \
Function(Classifier) | ¦ | ¦ | \ \
| UndirectedGraph ¦ | Lattice | Ontology
| | ¦ | | (ConceptHierarchy)
`-Partition(Clustering)-' ¦ `-DirectedAcyclicGraph' / \
¦ / / \
¦ / / \
¦ / PartOf IsA
Query / /
/ ¦ \ / /
/ ¦ \ ___/ /
/ ¦ \ / /
Select OLAP Join /
Project / \ /
\ / \ _____________________________________________/
\ / \ /
\ / \ /
Slice \ /
Dice Rollup
For purposes of these notes, we define (somewhat simplified):
A TUPLE on Domains (sets), A1..An, is an n-set (set with n elements),
{a1..an}, SuchThat : ForAll i, ai IsIn Ai
A RELATION, R(A1..An) on Domains, A1..An is a set of n-tuples on those domains.
One can model R as a subset of A1×..×An (Multidimensional Model), or
one can model R as a set of A1..An tuples (Horizontal Model), or
one can model R as any lossless collection of predicate maps from a key for the tuple set to {0,1} (Vertical Model).
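The three models can be sketched in Python as follows. This is only an illustration: the domains, the relation, and the helper name `rebuild` are hypothetical, and the vertical model shown here (one 0/1 map per attribute value) is just one of many possible lossless collections of predicate maps.

```python
from itertools import product

# Hypothetical domains and relation, chosen only for illustration.
A1 = ["x", "y"]
A2 = [0, 1, 2]

# Horizontal model: the relation is a set of tuples.
horizontal = {("x", 0), ("x", 2), ("y", 1)}

# Multidimensional model: a 0/1 indicator over every cell of A1 x A2.
multidimensional = {cell: cell in horizontal for cell in product(A1, A2)}

# Vertical model: assign each tuple a key, then keep one predicate map
# (key -> 0 or 1) per (attribute position, attribute value) pair.
keyed = dict(enumerate(sorted(horizontal)))  # key -> tuple
vertical = {
    (pos, value): {k: int(t[pos] == value) for k, t in keyed.items()}
    for pos, domain in enumerate([A1, A2])
    for value in domain
}

def rebuild(vertical, keys):
    """Recover the tuples from the predicate maps (shows they are lossless)."""
    tuples = set()
    for k in keys:
        row = {}
        for (pos, value), bits in vertical.items():
            if bits[k]:
                row[pos] = value
        tuples.add(tuple(row[p] for p in sorted(row)))
    return tuples
```

Calling `rebuild(vertical, keyed)` returns the same set of tuples as `horizontal`, which is what "lossless" means here.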
A BINARY RELATION, R(A,B), on {A,B} can be modeled as a bipartite graph, G = ( A DisjointUnion B , R )
(i.e., the node set is the disjoint union of A and B and the edge set is R).
It is usually more revealing to use a scatter plot (of (a,b) pairs in the A-B plane)
than to connect a on the A-line with b on the perpendicular B-line.
An N-ARY RELATION, R(A1..An), can be modelled as an n-partite graph, G=(A1 DisjointUnion ... DisjointUnion An, R)
(i.e., node set = DisjointUnion{A1..An} and edge set = R, where an edge is an n-sided polygon).
Again it is often more revealing to display the edges as scatter plot points in A1×..×An
instead of displaying them as convex hulls of axis positions (i.e., displaying edge (a1...an)
as the convex hull connecting axis position a1 on axis A1, ..., axis position an on axis An;
i.e., as the polyhedron connecting the n axis positions.)
A 3rd alternative is a slalom plot in which the n axes are positioned as side-by-side vertical lines
and the positions, ai on axes Ai are connected (i.e., each edge is represented by a line graph).
A FUNCTION, f, with domain D and range R, is a Relation on {D,R} SuchThat : ForAll d in D ThereExist r in R with (d,r) in f
and ForAll (d1,r1) and (d2,r2) in f, if d1 = d2 then r1 = r2.
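The two conditions (every domain element is related to something, and to only one thing) can be checked directly. A minimal Python sketch, assuming the relation is given as a set of (d, r) pairs; the name `is_function` is hypothetical:

```python
def is_function(f, D):
    """True if relation f (a set of (d, r) pairs) is a function on domain D."""
    image = {}
    for d, r in f:
        if d in image and image[d] != r:
            return False          # d is related to two different values
        image[d] = r
    return set(image) == set(D)   # every d in D is related to some r
```

For example, `is_function({(1, "a"), (2, "a")}, {1, 2})` holds, while adding the pair `(1, "b")` breaks it.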
A PARTIAL ORDER is a Reflexive Binary Relation, R, on S SuchThat
ForAll s in S, (s,s) IsIn R (reflexive)
if (s,t) and (t,s) AreIn R then s=t (antisymmetric)
if (s,t), (t,u) AreIn R then (s,u) IsIn R (transitive)
A LINEAR ORDER, R, on S is a Partial Order SuchThat ForAll s NotEqualTo t in S, (s,t) or (t,s) IsIn R
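The defining properties translate into direct checks on a finite relation. A minimal Python sketch, assuming R is a set of pairs over a finite set S; the function names and the example relations are hypothetical:

```python
def is_partial_order(R, S):
    """Check reflexivity, antisymmetry and transitivity of R on S."""
    reflexive = all((s, s) in R for s in S)
    antisymmetric = all(s == t for (s, t) in R if (t, s) in R)
    transitive = all(
        (s, u) in R for (s, t) in R for (t2, u) in R if t == t2
    )
    return reflexive and antisymmetric and transitive

def is_linear_order(R, S):
    """A partial order in which every pair of distinct elements is comparable."""
    total = all((s, t) in R or (t, s) in R for s in S for t in S if s != t)
    return is_partial_order(R, S) and total

S = {1, 2, 3, 6}
divides = {(s, t) for s in S for t in S if t % s == 0}   # s divides t
leq = {(s, t) for s in S for t in S if s <= t}           # usual ordering
```

Here `divides` is a partial order but not a linear order (2 and 3 are incomparable), while `leq` is both.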
A RESTRICTION, L'=(S',«'), of a partial order, L=(S,«), is a POset SuchThat
S' IsSubSetOf S and ForAll a,b in S', a «' b iff a « b
A PREFIX L' of L is a restriction SuchThat if a IsIn S' then all S-predecessors of a are in S'.
(Prefix closed under the predecessorship operation).
The definition says something about « and also about S'.
A SIMILARITY on a set, S, is a function, sim: S×S --> LOS (a Linearly Ordered Set) SuchThat :
sim(x,y) = sim(y,x) ForAll x,y in S and
sim(x,x) = MaxVal(LOS) ForAll x in S
An EQUIVALENCE RELATION is a Reflexive Binary Relation, R, on S SuchThat :
(s,s) IsIn R ForAll s in S (reflexive)
if (s1,s2) IsIn R then (s2,s1) IsIn R (symmetric)
if (s1,s2) and (s2,s3) AreIn R then (s1,s3) IsIn R (transitive)
Note that a transitive similarity with range {0,1} defines an equivalence relation via (s1,s2) IsIn R iff sim(s1,s2)=1.
(S,R) is a LATTICE if R is a partial Order on S and ForAll s1 NotEqualTo s2 in S ThereExist s3 in S SuchThat:
(s1,s3) IsIn R and (s2,s3) IsIn R (i.e., every pair has an upper bound)
A PARTITION, P={C1..Cn}, of S is a subset of 2^S = {s | s IsASubsetOf S} with
mutual exclusion ( ForAll i NotEqual j Ci Intersection Cj = empty ) and
collective exhaustion ( Union i=1..n Ci = S ).
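Both conditions are easy to test on finite sets. A minimal Python sketch, assuming P is given as a list of sets; the name `is_partition` is hypothetical:

```python
def is_partition(P, S):
    """True if the components in P are pairwise disjoint and cover S."""
    components = [set(c) for c in P]
    mutually_exclusive = all(
        components[i].isdisjoint(components[j])
        for i in range(len(components))
        for j in range(i + 1, len(components))
    )
    collectively_exhaustive = set().union(*components) == set(S)
    return mutually_exclusive and collectively_exhaustive
```

For S = {1,2,3,4,5}, the collection [{1,2},{3},{4,5}] passes, [{1,2},{2,3},{4,5}] fails mutual exclusion, and [{1,2},{4,5}] fails collective exhaustion.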
A LABELED_PARTITION is SuchThat : ForAll Ci ThereExist a label assigned to it (from some label space, L)
A PARTITION LATTICE of S is the lattice ordering of all Partitions of S under the
ordering of sub-partitions, where a sub-partition, Q, of a
partition, P, is SuchThat every component of Q is contained in a component of P (exactly one).
A CONCEPT HIERARCHY of an attribute is the Partition Lattice of that attribute
(any user defined concept hierarchy based on some domain knowledge,
is a sub-Lattice of this Lattice).
An ONTOLOGY is a controlled vocabulary and encyclopedia of all concept hierarchies
taken over an entire cohesive collection of related concepts.
A CLUSTERING is a good Partition from the Partition Lattice of R (good = SuchThat :
intra-component pairs are very similar and
inter-component pairs are very dissimilar).
CLASSIFICATION is assigning a class value to an unclassified tuple
based on the current state of a Training Set.
(It assumes Training Set is sufficiently developed to have all relationships expected.)
Usually a Model is developed from the Training Set to do the class label assignment, which
can be thought of as a function which relates non-class attribute values to class label values
(E.g., Nearest Neighbors Vote; Decision Tree Model, Bayesian Model; Neural Network Model...).
An EAGER CLASSIFIER (e.g., Decision Tree, Bayes...) builds a model to represent
the feature-to-class information found in the Training Set.
(Can be viewed as selecting a partition of T whose Component Class Histograms
are SuchThat : the maximal class is sufficiently more populous than the next highest).
A SAMPLE-BASED or LAZY CLASSIFIER (e.g., K-Nearest-Neighbor) builds no model ahead of time,
but the entire Training Set is employed for each classification.
(Can be viewed as finding a neighborhood around the unclassified tuple
SuchThat : the Class Histogram is sufficiently discriminatory.)
(Rings emanating out from the sample can be given decreasing vote values for a
better vote. (PINE method))
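A lazy classifier can be sketched as a plain K-Nearest-Neighbor vote. This is a minimal Python illustration, not the PINE method: it assumes numeric feature vectors, Euclidean distance, and equal votes, and the training data and the name `knn_classify` are hypothetical.

```python
import math
from collections import Counter

def knn_classify(training, sample, k=3):
    """Vote among the k training tuples closest to the sample.

    training: list of (feature_vector, class_label) pairs.
    No model is built ahead of time; the whole training set is scanned per call.
    """
    nearest = sorted(training, key=lambda row: math.dist(row[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two classes.
training = [
    ((0.0, 0.0), "red"),
    ((0.1, 0.2), "red"),
    ((1.0, 1.0), "blue"),
    ((0.9, 1.1), "blue"),
    ((1.2, 0.8), "blue"),
]
```

With this data, `knn_classify(training, (0.05, 0.1))` votes "red" and `knn_classify(training, (1.0, 0.9))` votes "blue". Weighting votes by ring, as the text suggests, would replace the equal-vote `Counter` with distance-dependent weights.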
RULE MINING is discovering good antecedent => consequent relationships in 2^T.
The Antecedent and Consequent sets can be defined as the TRUE SETS of tuple-predicates.
To reach the full generality of what rule mining encompasses, we allow the training table
to be rolled up to any level in any of its domain concept hierarchies (ontologies).
Thus, wrt the item-transaction terminology, itemsets are any collection of nodes,
at most one from any given concept hierarchy.
These nodes are sets of tuples (the set of tuples containing that node in that attribute).
Therefore the predicates are set-containment predicates wrt the original training table, and
an item is a product of domain subsets, at most one per domain.
In Market Basket Research, where every domain is {0,1}, usually, the only allowable
antecedent-consequent subsets are the {1} subsets.
Thus, we can fully specify an itemset by simply specifying the attributes (items) involved
(so it is a schema-level specification).
A Functional Dependency in the Training Table is an intension-level specification that, within
given disjoint antecedent and consequent attribute sets, exactly one 100%-confidence rule holds
ForAll antecedent items, not only for the training set (current state) but ForAll future training
states as well (intensional rule).
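Rule confidence and the extension-level check of a functional dependency can be sketched in Python as follows. The table, its attribute names, and the function names are hypothetical. Note that a check on the current state can only refute a functional dependency; it cannot establish that the dependency holds for all future states.

```python
def confidence(table, antecedent, consequent):
    """Fraction of tuples satisfying the antecedent that also satisfy the consequent.

    table: list of dicts; antecedent and consequent: dicts of attribute -> value.
    Returns None when no tuple satisfies the antecedent.
    """
    matches = [r for r in table if all(r[a] == v for a, v in antecedent.items())]
    if not matches:
        return None
    hits = [r for r in matches if all(r[c] == v for c, v in consequent.items())]
    return len(hits) / len(matches)

def fd_holds(table, lhs, rhs):
    """True if, in this state, equal lhs values always imply equal rhs values."""
    seen = {}
    for row in table:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical training table (current state).
table = [
    {"dept": "d1", "building": "b1", "bread": 1, "milk": 1},
    {"dept": "d1", "building": "b1", "bread": 1, "milk": 0},
    {"dept": "d2", "building": "b2", "bread": 0, "milk": 1},
    {"dept": "d3", "building": "b1", "bread": 1, "milk": 1},
]
```

In this state the rule bread=1 => milk=1 has confidence 2/3, `dept` determines `building`, and `building` does not determine `dept`.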
GRAPHS
An UNDIRECTED GRAPH is a pair, G = (N,E) SuchThat :
N = set of nodes
E = set of 2-sets of nodes called edges.
Note that this definition does not allow a reflexive edge, (x,x).
If one wishes to allow such loop edges one needs to replace 2-set with 2-bag.
We diagram {n1,n2} in E as n1 -- n2
A PATH in G, (n1,n2..nk) is a sequence of nodes SuchThat : {ni,ni+1} in E ForAll i=1..k-1
A Graph is CONNECTED if there is a path connecting every pair of nodes.
G'= (N',E') is a SUBGRAPH of G if N' IsContainedIn N and E' IsContainedIn E.
G'= (N',E') is the SUBGRAPH of G INDUCED BY N' if N' IsContainedIn N and E' = { {s,d} in E | s and d in N' }
G'= (N',E') is the CONTRACTION of G INDUCED BY Partition, N', of N if
(where [n] denotes the partition component containing n)
E' = { {[n1],[n2]} | {n1,n2} in E and [n1] NotEqualTo [n2] in N' }
A PARTITION of G is a set of subgraphs of G, G1=(N1,E1), G2=(N2,E2) ... Gk=(Nk,Ek),
where Ei IsContainedIn E and DisjointUnionOf{Ni} = N.
Each Gi is called a COMPONENT of the partition.
A Graph G = (N,E)
a b
O----------------O
/\ /|
/ \ / |
/ O g / |
/ |\ / |
/ | \ / |
/ | \ / |
/ | \ / |
/ | \/ |
O----------O O--------O
f e d c
N = { a,b,c,d,e,f,g }
E = { {a,b},{a,f},{a,g},{b,c},{b,d},{c,d},{d,g},{e,g},{e,f} }
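The example graph can be written down directly as sets, with each edge a 2-set. The following Python sketch checks connectedness and builds an induced subgraph; the function names are hypothetical.

```python
# The example graph: nodes a..g and edges as 2-sets (frozensets).
N = set("abcdefg")
E = {frozenset(e) for e in ["ab", "af", "ag", "bc", "bd", "cd", "dg", "eg", "ef"]}

def neighbors(n, E):
    """Nodes sharing an edge with n."""
    return {m for e in E if n in e for m in e if m != n}

def is_connected(N, E):
    """True if every node is reachable from an arbitrary start node."""
    if not N:
        return True
    start = next(iter(N))
    seen, frontier = {start}, [start]
    while frontier:
        n = frontier.pop()
        for m in neighbors(n, E) - seen:
            seen.add(m)
            frontier.append(m)
    return seen == set(N)

def induced_subgraph(N_sub, E):
    """Keep exactly the edges of E whose endpoints both lie in N_sub."""
    N_sub = set(N_sub)
    return N_sub, {e for e in E if e <= N_sub}
```

Here `is_connected(N, E)` holds, and the subgraph induced by {a,b,d,g} has edges {a,b}, {a,g}, {b,d} and {d,g}.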
Unconnected Graph
b f
O O
|\ |
| \ |
| \ |
| O c |
| / |
| / O
O e
d
A HYPERGRAPH is a pair, G=(N,E), of disjoint sets, where E is a set of nonempty subsets of N.
A DIRECTED GRAPH (or digraph) is a graph, G=(N,E), SuchThat : ThereExist maps Init: E --> N and Term: E --> N
(e in E is directed from Init(e) to Term(e)).
The sequence, n1..nk, is a PATH from n1 to nk if (ni, ni+1) in E for i=1..k-1
A Path, n1..nk, is
TRIVIAL if ni=nj ForAll i,j=1,...,k
SIMPLE if all ni are distinct, except possibly n1 and nk.
A CYCLE is a simple nontrivial path with n1=nk (simple closed path).
A cycle is MINIMAL if, for nodes ni and nj in the cycle and edge (ni,nj) in E,
(ni,nj) is in the cycle.
A Digraph
a b
O-------------- >O
^^ /|
/ \ / |
/ O g / |
/ ^^ / |
/ | \ / |
/ | \ / |
/ | \ / |
/ | \V V
O< --------O O< ------O
f e d c
N = { a,b,c,d,e,f,g }
E = { (a,b),(b,c),(c,d),(d,g),(b,d),(e,g),(e,f),(f,a),(g,a) }
A DIRECTED ACYCLIC GRAPH (DAG) is a digraph containing no cycles.
SOURCE = node with outgoing but no incoming edges.
SINK = node with incoming but no outgoing edges.
ROOTED DAG = dag with unique source.
If (a,b) in E, a is called a PARENT of b and b is called a CHILD of a.
For a path from a to b, a is an ANCESTOR of b
and b is a DESCENDANT of a.
If b NotEqualTo a, a is a PROPER ANCESTOR of b and b is a PROPER DESCENDANT of a.
If b NotEqualTo a, and neither is an ancestor of the other, then they are UNRELATED.
A TOPOLOGICAL SORT of a digraph, G, is a sequence of all of the nodes of G SuchThat
if a appears before b in the sequence, then there is no path from b to a.
Proposition A.1: A digraph can be topologically sorted iff it is a dag.
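Proposition A.1 can be illustrated with a standard in-degree based sort (often called Kahn's algorithm), which either produces a topological order or detects that none exists. This is a minimal Python sketch; the function name is hypothetical, and the test digraph is the example digraph given earlier, which contains the cycle a, b, d, g, a.

```python
from collections import deque

def topological_sort(N, E):
    """Return a topological order of the digraph (N, E), or None if it has a cycle."""
    indegree = {n: 0 for n in N}
    for _, b in E:
        indegree[b] += 1
    # Start from the sources; sorting only makes the output deterministic.
    ready = deque(sorted(n for n in N if indegree[n] == 0))
    order = []
    while ready:
        a = ready.popleft()
        order.append(a)
        for x, b in sorted(E):
            if x == a:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    # Nodes on a cycle never reach in-degree 0, so they are never emitted.
    return order if len(order) == len(N) else None

N = set("abcdefg")
E = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "g"), ("b", "d"),
     ("e", "g"), ("e", "f"), ("f", "a"), ("g", "a")}
```

`topological_sort(N, E)` returns `None` because of the cycle. Removing the edge (g,a) leaves a dag, and the sort then yields e, f, a, b, c, d, g.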
TRANSITIVE CLOSURE: if G=(N,E) is a digraph and G+=(N,E') is SuchThat
(a,b) in E' iff there is a nontrivial path from a to b in G then G+ is a transitive closure of G.
Proposition A1.5: ForAll G there is one and only one transitive closure
i.e., G --> G+ is a function.
The transitive closure is obtained by including in E',
(ni,nj) for every non-trivial path in G from ni to nj.
Proposition A.2: G+ is a dag iff G is a dag.
A dag is TRANSITIVELY CLOSED iff G=G+.
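For a finite digraph the closure can be computed with Warshall's algorithm. A minimal Python sketch, assuming edges are given as a set of ordered pairs; the function name is hypothetical:

```python
def transitive_closure(N, E):
    """Return E' with (a, b) in E' iff there is a nontrivial path from a to b."""
    closure = set(E)
    for k in N:                      # allow k as an intermediate node
        for a in N:
            if (a, k) in closure:
                for b in N:
                    if (k, b) in closure:
                        closure.add((a, b))
    return closure
```

For the chain 1 -> 2 -> 3 -> 4 the closure adds (1,3), (2,4) and (1,4). For a digraph with a cycle, such as {(1,2),(2,1)}, the closure contains (1,1) and (2,2), which is consistent with Proposition A.2.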
A TREE is a rooted dag with unique path from root to each node.
A Partial order (PO) is: ( S, < ) where S is called the DOMAIN, and the binary relation, LessThan, is
irreflexive: a NotLessThan a
transitive: if a LessThan b and b LessThan c then a LessThan c
A Partial Order L'=(S',LessThan') is a RESTRICTION of L = ( S, LessThan ) to domain S'
if S' SubsetOf S and
if ForAll a,b in S', a LessThan' b iff a LessThan b
L' is a PREFIX of L (L' LessThanOrEqual L) if L' is a restriction and, if a IsIn S', then all S-predecessors of a are in S'.
A Prefix is closed under the "predecessorship" operation.
The definition says something about LessThan' and also S'.
Often in a DAG we only show non-transitive edges,
i.e., if a « b and b « c we do not include the edge (a,c).
Note that, given a POSet, L, we often diagram it as a DAG using the duality just described.
However, since a diagram is intended to help visualization, and therefore should not be cluttered,
we generally do not include all edges (We don't display the closure - in fact, usually we
display the minimal dag that corresponds to the POSet).
QUESTION: Is it correct to say THE minimal dag here?
Is this theorem true? Given a dag, G=(S,E), ThereExistUnique subdag G'=(S,E') SuchThat
C(G') = C(G) and if G''=(S,E'') is a subdag of G' SuchThat C(G'') = C(G) then G'' = G'.
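For finite dags the statement does appear to hold: the minimal subdag is usually called the transitive reduction, and it consists of exactly those edges (a,b) for which no longer path from a to b exists. A minimal Python sketch under that assumption (the function names are hypothetical, and the input is assumed to be a finite dag; for digraphs with cycles the minimal subgraph need not be unique):

```python
def transitive_closure(N, E):
    """Warshall's algorithm: (a, b) in result iff a nontrivial path a -> b exists."""
    closure = set(E)
    for k in N:
        for a in N:
            if (a, k) in closure:
                for b in N:
                    if (k, b) in closure:
                        closure.add((a, b))
    return closure

def transitive_reduction(N, E):
    """Drop every edge (a, b) of a dag that is implied by a longer path a -> c -> b."""
    closure = transitive_closure(N, E)
    return {
        (a, b)
        for (a, b) in E
        if not any((a, c) in closure and (c, b) in closure for c in N)
    }
```

For E = {(1,2),(2,3),(1,3)} the reduction is {(1,2),(2,3)}, and its closure equals the closure of E.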
DUALITIES:
PARTITION IsEquivalentTo FUNCTION IsEquivalentTo
EQUIVALENCE RELATION IsEquivalentTo UNDIRECTED GRAPH
Assuming a Partition has uniquely labeled components
(that would be required for unambiguous reference).
The Partition-Induced Function takes a point to the label of its component.
The Function-Induced Equivalence Relation equates a pair iff they map to the same value.
The Equivalence Relation-Induced Undirected Graph has an edge for each equivalent pair.
The Undirected Graph-Induced Partition is its connectivity component partition.
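The four induced constructions can be chained to show that they return to the starting partition. A minimal Python sketch; the labeled partition and the helper name `connectivity_components` are hypothetical, and reflexive pairs are dropped when building the graph because edges are 2-sets.

```python
# Hypothetical labeled partition of S = {1, 2, 3, 4, 5}.
partition = {"low": {1, 2}, "mid": {3}, "high": {4, 5}}

# Partition-induced function: each point maps to the label of its component.
function = {x: label for label, comp in partition.items() for x in comp}
S = set(function)

# Function-induced equivalence relation: pairs mapping to the same value.
equivalence = {(x, y) for x in S for y in S if function[x] == function[y]}

# Equivalence relation-induced undirected graph: one edge per equivalent pair.
graph_edges = {frozenset((x, y)) for (x, y) in equivalence if x != y}

def connectivity_components(N, E):
    """Undirected graph-induced partition: its connectivity components."""
    remaining, components = set(N), []
    while remaining:
        start = remaining.pop()
        comp, frontier = {start}, [start]
        while frontier:
            n = frontier.pop()
            for e in E:
                if n in e:
                    for m in e - comp:
                        comp.add(m)
                        frontier.append(m)
        remaining -= comp
        components.append(comp)
    return components

components = connectivity_components(S, graph_edges)
```

The resulting `components` are {1,2}, {3} and {4,5}, the components of the original partition (the labels themselves are not carried through the graph).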
PARTIAL ORDERED SET IsEquivalentTo DIRECTED ACYCLIC GRAPH
The Directed Acyclic Graph-Induced Partially Ordered Set contains ( s1, s2 ) iff it is an edge in the closure.
The Partially Ordered Set-Induced Directed Acyclic Graph contains ( s1, s2 ) iff it is in the POSET.