A UNIFIED THEORY OF DATA MINING BASED ON RELATIONSHIPS AND THEIR GRAPHS


All Data Mining can be unified under the concept of relationship analysis

             (graph analysis is a dual concept).

             Horizontal vs. Vertical implementation is an orthogonal issue
                        (e.g., involved with performance, scalability, etc)







 DEGREE=2 UNIPARTITE relationships (between entity, N, and itself) can be modelled as:



 EdgeSet of G=(N,E):     E(N,N)   ={{ek,1, ek,2}| ek,1,ek,2∈N, k=1,…,|E|}



 Index on E(N,N):        E(N,Nset)={(n,Nsetn)|n∈N, Nsetn≡Set of nodes related to n}

  

 BitMapIndex on E(N,N):  E(N,Nmap)={(n,Nmapn)|n∈N, Nmapn≡Map of nodes related to n, so
                                                   Nmapn(k)=1 iff {k,f(k)}∈E }  where
                                                   f:{1…|N|}→N assigns positions to nodes.
    



          E.g., G=(N,E)   with N={n1,n2,n3,n4} and

          E(N,N)   (N,  N)
                    n1  n2
                    n1  n3
                    n1  n4
                    n2  n1
                    n2  n2
                    n2  n4
                    n3  n1
                    n3  n4
                    n4  n1
                    n4  n2
                    n4  n3

then


          E(N,ES)  (N,  Nset)
                    n1  {n2, n3, n4}
                    n2  {n1, n2, n4}
                    n3  {n1, n4}
                    n4  {n1, n2, n3}


          E(N,EM)  (N,  Nmap)
                    n1  0111
                    n2  1101
                    n3  1001
                    n4  1110





Using a HORIZONTAL APPROACH, one implements either E(N,Nset) or E(N,Nmap)
                   as a file of horizontal records (as shown pictorially above).

                   (usually E(N,EM) so that it is 1NF)




Using a VERTICAL   APPROACH one implements E(N,Nset)
                   as a set of vertical bit vectors by using some bit encoding

                   (standard encoding is just "bit slice encoding")
                   (bit vectors can be compressed into P-trees of dim 1,2,3..)
                   (0-dim Ptrees refers to the uncompressed bit vectors)








Given
          E(N,EM)  (N,  Nmap)
                    n1  0111
                    n2  1101
                    n3  1001
                    n4  1110


The BASIC 0-DIMENSIONAL P-TREES are:

         n1-slice,       n2-slice,       n3-slice,       n4-slice
             0               1               1               1
             1               1               0               1
             1               0               0               1
             1               0               1               0


The BASIC 1-DIMENSIONAL P-TREES are (tree link arrows are implied but not shown):

         n1-slice,       n2-slice,       n3-slice,       n4-slice
             0               0               0               0
          0     1         1     0         0     0         1     0
         0 1                             1 0   0 1             1 0



The BASIC 2-DIM P-TREES are:    (Note: also 3,4,5.. -DIM P-TREES (not fully populated yet)

         n1-slice,       n2-slice,       n3-slice,       n4-slice
             0               0               0               0
          0 1 1 1         1 1 0 0         1 0 0 1         1 1 1 0








 DEGREE=2 BIPARTITE RELATIONSHIP e.g. between T and I where N=T ∪! I

 5 representations with more efficient position mappings:
                                                          fI:{1,…,|I|}→I,
                                                          fT:{1,…,|T|}→T


 E(T,I)   = { {t,i} | t∈T and i∈I and {t,i}∈E}


 E(T,Iset)= { (t, Isett) | t∈T}  Isett≡{i|{t,i}∈E             set of is related to t

 E(T,Imap)= { (t, Imapt) | t∈T}  Imapt(k)=1 iff {t,fI(k)}∈E   map of is related to t


 E(I,Tset)= { (i, Tseti) | i∈I}  Tseti≡{t|{t,i}∈E             set of ts related to i

 E(I,Tmap)= { (i, Tmapi) | i∈I}  Tmapi(k)=1 iff {fT(k),i}∈E   map of ts related to i









The graph definitions above can be extended to relationships of degree > 2, but,
 we can already unify most of Data Mining with degree=2 graphs:









A Degree=2 UNIPARTITE RELATIONSHIP on N ∋:
                                           reflexive  (x,x)∈E ∀x∈N
                                           symmetric  (x,y)∈E ⇒ (y,x)∈E
                                           transitive (x,y), (y,z)∈E ⇒ (x,z)∈E
is an EQUIVALENCE RELATION




    A CLUSTERING) or PARTITION is a dual formulation of an equivalence relation
                  The partition, {Ci}, into equivalence classes is a clustering and vice versa.

                  A LABEL FUNCTION, L:{Ci}→Labels is also dual  (Cis are labeled by IDs)
                  The Pre-image partition, {L-1(Lk)}, is a clustering and vice versa.



      CLASSIFYING s ∈ S(A1..An) using training set, R(A1..An,L), is identifying the
                  best  R→R[L] pre-image cluster for s based on R.













 Degree=2 UNIPARTITE RELATIONSHIP on N

 which also satisfies

                            reflexivity   ( (x,x)∈E ∀x∈N )
                            anti-symmetry ( (x,y)∈E & (y,x)∈E ⇒ y=x )
                            transitivity  ( (x,y), (y,z)∈E ⇒ (x,z)∈E )

 is a 

 PARTIAL ORDERING (A dual formulation is a directed unipartite graph on N).








A Degree=2 BIPARTITE RELATIONSHIP on N = T ∪! I generates an


           I-Association Rule (I-AR), A⇒C, ∀A,C ⊆I, A∩C=∅ (disjoint Isets).

           T-Association Rule (T-AR), A⇒C, ∀A,C ⊆T, A∩C=∅ (disjoint Tsets).





An I-AR, A⇒C  T-frequent   iff  T-SUPPORT(A∪C)                ≥ MINSUPP

An I-AR, A⇒C  T-condfident iff  T-SUPPORT(A∪C) / T-SUPPORT(A) ≥ MINCONF


        (where T-SUPPORT(A)  ≡ |{t|(i,t)∈E ∀i∈A}|
         and T-SUPPORT(A∪C) ≡ |{t|(i,t)∈E ∀i∈A∪C}|
         and MINSUPP and MINCONF are user chosen)



A T-AR, A⇒C  I-frequent   iff  I-SUPPORT(A∪C)                ≥ MINSUPP

A T-AR, A⇒C  I-condfident iff  I-SUPPORT(A∪C) / I-SUPPORT(A) ≥ MINCONF





Any graph G=(N,E) has both Node labels and Edge Labels if only their identifiers

In general we assume node and edge labels are structures (as complex as needed to capture the semantics).



A distance function, d, on N can be modelled as a non-negative real valued edge label function
on the graph (N,E) with E=NxN, subject to the conditions:

 positive definite   d(x,y) ≥ 0     d(x,y)=0 iff x=y

 symmetric           d(x,y) = d(y,x)

 triangle inequality d(x,y) + d(y,z) ≥ d(x,z)







A similarity function, s, on N measures closeness rather than distance.


   Hamming Distance on a Boolean Table, R(A1..An) is dH(x,y)=|{i|xi≠yi}|
                    (count of bit positions where x and y differ)

   Hamming Similarity on a Boolean Table, R(A1..An) is dH(x,y)=|{i|xi=yi}|
                    (count of bit positions where x and y are the same)







 DEGREE=2 BIPARTITE EDGE LABELLED GRAPH with edge label function, l:E→EL   (let lt,i≡ l(t,i) 


 E(T,I,EL)   = { (t,i,lt,i) |  {t,i}∈ E


 E(T,I-ELset) = { (t,I-ELsett) | t∈T }   I-ELsett≡ set of (i,lt,i) ∋: {i,t}∈E

 E(T,I-ELmap) = { (t,I-ELmapt) | t∈T }   I-ELmapt(k,b)=1 iff {t,fI(k)}∈E and the 2b-bit of l{t,fI(k)}=1




 E(T,I-ELset) = { (i,I-ELseti) | i∈I }   I-ELseti≡ set of (t,lt,i) ∋: {i,t}∈E

 E(I,T-ELmap) = { (i,T-ELmapi) | t∈I }   T-ELmapi(k,b)=1 iff {fT(k),i}∈E and the 2b-bit of l{fT(k),i}=1







One can define, similarly,

 DEGREE=2 BIPARTITE NODE LABELLED GRAPH with node label functions, lT:E→TL and lI:E→IL

    ( let lTt≡ lT(t)  and   lIi≡ lI(i) )







The following short discussion on Petri Networks, points out that Petri Nets
are Directed Degree=2 Bipartite Graphs with fairly complex label structures.
(This is taken from WikipediA Wikipedia_Petri_Nets. More info at this site).

A potential paper topic: user ARM to find need information in a Petri Net
application.  Is there a set of important applications in which ARM gives you
good results much faster (better?) than the developed Petri Net software
(see the above URL for a list of some of that software)? 

A Petri net (also known as a place/transition net or P/T net) is one of several
 mathematical representations of discrete distributed systems. As a modeling
 language, it graphically depicts the structure of a distributed system as a
 directed bipartite graph with annotations. As such, a Petri net has place
 nodes, transition nodes, and directed arcs connecting places with transitions.

At any one time during a Petri net's execution, each place can hold zero or
 more tokens. Unlike more traditional data processing systems that can process
 only a single stream of incoming tokens, Petri net transitions can consume
 tokens from multiple input places, act on them, and output tokens to multiple
 output places. Before acting on input tokens, a transition waits until the
 following two conditions are met:

 * (i) a required number of tokens appears in every one of its input places, and
 * (ii) number of tokens in each of its output places is below some threshold.

Transitions act on input tokens by a process known as firing.
 When a transition fires, it consumes the tokens from its input places,
 performs some processing task, and places a specified number of tokens into
 each of its output places. It does this atomically, namely in one single
 non-preemptible step. Since more than one transition on a net can be firing
 at any one time, Petri nets are well suited for modeling concurrent behavior
 of a (geographically) distributed system.











SQL Queries in a high level language such as SQL are processed
by Horizontal DBMSs in the following steps:





1. SCAN and PARSE (SCANNER-PARSER): Scanner identifies the tokens
                                           or language elements,
                                    Parser check for syntax or
                                           grammar validity





2. VALIDATED:  Validator checks for valid names and
               semantic correctness





3. CONVERTER converts to an internal representation (usually a QUERY TREE)

             For example, given the database:
             _S______________   _C___________      _E______
            |S#|SNAME |LCODE | |C#|CNAME|SITE|    |S#|C#|GR| 
            |==|======|======| |==|=====|====|    |==|==|==|   
            |25|CLAY  |NJ5101| |8 |DSDE |ND  |    |32|8 |89|  
            |32|THAISZ|NJ5102| |7 |CUS  |ND  |    |32|7 |91| 
            |38|GOOD  |FL6321| |6 |3UA  |NJ  |    |25|7 |68|
            |17|BAID  |NY2091| |5 |3UA  |ND  |    |25|6 |76|    
            |57|BROWN |NY2092|                    |32|6 |62|   



            the non-procedural SQL request:

            SELECT S.SNAME, C.CNAME, E.GR   FROM   S,C,E 
            WHERE  S.LCODE=NJ5101 and C.SITE="ND" and E.GR=68 and
                   S.S#=E.S# and C.C#=E.C#;

            gets SCANNED, PARSED, VALIDATED and then could get
            CONVERTED to a query tree following WHERE-clause sequencing:

                      M=PROJ(L)[SNAME,CNAME,GR]
                      |
                      L=SELECT(K.GR=68)
                      |
                      K=SELECT(H.SITE="ND")
                      |
                      H=SELECT(G.LCODE="NJ5101")
                      |
                      G=JOIN(F.C#=C.C#)
                     /\
                    /  \
    JOIN(S.S#=E.S#)=F   C
                   /\
                  /  \
                 S    E

This would be the simplest CONVERTER (simply uses the ordering in WHERE clause)





Let's see results at each step (execution starts at the bottom of the tree)

              M=PROJ(L)[SNAME,CNAME,GR]_____________
              |                     |SNAME |CNAME|GR|
              |                     |CLAY  |CUS  |68|
              |
              L=SELECT(K.GR=68)_____________________________
              |            |S#|SNAME |LCODE |C#|GR|CNAME|SITE 
              |            |25|CLAY  |NJ5101|7 |68|CUS  |ND |   
              |
              K=SELECT(H.SITE="ND")_________________________
              |            |S#|SNAME |LCODE |C#|GR|CNAME|SITE 
              |            |25|CLAY  |NJ5101|7 |68|CUS  |ND |   
              |
              H=SELECT(G.LCODE="NJ5101")____________________
              |            |S#|SNAME |LCODE |C#|GR|CNAME|SITE 
              |            |25|CLAY  |NJ5101|7 |68|CUS  |ND |   
              |            |25|CLAY  |NJ5101|6 |76|3UA  |NJ |  
              |
              |
              G=JOIN(F.C#=C.C#)_______________________________ 
             / \             |S#|SNAME |LCODE |C#|GR|CNAME|SITE 
            /   \            |25|CLAY  |NJ5101|7 |68|CUS  |ND |   
           / ____C_______    |25|CLAY  |NJ5101|6 |76|3UA  |NJ |  
          / |C#|CNAME|SITE   |32|THAISZ|NJ5102|8 |89|DSDE |ND |
          | |8 |DSDE |ND |   |32|THAISZ|NJ5102|7 |91|CUS  |ND |   
          | |7 |CUS  |ND |   |32|THAISZ|NJ5102|6 |62|3UA  |NJ |  
          | |6 |3UA  |NJ |  
          | |5 |3UA  |ND | 
          \         
           \
            \
             \
              \
               \
                \
                 \
                  \
                   F=JOIN(S.S#=E.S#)______________________
                  / \              |S#|SNAME |LCODE |C#|GR|                     
                 /   \             |25|CLAY  |NJ5101|7 |68|                     
                /     \            |25|CLAY  |NJ5101|6 |76|                     
     __________S_____   __E_____   |32|THAISZ|NJ5102|8 |89|                     
     S#|SNAME |LCODE | |S#|C#|GR|  |32|THAISZ|NJ5102|7 |91|                     
     25|CLAY  |NJ5101| |32|8 |89|  |32|THAISZ|NJ5102|6 |62|                     
     32|THAISZ|NJ5102| |32|7 |91| 
     38|GOOD  |FL6321| |25|7 |68|
     17|BAID  |NY2091| |25|6 |76|    
     57|BROWN |NY2092| |32|6 |62|   



4. OPTIMIZED: Query Optimzier devises a stategy for executing query
              (chooses among alternative Query trees).






              Is the query tree above efficient (optimal)?

              Is the following tree better?


                 M=PROJ(G)[SNAME,CNAME,GR]
                 |        |CLAY |CUS  |68|
                 |
                 |
                 |
                 G=JOIN(F.C#=K.C#)
                 /\           |S#|SNAME|LCODE |C#|GR|CNAME|SITE|
                |  \          |25|CLAY |NJ5101|7 |68|CUS  |ND  |
                 \  \
                  \  \
                   \  `----------------.
                    \                   \
                     \                   \
                      \                   \
                       \                   \
                        \                   \
 _______JOIN(H.S#=L.S#)= F                   \
|S#|SNAME|LCODE |C#|GR| /\                    \
|25|CLAY |NJ5101|7 |68 /  \                    \
                      /    \                    \
.--------------------'      \                    \       
H=SELECT(S.LCODE=NJ5101)     L=SELECT(E.GR=68)   K=SELECT(C.SITE=ND)
:  |S#|SNAME |LCODE |        :  |S#|C#|GR|       :  |C#|CNAME|SITE|
:  |25|CLAY  |NJ5101|        :  |25|7 |68|       :  |8 |DSDE |ND  |
:                            :                   :  |7 |CUS  |ND  |          
:                            :                   :  |5 |3UA  |ND  |
:                            :                   :
S________________       _____E__           ______C______
|S#|SNAME |LCODE |     |S#|C#|GR|         |C#|CNAME|SITE|
|25|CLAY  |NJ5101|     |32|8 |89|         |8 |DSDE |ND  |
|32|THAISZ|NJ5102|     |32|7 |91|         |7 |CUS  |ND  |
|38|GOOD  |FL6321|     |25|7 |68|         |6 |3UA  |NJ  |
|17|BAID  |NY2091|     |25|6 |76|         |5 |3UA  |ND  |
|57|BROWN |NY2092|     |32|6 |62|   
                      






Note that the following could be done:

i.   The SITE attribute can be projected from K (doesn't require
     elimination of duplicates because it is not part of the key)

ii.  The LCODE attrib can be projected off of H (doesn't require
     elimination of duplicates because it is not part of the key)

iii. S# could be projected off of F (it is part of the key but duplicate
     elimination could be deferred until M since it will have to be done again
     there anyway - thus this projection can be a "non duplicate-eliminating"
     projection also (which we will denote by [[  ]]).  [[ ]]-projections take
     no time, whereas duplicate eliminating projections take a lot of time).

iv.  C# can be (non-duplicate-eliminating) projected off of G


(note: this projection is reordering attrs and eliminating duplicates, if any)


                                  M=PROJ(G)[SNAME,CNAME,GR]
                                  |        |CLAY |CUS  |68|
                                  |
                                  |
                                  |
                                  G=JOIN(F.C#=K.C#)
                                 / \          |SNAME|GR|CNAME|
                                /   \         |CLAY |68|CUS  |
                               /     \        |CLAY |68|CUS  |
JOIN(H.S#=L.S#)[[SNAME,C#,GR]]=F      \
|SNAME|C#|GR|                 / \      \
|CLAY |7 |68|                /   \      `--------.
                            /     \               \
                           /       \               \
.-------------------------'         \               \
H=SEL(S.LCODE="NJ5101"[[S#|SNAME]]   >               \
:    |S#|SNAME |                    /                 \
:    |25|CLAY  |          _________/                   \
:                        /                             \
:                       /                               \
:                      /                                 \
:         L=SELECT(E.GR=68)         K=SELECT(C.SITE="ND")[[C#,CNAME]]
:            |S#|C#|GR|  :             :     |C#|CNAME|
:            |25|7 |68|  :             :     |8 |DSDE |
:                        :             :     |7 |CUS  |
:                        :             :     |5 |3UA  |
:                        :             : 
S________________   _____E__        ___C_________
|S#|SNAME |LCODE | |S#|C#|GR|      |C#|CNAME|SITE|
|25|CLAY  |NJ5101| |32|8 |89|      |8 |DSDE |ND  |
|32|THAISZ|NJ5102| |32|7 |91|      |7 |CUS  |ND  |
|38|GOOD  |FL6321| |25|7 |68|      |6 |3UA  |NJ  |
|17|BAID  |NY2091| |25|6 |76|      |5 |3UA  |ND  |
|57|BROWN |NY2092| |32|6 |62|   

                      
What can be learned about query optimization from these examples?

GOOD RULES?

a. Do SELECTS first (push to the bottom of the tree)

b. Do attribute elimination part of PROJECT as soon as possible (push down)

c. Only do duplicate elimination once
        (at top-most PROJECT only or in conjunction with a latter join step)





QUERY OPTIMIZATION, then, is finding an efficient strategy to implement
 query requests (Automatically, Heuristically, not necessarily optimally)




 Note: In lower level languages, the user does the query optimization



  - by writing the procedural code to specify all steps and order those steps.
 
        (of course there are optimizing compilers that will automatically alter
        your "procedures", but still you are mostly responsible for ordering).



  - Relational queries are issued at a high level (SQL or ODBC),
    so that system has maximal oportunity to optimize them.



 HEURISTIC RULES are used to re-order the (e.g., WHERE clause ordered) query tree.

                  (e.g., RULES a. b. and c. above)

                  Some rules depend upon size and complexity estimates.




 ESTIMATION estimates the cost of different strategies and chooses best.

 The Challenge: Get acceptable performance (Note, took 10 years (1972 to 1982)
                to optimize join process acceptably so that the first viable
                Relational DBMSs could be successfully marketed).





The other steps are:



5. CODE GENERATION: generates code to implement each operator in the selected
                    query plan (the optimizer-selected the query tree).



6. RUNTIME DATABASE PROCESSORING: run plan code






The CODE GENERATION step:



IMPLEMENTING the OPERATORS listed in the query tree efficiently


There are many options for SELECT, PROJECT and JOIN implementation.



Some SELECTION implementations (a basic set):


S1. Linear search: sequentially search every record.  (required for selections
                   from an unordered relation with no index or access path)

SELECT C#, GRADE
FROM   ENROLL
WHERE  S# = 32 

.ENROLL_____. 
|S#|C#|GRADE|
|==|==|=====|
|32|8 | 89  | 
|32|7 | 91  |
|25|7 | 68  |
|25|6 | 76  |
|32|6 | 62  |
|38|6 | 98  |
|17|5 | 96  |
 ----------- 


S2. Binary search: (for selections on a clustered or ordered attribute)
                   SELECT C#, GRADE    FROM ENROLL    WHERE S# = 38 

  .ENROLL_____. 
  |S#|C#|GRADE|
  |==|==|=====|
 0|17|5 | 96  |
 1|25|7 | 68  |
 2|25|6 | 76  |
 3|32|8 | 89  | 1. Go half way (to rec #3), since target > 32,
 4|32|7 | 91  |    go half way down what's left (to record #5).
 5|34|6 | 62  | 2. Since target>34, go half way down (to rec #6)
 6|38|6 | 98  | 3. Target is a match.  Done.
   ----------- 




S3. Using indexes (or hash structures) for an equality comparison

                SELECT S#, NAME   FROM STUDENT   WHERE S# = 32 

.STUDENT_________.                           Index on S#
|S#|SNAME |LCODE | RRN's                     S#|RRN
|==|======|======|                           ==|===
|17|BAID  |NY2091| 0                         17| 0
|57|BROWN |NY2092| 1                         25| 2
|25|CLAY  |NJ5101| 2                         32| 3,4
|32|THAISZ|NJ5102| 3                         57| 1
|32|GOOD  |FL6321| 4
`----------------'




S4. Using primary index for an inequality comparison on a key (clustered).
                        (Find starting point with "=", then retrieve all
                         records beyond that point).
                         SELECT S#,NAME   FROM STUDENT   WHERE S# > 25 
.STUDENT_________.                             Primary Index on S#
|S#|SNAME |LCODE | RRN's                       S#|RRN
|==|======|======|                             ==|===
|17|BAID  |NY2091| 0                           17| 0
|25|CLAY  |NJ5101| 1                           25| 1
|32|THAISZ|NJ5102| 2                           32| 2
|38|GOOD  |FL6321| 3                           38| 3
|57|BROWN |NY2092| 4                           57| 4
`----------------'



S5. Using a clustering index for "=" comparison
                       SELECT C#, GRADE    FROM ENROLL    WHERE S# = 32 

  .ENROLL_____.        Clustering Index on ENROLL.S#
  |S#|C#|GRADE|        S#|RRN
  |==|==|=====|        ==|===
 0|17|5 | 96  |        17| 0
 1|25|7 | 68  |        25| 1
 2|25|6 | 76  |        32| 3
 3|32|8 | 89  |        38| 6
 4|32|7 | 91  |
 5|32|6 | 62  |
 6|38|6 | 98  |
   ----------- 




S6. Using a secondary B+-tree index:   For "=", use the index set.
          For ">", use the index set, then use the sequence set.

SELECT NAME,CITY    FROM STUDENT    WHERE S# = 25 
SELECT NAME,CITY    FROM STUDENT    WHERE S# > 25 
                   _______
                  |*32*38*|
                   |  |   \   
                  /   |    \_  
          _______/  __|____  \ _______
         |*20*__n| |n32*__n|  |*56*__n|  
          |  \         |       |  |_ 
          |   \        |       |    \   
          |    \       \        \    \____
       ___|_    \____   \_____   \_____   \_____
      |  |17|--|20|25|--|32|35|--|38|  |--|56|57|
      |__|2_|  |_5|_4|  |_1|_7|  |_3|__|  |_6|_0|
   .......:      :  :     :  :     :        :  :
...:.............:  :     :  :...  :        :  :
:  :  ..............:     :.... :  :        :  :
:  :  :   __________________  : :  :        :  :
:  :  : 0|57|BROWN |NY   |NY|.:.:..:........:..:
:  :  : 1|32|THAISZ|KNOB |NJ|.: :  :        :  
:  :..:.2|17|BAID  |NY   |NY|   :  :        :
:     : 3|38|GOOD  |GATER|FL|...:..:        :
:     :.4|25|CLAY  |OUTBK|NJ|   :           :
:.......5|20|JOB   |MRHD |MN|   :           :
        6|56|BURGUM|FARGO|ND|...:...........:
        7|35|BOYD  |FLAX |NE|...:


Each of S2 - S6 requires a special access path.








 SELECTION methods ANDed WHERE Clauses:


S7. Select an attribute involved in one simple condition (usually an "=")
    then check the other condition for each retrieved record.

         SELECT NAME,CITY    FROM STUDENT    WHERE S#>25 AND STATE=NE
                   _______
                  |*32*38*|
                   |  |   \   
                  /   |    \_  
          _______/  __|____  \ _______
         |*20*__n| |n32*__n|  |*56*__n|  
          |  \         |       |  |_ 
          |   \        |       |    \   
          |    \       \        \    \____
       ___|_    \____   \_____   \_____   \_____
      |  |17|--|20|25|--|32|35|--|38|  |--|56|57|
      |__|2_|  |_5|_4|  |_1|_7|  |_3|__|  |_6|_0|
   .......:      :  :     :  :     :        :  :
...:.............:  :     :  :...  :        :  : 2ary ST index
:  :  ..............:     :.... :  :        :  : ST|RRN(s)
:  :  :   __________________  : :  :        :  : ==|==
:  :  : 0|57|BROWN |NY   |NY|.:.:..:........:..: FL| 3
:  :  : 1|32|THAISZ|KNOB |NJ|.: :  :        :    MN| 5
:  :..:.2|17|BAID  |NY   |NY|   :  :        :    NE| 7
:     : 3|38|GOOD  |GATER|FL|...:..:        :    ND| 6
:     :.4|25|CLAY  |OUTBK|NJ|   :           :    NJ| 1,4
:.......5|20|JOB   |MRHD |MN|   :           :    NY| 0,2
        6|56|BURGUM|FARGO|ND|...:...........:
        7|35|BOYD  |FLAX |NE|...:







S8. INTERSECTION OF RECORD POINTERS: Intersect RRN-sets then retrieve records

    SELECT NAME,CITY FROM STUDENT WHERE S#>25 AND (STATE=NE or STATE=NY);

S#-RRN-list      ST-RRN-list                intersection
1,7,3,6,0        0,2,7                      0,7






S9.  If there are Bitmapped Indexes, used in datawarehouses, where the
     data doesn't change except at infrequent pre-determined points in time
     (DW is "static"):

 BMI on ST
 ST|bit-filter
 ==|=========
 FL| 00010000
 MN| 00001000
 NE| 00000001
 ND| 00000010
 NJ| 01001000
 NY| 10100000
                  
 S#|bit-filter
 ==|=========
 17| 00100000
 20| 00000100
 25| 00001000
 32| 01000000  OR here to end (S# > 25) resulting in: 11010011
 35| 00000001  OR NE, NY bitfilters:                  10100001
 38| 00010000  AND the two for the result:            10000001
 56| 00000010
 57| 10000000


Note BitMapped Indexes (BMIs) are used only for "low cardinality" attrs in DWs.
               (those with a small domain - ie, only a few possible values)


The reason is that for low-cordinality domains (eg, MONTH, STATE, GENDER, etc.)
 BMI has few entries (rows) and each bitmap is quite dense (many 1-bits
 To see why this is so, consider two extreme cases.




CASE-1:  A GENDER attribute in a relation with 80,000 tuples.  BMI looks like:

GENDER|bit-filter
======+=================
Female|0111001010100...1
Male  |1000110101011...0



Each bitfilter is 80,000 bits or 10KB so the index is ~20KB
 - in fact with only two values, "Male" entry is unnecessary since it can be
   calculated from "Female" bitfilter as the bit-compliment.
 - Thus, the index is only ~10KB in size altogether.



If a regular index were used:

GENDER|RID-list
======+=================
Female|RID-F1, RID-F2, ..., RID-Fn
Male  |RID-M1, RID-M2, ..., RID-Mn'

n+n' = 80,000;  If RIDs take even 8 bytes (often bigger) the size is ~640KB



Thus the BMI size could be as low as    ~10KB
and the regular index size would be    ~640KB





CASE-2: SSN attr of employee file for large company (say, with 80,000 employees)

BMI:

SSN        |bit-filter
===========+=================
324-66-9870|1000000000000...0
...
475-77-5432|0000000000100...0
...
687-99-2536|0000000000000...1
 ^
 `--extant domain (only those SSN's of existing employees)


Each bitfilter 80Kb (10KB) so the index is 80,000 * ~10KB or ~800MB in size.

If a regular index were used:

SSN        |RID-list
===========+=================
324-66-9870|RID-1
...
475-77-5432|RID-11
...
687-99-2536|RID-80000

If RIDs take 8 bytes and SSN+separators take another 12 bytes,
        the size is ~20*80,000 bits = ~200KB

Thus the BMI size could be as low as    ~800,000KB
and the regular index size would be         ~200KB






S10. If there is a composite index on the attrs involved in condition,
                                   use it.  If a composite hash function, use it





Selection implementation is matter of choosing among these
alternatives (and possibly others?).







 SELECTION methods when there is a WHERE disjuntion (OR) in the condition




If there is no access path (indexes or hash fctns), use S1 (brute force).




If there are access paths , use them and UNION the results,
                                             or UNION the RID-sets,
                                                then get the records
                                                 (rather than interesection as
                                                 in the case of AND condition).



If there are BitMaps, take the OR of BitMaps, then get records 









We often just use "join" instead of always saying "equijoin" since almost all
   joins are equijoins in practice.






JOIN METHODS:




J1. NESTED LOOP:

R JOIN S on R.A=S.B      ( R |X|       S )
                                R.A=S.B

For each record, t in R, (outer or driver relation),
 retrieve every record, s from S, (inner relation),
 test join condition, if it's true,
 concatenate the tuples (project off unwanted columns) and output,
 else go to next inner-relation record.


SELECT SNAME,C#,GRADE  FROM STUDENT,ENROLL WHERE STUDENT.S#=ENROLL.S#

  (i.e., create individual report cards for the students)

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101| <-
|32|THAISZ|NJ5102|
|38|GOOD  |FL6321|
|17|BAID  |NY2091|
|57|BROWN |NY2092|
`----------------'

.ENROLL_____.
|S#|C#|GRADE|
|==|==|=====|
|32|8 | 89  |
|32|7 | 91  |
|25|7 | 68  | <-  output CLAY,7,68
|25|6 | 76  | <-  output CLAY,6,76
|32|6 | 62  |
|38|6 | 98  |
|17|5 | 96  |
 ------------




Move outer loop cursor down:

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101| 
|32|THAISZ|NJ5102| <-
|38|GOOD  |FL6321|
|17|BAID  |NY2091|
|57|BROWN |NY2092|
`----------------'

.ENROLL_____.
|S#|C#|GRADE|
|==|==|=====|
|32|8 | 89  | <-  output THAISZ,8,89
|32|7 | 91  | <-  output THAISZ,7,91
|25|7 | 68  |
|25|6 | 76  |
|32|6 | 62  | <-  output THAISZ,6,68
|38|6 | 98  |
|17|5 | 96  |
 ------------




Move outer loop cursor down:

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101| 
|32|THAISZ|NJ5102|   
|38|GOOD  |FL6321| <-
|17|BAID  |NY2091|
|57|BROWN |NY2092|
`----------------'

.ENROLL_____.
|S#|C#|GRADE|
|==|==|=====|
|32|8 | 89  |
|32|7 | 91  |
|25|7 | 68  |
|25|6 | 76  |
|32|6 | 62  |
|38|6 | 98  | <-  output GOOD,6,98
|17|5 | 96  |
 ------------




Move outer loop cursor down:

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101| 
|32|THAISZ|NJ5102|   
|38|GOOD  |FL6321| 
|17|BAID  |NY2091| <-
|57|BROWN |NY2092|
`----------------'

.ENROLL_____.
|S#|C#|GRADE|
|==|==|=====|
|32|8 | 89  |
|32|7 | 91  |
|25|7 | 68  |
|25|6 | 76  |
|32|6 | 62  |
|38|6 | 98  |
|17|5 | 96  | <-  output BAID,5,96
 ------------



Move outer loop cursor down:

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101| 
|32|THAISZ|NJ5102|   
|38|GOOD  |FL6321| 
|17|BAID  |NY2091|
|57|BROWN |NY2092| <-
`----------------'

.ENROLL_____.
|S#|C#|GRADE|
|==|==|=====|
|32|8 | 89  |
|32|7 | 91  |
|25|7 | 68  |
|25|6 | 76  |
|32|6 | 62  |
|38|6 | 98  |
|17|5 | 96  | no match, no output
 ------------






J2. When there is anIndex on one join attribute the join can be done
    in one pass (called Indexed Nested Loop:


If there is an index on S.B,
   get r in R, get matching S-tuples using access path
   (need not scan all of ENROLL each time as was necessary with nested-loop)

  .ENROLL_____.    Dense Index on ENROLL.S#
  |S#|C#|GRADE|    S#|RRN
  |==|==|=====|    ==|===
 0|17|5 | 96  |    17| 0
 1|25|7 | 68  |    25| 1,2
 2|25|6 | 76  |    32| 3,4,6
 3|32|8 | 89  |    38| 5
 4|32|7 | 91  |
 5|38|6 | 98  |
 6|32|6 | 62  |
   ----------- 

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101|<-get E-RRNs, 1,2, output 25,CLAY,7,68
|32|THAISZ|NJ5102|                      and 25,CLAY,6,76
|38|GOOD  |FL6321| 
|17|BAID  |NY2091|
|57|BROWN |NY2092|
`----------------'



  .ENROLL_____.    Dense Index on ENROLL.S#
  |S#|C#|GRADE|    S#|RRN
  |==|==|=====|    ==|===
 0|17|5 | 96  |    17| 0
 1|25|7 | 68  |    25| 1,2
 2|25|6 | 76  |    32| 3,4,6
 3|32|8 | 89  |    38| 5
 4|32|7 | 91  |
 5|38|6 | 98  |
 6|32|6 | 62  |
   ----------- 


.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|<-get E-RRN 3,4,6 output THAISZ,8,89
|38|GOOD  |FL6321|                     and THAISZ,7,91
|17|BAID  |NY2091|                     and THAISZ,6,62
|57|BROWN |NY2092|
`----------------'



  .ENROLL_____.    Dense Index on ENROLL.S#
  |S#|C#|GRADE|    S#|RRN
  |==|==|=====|    ==|===
 0|17|5 | 96  |    17| 0
 1|25|7 | 68  |    25| 1,2
 2|25|6 | 76  |    32| 3,4,6
 3|32|8 | 89  |    38| 5
 4|32|7 | 91  |
 5|38|6 | 98  |
 6|32|6 | 62  |
   ----------- 

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|
|38|GOOD  |FL6321| <- get E-RRN, 5 and output GOOD,6,98
|17|BAID  |NY2091|
|57|BROWN |NY2092|
`----------------'


  .ENROLL_____.    Dense Index on ENROLL.S#
  |S#|C#|GRADE|    S#|RRN
  |==|==|=====|    ==|===
 0|17|5 | 96  |    17| 0
 1|25|7 | 68  |    25| 1,2
 2|25|6 | 76  |    32| 3,4,6
 3|32|8 | 89  |    38| 5
 4|32|7 | 91  |
 5|38|6 | 98  |
 6|32|6 | 62  |
   ----------- 

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|
|38|GOOD  |FL6321| 
|17|BAID  |NY2091| <- get E-RRN, 0 and output BAID,5,96
|57|BROWN |NY2092|
`----------------'


  .ENROLL_____.    Dense Index on ENROLL.S#
  |S#|C#|GRADE|    S#|RRN
  |==|==|=====|    ==|===
 0|17|5 | 96  |    17| 0
 1|25|7 | 68  |    25| 1,2
 2|25|6 | 76  |    32| 3,4,6
 3|32|8 | 89  |    38| 5
 4|32|7 | 91  |
 5|38|6 | 98  |
 6|32|6 | 62  |
   ----------- 

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|                          
|38|GOOD  |FL6321| 
|17|BAID  |NY2091|
|57|BROWN |NY2092| <- no matches.
`----------------'







J3. MERGE JOIN:
    If both R.A and S.B are ordered
       (i.e., R is clustered on A and S is clustered on B),
    then scan both R and S in order.

  .ENROLL_____.  
  |S#|C#|GRADE|
  |==|==|=====|    
 0|17|5 | 96  |<-
 1|25|7 | 68  | 
 2|25|6 | 76  |
 3|32|8 | 89  |
 4|32|7 | 91  |
 5|32|6 | 62  |
 6|38|6 | 98  |
   ----------- 

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|17|BAID  |NY2091|<- output BAID,5,96
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|                          
|38|GOOD  |FL6321| 
|57|BROWN |NY2092|
`----------------'



Move cursors down:

  .ENROLL_____.  
  |S#|C#|GRADE|
  |==|==|=====|    
 0|17|5 | 96  |  
 1|25|7 | 68  | <-
 2|25|6 | 76  | <-
 3|32|8 | 89  |
 4|32|7 | 91  |
 5|32|6 | 62  |
 6|38|6 | 98  |
   ----------- 

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|17|BAID  |NY2091|
|25|CLAY  |NJ5101| <- output CLAY,7,68
|32|THAISZ|NJ5102|           CLAY,6,76
|38|GOOD  |FL6321| 
|57|BROWN |NY2092|
`----------------'



Move cursors down:

  .ENROLL_____.  
  |S#|C#|GRADE|
  |==|==|=====|    
 0|17|5 | 96  |
 1|25|7 | 68  | 
 2|25|6 | 76  |
 3|32|8 | 89  |<-
 4|32|7 | 91  |<-
 5|32|6 | 62  |<-
 6|38|6 | 98  |
   ----------- 

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|17|BAID  |NY2091|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|<- output THAISZ,8,89
|38|GOOD  |FL6321|          THAISZ,7,91
|57|BROWN |NY2092|          THAISZ,6,62
`----------------'



Move cursors down:

  .ENROLL_____.  
  |S#|C#|GRADE|
  |==|==|=====|    
 0|17|5 | 96  |
 1|25|7 | 68  | 
 2|25|6 | 76  |
 3|32|8 | 89  |
 4|32|7 | 91  |
 5|32|6 | 62  |
 6|38|6 | 98  |<-
   ----------- 

.STUDENT_________.
|S#|SNAME |LCODE |
|==|======|======|
|17|BAID  |NY2091|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|                          
|38|GOOD  |FL6321| <- output GOOD,6,98
|57|BROWN |NY2092|





J3'. SORT-MERGE JOIN: If R.A and S.B are not ordered, sort them first
                         (into R' clustered on A and S' clustered on B),
                      then apply MERGE (J2 above).






J4. HASH-JOIN:   RIDs are hashed to hash buckets (pages).
                  Corresponding buckets are retrieved and scanned.




GRACE JOIN: (first example of a hash-join technique):



Allocate M pages of memory to the join process.


Partition the M pages as follows:

  - One page for putting new pages as they are read from disk (called INPUT),
    B+1 for hash buckets R0,..,RB    ( therefore B = M-2 ).

Use hash function, h with range, {0,...,B}
                   (e.g., MOD(B) if A is numeric)

PHASE-0:
          _____________ 
      .--|             |
      :  |__IN_________|
  h(A)` >|             |
      :  |__R0_________|
      ` >|             |
      .  |__R1_________|
      ` >|             |
      :  |__R2_________|
      :  |  .          |
      :  |  .          |
      :  |_____________|
      ` >|             |
         |__RB_________|





Partial Sort Phase:


Partial-Sort-R:

Read each R-page to IN,
  hash each record using h(A) to R0,...RB.

Upon collision in any of the buckets (pages), R0..RB,
     flush it to temporary disk file (also called R0,...RB).






Partial-Sort-S:

Read each S-page into IN,
  hash each record with h(A) to R0..RB

Upon collision in any bucket, R0..RB,
     flush its' contents to temp disk file, S0..SB

.--------.                   .--------.
|        |                   |        |
|        |                   |        |
|        |                   |        |
`R-------'                   `-S------'
               |
          _____v_______ 
      .< |__IN_________|
   h(A) >|__R0_________| >.____. .____.
      ` >|__R1_________|  |    | |    |
      ` >|__R2_________|  |_R0_| |_S0_|
      :  |  . . .      |     :     :
      :  |_____________|  .____. .____.
      ` >|__RB_________| >|    | |    |
                          |_RB_| |_SB_|






Build Phase:

With each pair of temporary files, R0 & S0,  R1 & R2,  R2 & S2,...
     in turn, do as follows:

Re-partition memory into IN, OUT and one large hash area.

For Ri, BUILD internal a hash table in the hash area using
 another hash function, k(a)

FOR Si, PROBE hash table using k(a) for matches,
              output join of matches to OUT.

.----.                    .----.
|    |                    |    |
`Ri--'                    `Si--'
               |
          _____v_______ 
      .< |__IN_________|
      :  |             |
      `-k(A) >         |
         |             |
         |             |
         |             |
         |             |
         |__R0_________|
         |__OUT________|



Example (for simplicity we assume 1 rec/pg), M=4, h(S#)=MOD3(S#)

.S=ENROLL___.  
|S#|C#|GRADE|
|==|==|=====|    
|25|6 | 76  |
|32|8 | 89  |
|32|7 | 91  |
|17|5 | 96  |
|32|6 | 62  |
|38|6 | 98  |
|25|7 | 68  | 

.R=STUDENT_______.
|S#|SNAME |LCODE |
|==|======|======|
|57|BROWN |NY2092|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|                          
|17|BAID  |NY2091|
|38|GOOD  |FL6321|


Main_Memory_(5_pgs)
-IN-vv-------------

-R0-vv-------------

-R1-vv-------------

-R2-vv-------------

-------------------


Partially-Sort-R:

.R=STUDENT_______.
|S#|SNAME |LCODE |
|==|======|======|
|57|BROWN |NY2092| > .
|25|CLAY  |NJ5101|   :
|32|THAISZ|NJ5102|   :                       
|17|BAID  |NY2091|   :
|38|GOOD  |FL6321|   :
                     :
                     :
Main_Memory_(5_pgs)  :
-IN-vv-------------  :
|57|BROWN |NY2092| < ' h(57)=0-.
-R0-vv-------------            :
|57|BROWN |NY2092| < ----------'
-R1-vv-------------

-R2-vv-------------

-------------------


.R=STUDENT_______.
|S#|SNAME |LCODE |
|==|======|======|
|57|BROWN |NY2092|
|25|CLAY  |NJ5101| >.
|32|THAISZ|NJ5102|  :                       
|17|BAID  |NY2091|  :
|38|GOOD  |FL6321|  :
                    :
                    :
Main_Memory_(5_pgs) :
-IN-vv------------- :
|25|CLAY  |NJ5101|< ' h(25)=1.
-R0-vv-------------          :
|57|BROWN |NY2092|           :
-R1-vv-------------          :
|25|CLAY  |NJ5101| < --------'
-R2-vv-------------

-------------------



.R=STUDENT_______.
|S#|SNAME |LCODE |
|==|======|======|
|57|BROWN |NY2092|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102| >.
|17|BAID  |NY2091|  :
|38|GOOD  |FL6321|  :
                    :
Main_Memory_(5_pgs) :
IN----------------  :
|32|THAISZ|NJ5102|< ' h(32)=2.
R0----------------           :
|57|BROWN |NY2092|           :
R1----------------           :
|25|CLAY  |NJ5101|           :
R2----------------           :
|32|THAISZ|NJ5102| < --------'
------------------



.R=STUDENT_______.
|S#|SNAME |LCODE |
|==|======|======|
|57|BROWN |NY2092|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|
|17|BAID  |NY2091| >.
|38|GOOD  |FL6321|  :
                    :
Main_Memory_(5_pgs) :
IN----------------  :
|17|BAID  |NY2091|< ' h(17)=2.
R0----------------           :
|57|BROWN |NY2092|           :
R1----------------           :
|25|CLAY  |NJ5101|           :                        
R2----------------           :          .R2--------------.
|17|BAID  |NY2091| < OVFLW---' Flush 32 |32|THAISZ|NJ5102|
------------------



.R=STUDENT_______.
|S#|SNAME |LCODE |
|==|======|======|
|57|BROWN |NY2092|
|25|CLAY  |NJ5101|
|32|THAISZ|NJ5102|
|17|BAID  |NY2091|
|38|GOOD  |FL6321| >.
                    :
Main_Memory_(5_pgs) :
IN----------------  :
|38|GOOD  |FL6321|< ' h(38)=2.
R0----------------           :
|57|BROWN |NY2092|           :
R1----------------           :
|25|CLAY  |NJ5101|           :                        
R2----------------           :          .R2--------------.
|38|GOOD  |FL6321| < OVFLW---' Flush 17 |32|THAISZ|NJ5102|
------------------                      |17|BAID  |NY2091|



Flush all buffers to files, R0, R1, R2 on DISK-.
                                               :
Main_Memory_(5_pgs)                            v
IN----------------   
                                  
R0----------------               R0-----------------.
                                   |57|BROWN |NY2092|            
R1----------------               R1-----------------.
                                   |25|CLAY  |NJ5101|
R2----------------               R2-----------------.
                                   |32|THAISZ|NJ5102|
------------------                 |17|BAID  |NY2091|
                                   |38|GOOD  |FL6321|




Partially-Sort-S:

.S=ENROLL___.  
|S#|C#|GRADE|
|==|==|=====|    
|25|6 | 76  | >.
|32|8 | 89  |  :
|32|7 | 91  |  :
|17|5 | 96  |  :
|32|6 | 62  |  :
|38|6 | 98  |  :
|25|7 | 68  |  :
               :
Main_Memory)   :
IN-----------  :
|25|6 | 76  |< ' h(25)=1.
R0-----------           :          R0-----------------.
                        :            |57|BROWN |NY2092|            
R1-----------           :          R1-----------------.
|25|6 | 76  | < - - - - '            |25|CLAY  |NJ5101|
R2-----------                      R2-----------------.
                                     |32|THAISZ|NJ5102|
-------------                        |17|BAID  |NY2091|
                                     |38|GOOD  |FL6321|


.S=ENROLL___.  
|S#|C#|GRADE|
|==|==|=====|    
|25|6 | 76  |
|32|8 | 89  | >.
|32|7 | 91  |  :
|17|5 | 96  |  :
|32|6 | 62  |  :
|38|6 | 98  |  :
|25|7 | 68  |  :
               :
Main_Memory)   :
IN-----------  :
|32|8 | 89  |< ' h(32)=2.
R0-----------           :          R0-----------------.
                        :            |57|BROWN |NY2092|            
R1-----------           :          R1-----------------.
|25|6 | 76  |           :            |25|CLAY  |NJ5101|
R2-----------           :          R2-----------------.
|32|8 | 89  | < - - - - '            |32|THAISZ|NJ5102|
-------------                        |17|BAID  |NY2091|
                                     |38|GOOD  |FL6321|

.S=ENROLL___.  
|S#|C#|GRADE|
|==|==|=====|    
|25|6 | 76  |
|32|8 | 89  |
|32|7 | 91  | >.
|17|5 | 96  |  :
|32|6 | 62  |  :
|38|6 | 98  |  :
|25|7 | 68  |  :
               :
Main_Memory)   :
IN-----------  :
|32|7 | 91  |< ' h(32)=2.
R0-----------           :          R0-----------------.
                        :            |57|BROWN |NY2092|            
R1-----------           :          R1-----------------.
|25|6 | 76  |           :            |25|CLAY  |NJ5101|
R2-----------           :          R2-----------------.   S2------------.
|32|7 | 91  |< OVFL- - -' Flush      |32|THAISZ|NJ5102|     |32|8 | 89  |
-------------             32,8 to    |17|BAID  |NY2091|
                          disk 1st   |38|GOOD  |FL6321|     ^
                                 \_________________________/


.S=ENROLL___.  
|S#|C#|GRADE|
|==|==|=====|    
|25|6 | 76  |
|32|8 | 89  |
|32|7 | 91  |
|17|5 | 96  | >.
|32|6 | 62  |  :
|38|6 | 98  |  :
|25|7 | 68  |  :
               :
Main_Memory)   :
IN-----------  :
|17|5 | 96  |< ' h(17)=2.
R0-----------           :          R0-----------------.
                        :            |57|BROWN |NY2092|            
R1-----------           :          R1-----------------.
|25|6 | 76  |           :            |25|CLAY  |NJ5101|
R2-----------           :          R2-----------------.   S2------------.
|17|5 | 96  |< OVFL -- -' Flush      |32|THAISZ|NJ5102|     |32|8 | 89  |
-------------             32,7 to    |17|BAID  |NY2091|     |32|7 | 91  |
                          disk 1st   |38|GOOD  |FL6321|      
                                 \_________________________/


.S=ENROLL___.  
|S#|C#|GRADE|
|==|==|=====|    
|25|6 | 76  |
|32|8 | 89  |
|32|7 | 91  |
|17|5 | 96  |
|32|6 | 62  | >.
|38|6 | 98  |  :
|25|7 | 68  |  :
               :
Main_Memory)   :
IN-----------  :
|32|6 | 62  |< ' h(32)=2.
R0-----------           :          R0-----------------.
                        :            |57|BROWN |NY2092|            
R1-----------           :          R1-----------------.
|25|6 | 76  |           :            |25|CLAY  |NJ5101|
R2-----------           :          R2-----------------.   S2------------.
|32|6 | 62  |< OVFL- - -' Flush      |32|THAISZ|NJ5102|     |32|8 | 89  |
-------------             17,5 to    |17|BAID  |NY2091|     |32|7 | 91  |
                          disk 1st   |38|GOOD  |FL6321|     |17|5 | 96  |
                                 \_________________________/


.S=ENROLL___.  
|S#|C#|GRADE|
|==|==|=====|    
|25|6 | 76  |
|32|8 | 89  |
|32|7 | 91  |
|17|5 | 96  |
|32|6 | 62  |
|38|6 | 98  | >.
|25|7 | 68  |  :
               :
Main_Memory)   :
IN-----------  :
|38|6 | 98  |< ' h(38)=2.
R0-----------           :          R0-----------------.
                        :            |57|BROWN |NY2092|            
R1-----------           :          R1-----------------.
|25|6 | 76  |           :            |25|CLAY  |NJ5101|
R2-----------           :          R2-----------------.   S2------------.
|38|6 | 98  |< OVFL- - -' Flush      |32|THAISZ|NJ5102|     |32|8 | 89  |
-------------             32,6 to    |17|BAID  |NY2091|     |32|7 | 91  |
                          disk 1st   |38|GOOD  |FL6321|     |17|5 | 96  |
                                 \______________________.-> |32|6 | 62  |



.S=ENROLL___.  
|S#|C#|GRADE|
|==|==|=====|    
|25|6 | 76  |
|32|8 | 89  |
|32|7 | 91  |
|17|5 | 96  |
|32|6 | 62  |
|38|6 | 98  |
|25|7 | 68  | >.
               :
Main_Memory)   :
IN-----------  :                _______________________
|25|7 | 68  |< ' h(25)=1.      /                       \
R0-----------           :     /    R0-----------------. \
                        :Flush       |57|BROWN |NY2092|   v          
R1-----------           :25,6 to   R1-----------------.   S1------------.
|25|7 | 68  |< OVFL- - -'disk 1st    |25|CLAY  |NJ5101|     |25|6 | 76  |
R2-----------                      R2-----------------.   S2------------.
|38|6 | 98  |                        |32|THAISZ|NJ5102|     |32|8 | 89  |
-------------                        |17|BAID  |NY2091|     |32|7 | 91  |
                                     |38|GOOD  |FL6321|     |17|5 | 96  |
                                                            |32|6 | 62  |

FLUSH all buffers:

Main_Memory)  
IN-----------  
                                                      
R0-----------                      R0-----------------. 
                                     |57|BROWN |NY2092|   S1------------.
R1-----------                      R1-----------------.     |25|6 | 76  |
                                     |25|CLAY  |NJ5101|     |25|7 | 68  |
R2-----------                      R2-----------------.   S2------------.
                                     |32|THAISZ|NJ5102|     |32|8 | 89  |
-------------                        |17|BAID  |NY2091|     |32|7 | 91  |
                                     |38|GOOD  |FL6321|     |17|5 | 96  |
                                                            |32|6 | 62  |
                                                            |38|6 | 98  |




BUILD PHASE:

First, since S0 is an empty file on disk, skip the R0-S0 pair.

R1-S1: BUILD internal hash table in TBL from R1 using hash
             function, k(S#)=MOD2

Main_Memory
IN------------------
|25|CLAY  |NJ5101|  < Read from R1
TBL-----------------       \       R0-----------------. 
0                           \        |57|BROWN |NY2092|   S1------------.
 |------------------         \     R1-----------------.     |25|6 | 76  |
1                             `----  |25|CLAY  |NJ5101|     |25|7 | 68  |
OUT-----------------               R2-----------------.   S2------------.
                                     |32|THAISZ|NJ5102|     |32|8 | 89  |
--------------------                 |17|BAID  |NY2091|     |32|7 | 91  |
                                     |38|GOOD  |FL6321|     |17|5 | 96  |
                                                            |32|6 | 62  |
                                                            |38|6 | 98  |


Main_Memory
IN------------------
|25|CLAY  |NJ5101|  --k(25)=1.
TBL-----------------         :     R0-----------------. 
0                            :       |57|BROWN |NY2092|   S1------------.
 |------------------         :     R1-----------------.     |25|6 | 76  |
1|25|CLAY  |NJ5101|< --------'       |25|CLAY  |NJ5101|     |25|7 | 68  |
OUT-----------------               R2-----------------.   S2------------.
                                     |32|THAISZ|NJ5102|     |32|8 | 89  |
--------------------                 |17|BAID  |NY2091|     |32|7 | 91  |
                                     |38|GOOD  |FL6321|     |17|5 | 96  |
                                                            |32|6 | 62  |
                                                            |38|6 | 98  |





Probe S1 (one S-page at a time):

Main_Memory
IN------------------
|25|6 | 76  |       <  - - - - - - - - - - - - - - - - -.
TBL-----------------               R0-----------------. :
0                                    |57|BROWN |NY2092| : S1------------.
 |------------------               R1-----------------. `-- |25|6 | 76  |
1|25|CLAY  |NJ5101|                  |25|CLAY  |NJ5101|     |25|7 | 68  |
OUT-----------------               R2-----------------.   S2------------.
                                     |32|THAISZ|NJ5102|     |32|8 | 89  |
--------------------                 |17|BAID  |NY2091|     |32|7 | 91  |
                                     |38|GOOD  |FL6321|     |17|5 | 96  |
                                                            |32|6 | 62  |
                                                            |38|6 | 98  |

Main_Memory
IN------------------
|25|6 | 76  |       -k(25)=1.
TBL-----------------        :      R0-----------------. 
0                           :        |57|BROWN |NY2092|   S1------------.
 |------------------        :      R1-----------------.     |25|6 | 76  |
1|25|CLAY  |NJ5101| -match >:        |25|CLAY  |NJ5101|     |25|7 | 68  |
OUT-----------------        :      R2-----------------.   S2------------.
|CLAY,6 | 76  |     <  - - -'        |32|THAISZ|NJ5102|     |32|8 | 89  |
--------------------                 |17|BAID  |NY2091|     |32|7 | 91  |
                                     |38|GOOD  |FL6321|     |17|5 | 96  |
                                                            |32|6 | 62  |
                                                            |38|6 | 98  |


Main_Memory  (repeat for 2nd S1 tuple:)
IN----------------- < - - - - - - - - - - - - - - - - -.
|25|7 | 68  |       -k(25)=1.                           \
TBL-----------------        :      R0-----------------. :
0:                          :        |57|BROWN |NY2092| : S1------------.
 |------------------        :      R1-----------------. :   |25|6 | 76  |
1|25|CLAY  |NJ5101| -match- :        |25|CLAY  |NJ5101| `-- |25|7 | 68  |
OUT-----------------        :      R2-----------------.   S2------------.
|CLAY,7 | 68  |     k(32)=0   :                        
TBL-----------------    :       :   R0-----------------.  
 0|32|THAISZ|NJ5102| < -'       :     |57|BROWN |NY2092|   S1------------.
  |-----------------            :   R1-----------------.     |25|6 | 76  |
 1                              :     |25|CLAY  |NJ5101|     |25|7 | 68  |
OUT-----------------            :   R2-----------------.   S2------------.
|CLAY,7 | 68  |                 ` < - |32|THAISZ|NJ5102|     |32|8 | 89  |
--------------------                  |17|BAID  |NY2091|     |32|7 | 91  |
                                      |38|GOOD  |FL6321|     |17|5 | 96  |
                                                             |32|6 | 62  |
RESULT________                                               |38|6 | 98  |
|CLAY,6 | 76  |                        



IN------------------ < - - - - -.                          
|17|BAID  |NY2091| - >k(17)=1   :                        
TBL-----------------    :       :   R0-----------------.  
 0|32|THAISZ|NJ5102|    :       :     |57|BROWN |NY2092|   S1------------.
  |-----------------    :       :   R1-----------------.     |25|6 | 76  |
 1|17|BAID  |NY2091| < -'       :     |25|CLAY  |NJ5101|     |25|7 | 68  |
OUT-----------------            :   R2-----------------.   S2------------.
|CLAY,7 | 68  |                 :     |32|THAISZ|NJ5102|     |32|8 | 89  |
--------------------            ` < - |17|BAID  |NY2091|     |32|7 | 91  |
                                      |38|GOOD  |FL6321|     |17|5 | 96  |
                                                             |32|6 | 62  |
RESULT________                                               |38|6 | 98  |
|CLAY,6 | 76  |                        



IN------------------ < - - - - -.                          
|38|GOOD  |FL6321| - >k(38)=0   :                        
TBL-----------------    :       :   R0-----------------.  
 0|32|THAISZ|NJ5102|k(38)=0   :                        
TBL-----------------    :       :
 0|32|THAISZ|NJ5102|< OVFL      :
  |-----------------    :       :
 1|17|BAID  |NY2091|    :       :
OUT-----------------    :       :   R2-----------------.   S2------------.
|CLAY,7 | 68  |         :       :     |32|THAISZ|NJ5102|     |32|8 | 89  |
OVFL----------------    :       :     |17|BAID  |NY2091|     |32|7 | 91  |
|38|GOOD  |FL6321|  < _ '       ` < - |38|GOOD  |FL6321|     |17|5 | 96  |
--------------------                                         |32|6 | 62  |
RESULT________                                               |38|6 | 98  |
|CLAY,6 | 76  |                        



PROBE S2:

IN------------------ < - - - - -.                          
|32|8 | 89  |      - >k(32)=0   :                        
TBL-----------------    :       :
 0|32|THAISZ|NJ5102|< match     :
  |-----------------    :       :
 1|17|BAID  |NY2091|    :       :
OUT-----------------    :       :   S2------------.
 THAISZ,8,89        < flush     ` - - |32|8 | 89  |
OVFL----------------  1st             |32|7 | 91  |
|38|GOOD  |FL6321|      :             |17|5 | 96  |
--------------------    :             |32|6 | 62  |
RESULT________          :             |38|6 | 98  |
 CLAY,6,76              :              
 CLAY,7,68         < - -'


IN------------------ < - - - - -.                          
|32|7 | 91  |      - >k(32)=0   :                        
TBL-----------------    :       :
 0|32|THAISZ|NJ5102|< match     :
  |-----------------    :       :
 1|17|BAID  |NY2091|    :       :
OUT-----------------    :       :   S2------------.
 THAISZ,7,91        < flush     :     |32|8 | 89  |
OVFL----------------  1st       ` - - |32|7 | 91  |
|38|GOOD  |FL6321|      :             |17|5 | 96  |
--------------------    :             |32|6 | 62  |
RESULT________          :             |38|6 | 98  |
 CLAY,6,76              :              
 CLAY,7,68              :
 THAISZ,8,89       < - -'


IN------------------ < - - - - -.                          
|17|5 | 96  |      - >k(17)=1   :                        
TBL-----------------    :       :
 0|32|THAISZ|NJ5102|    :       :
  |-----------------    match   :
 1|17|BAID  |NY2091|< - '       :
OUT-----------------    :       :   S2------------.
   BAID,5,96        < flush     :     |32|8 | 89  |
OVFL----------------  1st       :     |32|7 | 91  |
|38|GOOD  |FL6321|      :       ` - - |17|5 | 96  |
--------------------    :             |32|6 | 62  |
RESULT________          :             |38|6 | 98  |
 CLAY,6,76              :              
 CLAY,7,68              :
 THAISZ,8,89            :
 THAISZ,7,91       < - -'



IN------------------ < - - - - -.                          
|32|6 | 62  |      - >k(32)=0   :                        
TBL-----------------    :       :
 0|32|THAISZ|NJ5102|< match     :
  |-----------------    :       :
 1|17|BAID  |NY2091|    :       :
OUT-----------------    :       :   S2------------.
 THAISZ,6,62        < flush     :     |32|8 | 89  |
OVFL----------------  1st       :     |32|7 | 91  |
|38|GOOD  |FL6321|      :       :     |17|5 | 96  |
--------------------    :       ` - - |32|6 | 62  |
RESULT________          :             |38|6 | 98  |
 CLAY,6,76              :              
 CLAY,7,68              :
 THAISZ,8,89            :
 THAISZ,7,91            :
 BAID,5,96         < - -'


IN------------------ < - - - - -.                          
|38|6 | 98  |      - >k(38)=0   :                        
TBL-----------------    :       :
 0|32|THAISZ|NJ5102|< no match  :
  |-----------------    :       :
 1|17|BAID  |NY2091|    :       :
OUT-----------------    :       :   S2------------.
 GOOD,6,98          < flush     :     |32|8 | 89  |
OVFL----------------    :       :     |32|7 | 91  |
|38|GOOD  |FL6321|  < match     :     |17|5 | 96  |
--------------------    :       :     |32|6 | 62  |
RESULT________          :       ` - - |38|6 | 98  |
 CLAY,6,76              :              
 CLAY,7,68              :
 THAISZ,8,89            :
 THAISZ,7,91            :
 BAID,5,96              :
 THAISZ,6,62       < - -'




IN------------------                                       

TBL-----------------            
 0|32|THAISZ|NJ5102|
  |-----------------            
 1|17|BAID  |NY2091|            
OUT-----------------                S2------------.
 GOOD,6,98          > flush           |32|8 | 89  |
OVFL----------------    :             |32|7 | 91  |
|38|GOOD  |FL6321|      :             |17|5 | 96  |
--------------------    :             |32|6 | 62  |
RESULT________          :             |38|6 | 98  |
 CLAY,6,76              :              
 CLAY,7,68              :
 THAISZ,8,89            :
 THAISZ,7,91            :
 BAID,5,96              :
 THAISZ,6,62            :
 GOOD,6,98         < - -'






Is there a better way?  (There is always a better way!)




Hybrid hash Join (developed by a former member of NDSU CS):


Partition the M pages of main memory allocated to the join process as:

One page for the INPUT buffer, One page for the OUTPUT buffer,
B pages for hash buckets, 1,..,B, the rest for hash bucket 0.

          _____________ 
      .--|             |
      :  |__IN_________|
  h(A)   |             |
      :  |             |
      ` >|             |
      :  |             |
      :  |             |
      :  |__R0_________|
      ` >|             |
      .  |__R1_________|
      ` >|             |
         |__R2_________|
      .  |  .          |
      `->|  .          |
         |__RB_________|
         |             |
         |__OUT________|

Read each R page to IN, hash each record using h(A) to R0..RB. If record hashes
to R0, apply internal hash function, k(A), copy record to corresp. slot in R0.

When collision in any pages R1..RB, flush to temp disk file, also called R1..RB.

Read each S page into IN, hash each record with h(A) to R0..RB

If record hashes to R0, apply internal hash function, k(A),
   and concatenate record with all matches found to OUT.

If Collision occurs in any page, R1..RB, flush to temporary disk file, S1..SB

.--------.                   .--------.
|        |                   |        |
|        |                   |        |
|        |                   |        |
`R-------'                   `-S------'
               |
          _____v_______ 
      .< |__IN_________|
   h(A)  |             |
      `-k(A)->         |
      :  |             |
      :  |__R0_________|  .____. .____.
      ` >|__R1_________| >|    | |    |
      ` >|__R2_________|  |_R1_| |_S1_|
      :  |  . . .      |     :     :
      :  |_____________|  .____. .____.
      ` >|__RB_________| >|    | |    |
         |__OUT________|  |_RB_| |_SB_|

2nd, 3rd,... PHASES:
Start process over with pairs, R1 & S1.
Then R2 & S2, etc.


PHASE-2:
.----.                    .----.
|    |                    |    |
`R1--'                    `S1--'
               |
          _____v_______ 
      .< |__IN_________|
      :  |             |
      `-k(A) >         |
         |             |
         |             |
         |             |
         |             |
         |             |
         |             |
         |__R0_________|
         |__OUT________|





Hybrid Hash Join with Bit Filtering: (to eliminate non-participating tuples
    early - avoid wasted processing of non-participating tuples).

PHASE 1 BUILD on R

.--------.       
|        |      
|        |     
|        |    
`R-------'......
               :
          _____v_______ 
.< h(A)< |__IN_________|
:        |             |
:        |             |   
:        |             |   
:        |             |   
:        |             |   
`>i0 >0..0             |   
:     FR0|             |
:      ` >k(a) >       |
:        |             |
:        |__________R0_|  .____.
`>i1 >1..0 >________R1_| >|    |
`>    FR1|  . . .      |  |_R1_|
`>       |_____________|  .____.
`>iB >0..1 >________RB_| >|    |
      FRB|__OUT________|  |_RB_|

Build Bit Filters FR0...FRB for R0...RB
    using hash functions, i0,...iB
    to be used to filter out some non-particpants
    in S during the PHASE 1 Probe:

PHASE 1 PROBE on S

.--------.       
|        |      
|        |     
|        |    
|        |    
|        |    
`S-------'......
               :
          _____v_______ 
.< h(A)< |__IN_________|
:        |             |
`>i0->.  |             |   
:     :  |             |   
`>iB->:  |             |   
......'  |             |   
:        |             |   
` >k(a) >|             |
:        |__________R0_|  .____.
`>j1 >1..0 >________R1_| >|_S1_|
`>    FS1   . . .               
`>       |_____________|  .____.
`>jB >0..1 >________RB_| >|_SB_|
      FSB|__OUT________|         

Build Filters, FS1...FSB for S0...SB
     using hash fctns, j1...jB
     to be used to filter out some non-particpants
     in R1..RB during the PHASE 2 builds:


PHASE-2.n build (n=1,2,...B):
.----.                    .____.
|    |                    |_Sn_|
`Rn--'..........                
               :
          _____v_______ 
.----- < |__IN_________|
:        |             |
`>jn >.  |             |   
......'  |             |   
:        |             |   
>in' >0..0             |   
     FRn'|             |
       ` >k(a) >       |
         |             |
         |             |
         |             |
         |             |
         |             |
         |__R0_________|
         |__OUT________|

Build Filter, FRn' for Rn using hash
functions, in' to be used to filter out
some non-particpants in Sn during the

PHASE 2.n Probe (n=1,2,...B):

.----.                    .____.
|    |                    |_Sn_|
`Rn--'         ...........:     
               :
          _____v_______ 
.----- < |__IN_________|
:        |             |
`>in'>.  |             |   
......'  |             |   
:        |             |   
` >k(a) >|             |
         |             |
         |             |
         |             |
         |             |
         |             |
         |__R0_________|
         |__OUT________|









PROJECTION: Projection is removal of certain attributes
        (columns) from a relation to produce the output result.

Given R(A,B,C,D,E), the projection onto A,B,D written,
PROJ(R)[A,B,D] is done by removing columns C and E and
then eliminating duplicated rows (tuples) from the result.
( A,B,D is called the projection attribute-list ),

Note: many system provide more flexible projection (duplicates not removed.)
Strictly speaking, not a relational operator since the result is not a relation.

If the attr-list contains a key, then there are no duplicates.
In this case, get each tuple, trim off unspecified attributes.

If list does not contain a key, sort (or hash) and then
 get a record,
 trim off non-attr-list attributes and
 eliminate duplicates.

Note: this can be expensive (about the same complexity as a join


Projection (with duplicate elimination) are similar to join techniques.


eg, nested loop,
For each tuple, scan the projection for duplicates.

  - Since there is a physical order to the tuples
     (even though it may not be any particular logical ordering)
     consider the tuples in that order and then scan from that
     tuple only (not from the beginning of the file).


Indexed nested loop,
For each tuple, consult index for duplicates and remove them.


Sort-remove (like sort-merge),
     Sort projection, then scan once for duplicates
     (now situated adjacent to each other) and remove


Hash methods.
Partially sort projection by hashing
          (similar to grace join??)
          then use reduced nested loop on each subset
          (one at a time) to discard duplicates.





Most common type of query is a Select-Project-Join (SPJ) query.

eg, SELECT C,D from R(A,B,C),S(E,A,D) WHERE R.A=S.A and B > 100



METHODS for fast SPJ processing:

MATERIALIZED VIEWS (MV),
DOMAIN VECTORS (DV)
JOIN INDICES  (JI),
others...





MATERIALIZED VIEW method is just a matter of precomputing the
query result and storing it for the next request of that query
(so that it does not have to be recomputed each time).

 - this may work well if the result is not too large and if
   the underlying base relations from which the view is
   generated are quite "static" (changed very seldom).











DOMAIN VECTORS make SPJ query processing more efficient.


In the Domain Vector approach,
    we "surrogate" values in a domain.

 Given domain, D,
 (eg, D={all 3 letter strings} for name field) for a field, A

Assign a number to each domain values with a
Domain Value Table (DVT):

DVT:
nam|surrogate
===|=========
aaa|0
aab|1
aac|2
...
aaz|25
...
zzz|17575


Then each attribute, R.A, has a Domain Vector:

DV(R.A) = (0010100100110...0) with a 1-bit in the
 nth position iff the Domain Value assigned surrogate, n,
 occurs as the R.A value in some tuple of R.


   - think if the DVT as a perfect hash function
   - think if DV(R.A) as a perfect filter on R.A



DV(CUSTOMER.nam) =
(0...1000000000010...010...010...0)
     ^          ^     ^     ^
1886-'          |     |     `13395
              1897   3289    SUE
             "JAY"   "JON"

(eg, JAN is 1886th domain value  or  has surrogate 1886)


Basic idea of the Domain Vector Accelerator method is:

Store DVs for join attributes.

To accelerate a join, R JOIN S on A, compute

  DV(R.A) AND DV(S.A)  (called the "join vector",
                        denoted JV(R JOIN S on A) )

  This JV specifies which join-attribute values
  are  common in both relations, R and S.

Using this information as a perfect bit filter the
 hybrid hash join method may be optimized.
           _____                         _____
   DV(R.A)|_____| - - -> AND <- - - - - |_____|DV(S.A)
                          v     
                      .------.
                      |__JV__|

      _R______                                _S______
     | |      |                              |    | | |
     | |      |>- .                          |    | | |
     | |      |   .                          |    | | |
     | |      |   .                          |    |A| |
     |A|______|   .                          |    | | |
                  .                          |____|_|_|
.------.          .
|__JV__|<-.   ____v___
    :     ` -|_IN_____|
    `- ->.-R0|>k(A)-> |           __R1____  __S1____
         .   |Internal|          |        ||        |
         .   | Hash   |        ->|________||________|
         .   | Table  |      '- ->        ` - - ` - ->.
         .   |________|     '     __R2____  __S2____  .
         .-R1|________|- - '     |        ||        | .
         .-R2|________|- - - - ->|________||________| .
         .   | .      |                   ` - - ` - ->:
         .   | .      |                               .
         .   |________|           __RB____  __SB____  .
         `>RB|________|- - - - ->|        ||        | .
             |__OUT___|          |________||________| .
                                          ` - - ` - ->:
         __IN____                                     :
        |________|     < - - - - - - - - - - - - - - -
        .
        k'(A)
         .   .--R0----.                     _OUT____
          `->|        |                    |________|
             |        | - > - - - - - - - '
             |________|




Also if there are indexes which translate these values
 surrogates (1-bit positions in JV) to the RIDs of the
 records that contain the values, then these indexes can
 be used to optimize the initial file accesses
 (only read pages that have participants)

Note, to reduce the size of these vectors we can just
 surrogate the "extant domain"  (currently existing domain
 values) then as new ones appear they can be assigned the
 next available surrogate.




*** Comparison of Hybrid Hash and Domain Vector Join *****

Figure 1: Hybrid Hash Join with bit-filtering via ji(A),Ji(A)
partitioning via h(A), internal-R0 hashing via k(A) and k'(A)

Assuming file, R, is smaller than file S.


    _R______  #1#  __IN____       #4#  _S______
   | |      |- -> |________|<- - - - -|    | | |
   | |      |      :                  |    | | |
   |A|______|      :                  |    | | |
                   v      OUT____     |____|A|_|
                  h(A)   |_______|                     
                  :        :
              #2# :   .--------.               .- -> - #7# - .
              #5# :-R0|-k(A)-> |         __R1__:_  __S1____  :
                  :   |Internal|        |        ||________| :
                  :   | Hash   | .-#3#->|________|  ^  :     :
		  :   | Table  | : .-J1(A)->[Sbf1*]-'  `#9#->:
Rbitfliters       :   |________| : #6#                       :
[Rbf1***]<-j1(A)<-:-R1|________|-'-'                         :
[Rbf2***]<-j2(A)<-:-R2   .                                   :
                  :      .         .-JB(A)->[SbfB*].         :
                  :    __._____    :     __RB____  v_SB____  :
[RbfB***]<-jB(A)<-`>RB|________|- - - ->|        ||________| :
                                        |________|     :     :
                                                 ` - - ` - ->:
                  __IN____                                   :
                 |________|   < -build with either Ri or Si -'
             #9# :  : #7.5#       use Ji & Sbfi to filter Ri
                 :  :
                 :  `Ji(A)->[Sbfi**]
                 :<- - - - - '
                 h(A)
                 :    .--R0----.                   _OUT____
            #8#  `- ->|ki(A)   |                  |________|
            #10#      |________| - > - - - - - - '



DV-Hash-Join (Note that the Request Vector, RV, is created to
 minimize paging at 1.6 and 4.6.

The special indexes, DVI(R.A) and DVI(S.A) translate
 A-surrogates to the RIDs of the records containing
 the corresponind A-values.

The RIDs can be used as lists or turned into bit vectors,
    RV(R|XS) and RV(S|XR) to completely optimize reading.

 (similar to the bit-vectors of BMI indexes)


           _____         #1#             _____
   DV(R.A)|_____| - - -> AND <- - - - - |_____|DV(S.A)
                          v     
                      .------.
                      |__JV__|
  . - - - - #1.2- - - -'   ` - - -#4.2 - .
  :/  _R______                           :/   _S______
  /  | |      |   #2#                    /   |    | | |
 /   | |      |>- - - -- .              /    |    | | |
 DVI | |      |           .             DVI  |    | | |
 \   | |      | #1.6      .             \    |    |A| |
  \  |A|______|<---.      .              \   |    | | |
   `.               .     .               \  |    | | |
    .               .     .                \ |____|_|_|
#1.4.               .     .          #4.4 .'  ^  .
    .  .___________.'     .  .___________v    :  .       
    `->|__RV(R|XS)_|      .  |__RV(S|XR)_|- -'   .
                         .                #4.6   .#4.8
                        .                      ..'
                      .< - - - - - - - - - - -'
                     .
           .--------'
          .
         .
         v
         h(A)           __OUT___
         .        .- ->|________|
         .   .--------.
         .-R0|>k(A)-> |           __R1____  __S1____
         .   |Internal|          |        ||        |
         .   | Hash   |        ->|________||________|
         .   | Table  |      '- ->        ` - - ` - ->.
         .   |________| #3# '#5#  __R2____  __S2____  .
         .-R1|________|- - '     |        ||        | .
         .-R2|________|- - - - ->|________||________| .
         .     .                          ` - - ` - ->:
         .     .                                      .
         .    ________            __RB____  __SB____  .
         `>RB|________|- - - - ->|        ||        | .
                                 |________||________| .
                                          ` - - ` - ->:
                                           #6#   #8#  .
         __IN____                                     :
        |________|     < - - - - - - - - - - - - - - -
        .
        k'(A)
         .   .--R0----.                     _OUT____
   #7#    `->|        |                    |________|
   #9#       |        | - > - - - - - - - '
             |________|




vertical select-project-join









QUERY PROCESSING IN DISTRIBUTED DBMS (DDBMS)



Desirable Features of a Distributed DBMS:



LOCATION TRANSPARENCY is achieved if a user can access needed data
without having to know which site has that data.


   -simplifies logic of programs

   -allows data movement as usage patterns change





A data object (typically a file) is fragmented if it is divided into multiple
pieces for storage and/or placement purposes at different sites.

 e.g., accounts files: Fargo customer accounts can be stored in Fargo,
                 Grand Forks customer accounts can be stored in Grand Forks...)


FRAGMENTATION TRANSPARENCY is achieved if users can access needed
data without having to know whether it is fragmented.







a data object (typically a record or file) is REPLICATED if it has ≥ 1 physical copy

   -distributed replication advantages include availability
   -disadvantages include increased update overhead


REPLICATION TRANSPARENCY is achieved if users can access needed
data without knowing whether or not it is replicated.








Additional desireable DDBMS features include:




LOCAL AUTONOMY is achieved if the system is distributed
   consistent with the logical and physical distribution of the enterprise.

  It allows local control over local data,
  It allows  local accountability and less dependency on remote Data Processing





Support for INCREMENTAL GROWTH AVAILABILITY and RELIABILITY.

  Distributed systems can more easily allow for graceful
  (and unlimited) growth simply by adding additional sites.

   The DDBMS software should allow for adding sites easily.

   Reliability can be provided by replicating data.

   The DDBMS should allow for replication to enhance
   reliability and availability in the presence of failures of sites or links.






DISTRIBUTED QUERIES

Query Optimization Methods can be

STATIC: strategy of transmissions and local processing
        activities is fully determined before execution
        begins  (at compile time) 

DYNAMIC: Each step is decided after seeing results of previous steps.




Response time usually is dominated by transmission costs

   (i.e., local processing times are negligible by comparison - assumed 0?).



One model is to take

RESPONSE time to be linear in number of bytes, X, sent:

   R(X) = AX + B

     B is the fixed (setup?) cost of the transmission and

    AX is the variable cost (depending on message size only, not distance).

        What assumptions does this make? network delays




STATIC, QUERY PROCESSING ALGORITHM

usually takes as input:

             database statistics such as
             relation sizes
             attribute sizes
             projected sizes of attributes    

produces as output:

             a strategy for answering the query

            (a pattern of what transmissions to make, when, where and
             what local processing to  do, when and where)



Usually involves 4 phases:


LOCAL PROCESSING phase: do all processing that can be done initially at
                 each site that doesn't require data interchange between sites.

                 (e.g., local selections, joins and projections)

                 The end result this phase is that there will be
                 one participating relation at each participating site.


REDUCTION phase: selected "semijoins" to be done to reduce the size of
           participating relations by eliminating tuples that are not needed
           in answering the query.



TRANSPORT phase: send one relation from each participating site
          (the result of the reduction phase) to the querying site.



COMPLETION phase: finishing up processing using those relations to get
            final answer (e.g., final projects, selects, joins)





What is the SEMIJOIN of R1(A,B) to R2(A,C) on A?  (written: R1 A→ R2 )

1.  projection R1[A]

2.  R1[A] A→ R2  (select those tuples of R2 participating in join)







For example,

 STUDENT-FILE 
.________________. 
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101| 
|32|THAISZ|NJ5102|
|38|GOOD  |FL6321| 
|17|BAID  |NY2091|
|57|BROWN |NY2092|
`----------------'

 ENROLL FILE
.___________.   
|S#|C#|GRADE|  
|==|==|=====|
|32|8 | 89  |    
|32|7 | 91  |   
|32|6 | 62  |  
|38|6 | 98  | 
 -----------

ENROLL S#→ STUDENT

1. project ENROLL onto the S# attribute:
 S# 
|32|
|38|

2.   Join the two relations on S#

 S#        S# SNAME  LCODE  
|32|      |25|CLAY  |NJ5101| 
|38| join |32|THAISZ|NJ5102|
          |38|GOOD  |FL6321| 
          |17|BAID  |NY2091|
          |57|BROWN |NY2092|

resulting in:

 S# SNAME  LCODE  
|32|THAISZ|NJ5102|
|38|GOOD  |FL6321| 



Note:
The result of R1 A→ R2 is the subrelation of R2 of only those tuples which will
 participate in the full join of R1 JOIN R2 on A

 (eliminates non-participants at the cost of sending the R1-join attributes)



A semijoin can be viewed as a special SELECTION operator also, since it
 selects out those tuples of R2 that have a matching A-value in R1.


Thus the semijoin is perfect for reducing the size of
   relations before they are sent to the querying site.





But note that semijoins don't always end up reducing the size of a relation.



Consider STUDENT S#→ ENROLL

Project STUDENT onto the S# attribute and join it with ENROLL:

 S#        S# C# GRADE  
|25| join |32|8 | 89  |    
|32|      |32|7 | 91  |   
|38|      |32|6 | 62  |  
|17|      |38|6 | 98  | 
|57|

resulting in:

 S# C# GRADE  
|32|8 | 89  |    
|32|7 | 91  |   
|32|6 | 62  |  
|38|6 | 98  | 
 -----------

which is identical to ENROLL
   ("so let's make it a rule" - never semijoin the primary key to a foreign key
    because it will always result in no reduction)







Distributed semijoin of R1 at site 1 to R2 at site 2 along A

1. projection R1[A]

2. transmission of R1[A] to the R2-site.

3. R1[A] A-join R2  (select R2-tuples that participate in join)






Consider the following distributed query:



Assume  R1 is at site 1 and R2 is at site 2 and
        the Query arrives at site 3 is:


SELECT R1.A2, R2.A2
  FROM R1, R2
 WHERE R1.A1 = R2.A1


At site 1:                           At site 2:

R1: A1 A2 A3 A4 A5 A6 A7 A8 A9       R2 A1 A2
    a  A  A  B  C  C  E  A  F           d   1
    a  C  D  D  E  A  A  B  B           e   2
    b  A  B  C  D  B  A  B  A           g   3
    c  D  D  B  B  A  C  A  C
    e  E  B  A  A  C  C  D  D



Assume response time for transmission of X bytes between any 2 sites is

               R(X) = X + 10        time units.




Strategy-1:    (No reduction phase).

1. Send R1 to site 3: 45 bytes sent. Cost is R(45)=45+10 = 55

2. Send R2 to site 3:  6 bytes sent. Cost of R(6)=  6+10 = 16

3. Final join (cost = 0) gives eEBAACCDD2.  Response time= 71





Strategy 1': If 1,2 are done in parallel,   Response time= 55




Strategy 2 (Using reduction: R2 A1→ R1)

1. Project:  R2[A1] = d e g                      COST=  0.

2. Send R2[A1] to site 1:                R(3) = 3+10 = 13.

3. Do R2 A1→ R1 giving:  eEBAACCDD       COST=  0.

4. Send reduced R1 to site 3:            R(9) = 9+10 = 19.

5. Send R2 to site 3:                    R(6) = 6+10 = 16.

6. Final join gives eEBAACCDD2                    Resp time 48





Strategy 2': If 2,5 done in parallel,             Resp time 32




So clearly the reduction phase can reduce response time of query



For static algorithms, the hard part is to decide at site 3 what strategy to
 use without knowing exactly what the data looks like at the other two sites.



There is a need to estimate the results of above, since the  actual results
 are not known in advance at site 3.


That estimation method is important, because the situation can
be very different than the above.


From Same query as above but different data:

At site 1:                             site 2
R1: A1 A2 A3 A4 A5 A6 A7 A8 A9       R2 A1 A2
    d  A  A  B  C  C  E  A  F           d   1
    d  C  D  D  E  A  A  B  B           e   2
    e  A  B  C  D  B  A  B  A           g   3
    g  D  D  B  B  A  C  A  C
    e  E  B  A  A  C  C  D  D

Then Strategy 2 would be:
1. Project:  R2[A1] = d,e,g           COST= 0.
2. Send R2[A1] to site 1:    R(3) = 3+10 = 13.
3. R2 A1-semijoin R1: dAABCCEAF   
                      dCDDEAABB   
                      eABCDBABA   
                      gDDBBACAC   
                      eEBAACCDD       COST= 0.
4. Send reduced R1 to site 3 R(45)=45+10 = 55.
5. Send R2 to site 3:        R(6) = 6+10 = 16.
6. Final join dAABCCEAF1
              dCDDEAABB1
              eABCDBABA2 
              gDDBBACAC3 
              eEBAACCDD1     Response time 84.

In fact, Strategy 1 would be better for this data situation.




Question: "How should reduction phase results be estimated?"





SELECTIVITY THEORY (Hevner, Yao) assumes data values are uniformly distributed
 and attribute-distributions are independent of each other.


Results estimated as follows: (assuming A1 has domain {a,b,...z})


The Selectivity of attributed R1 is the ratio of the number of values present
 (size of the extant domain) over the number of values possible (size of full
 domain) Therefore the selectivity of R1.A is 3/26


Using selectivity theory, we estimate the size of the
  semijoin, R1 A-semijoin R2 as:

(Original size of R2) * (selectivity of incoming attr, R1.A)
   or  45*3/26= 5.2


Selectivity theory estimates 5.2 bytes of R1 survive semijoin.

This is close for the first example database state and the
 algorithm proposed by Hevner & Yao (ALGORITHM-GENERAL)
 would correctly select method 1.

However, it is way off in the second database state but
 ALGORITHM-GENERAL would still select strategy-1
 (not the best for this database state).



UPDATE PROPAGATION: To update any replicated data item,
                    the DDBMS must propagate the new
                    value consistently to all copies.



IMMEDIATE method: update all copies
         (the update fails if even 1 copy is unavailable)


PRIMARY   method: designate 1 copy as primary for each item.
        Update is deemed complete (COMMITTED) when primary copy
              is updated.
        Primary copy site is responsible for broadcasting the
               update to the other sites.
        Broadcast can be done in parallel while the transaction
                 is contining, however that
                 runs counter to local autonomy theme



















APPENDIX:

Another join acceleration method in use today is

JOIN INDICES (JI):

Join indices are designed for complex SPJ queries and
parallel processing situations on large memory systems.

The properties of JI include:
  efficient use of memory,
  compatibility with other operations (select, union),



As discussed previously, record can be "surrogated"
             (given a number alias) in at least two ways:
1. Record IDentifiers (RIDs)  (usually  page,offset )
2. Relative Record Numbers (RRNs)
   (usually arrival order numbering)


In what follows, we let,
ri=surrogate for ith record of R,
sj=surrogate for jth record of S.



Consider the join, T = R.A JOIN S.B.

The JOIN INDEX (JI) for the join, T,
    above is  JI = {(ri,sj)| ri.A=sj.B}


Following is example using members-only
    buying club (eg, SAM's Club).


R=CUSTOMER
 ri nam loc age job
 -- --- --- --- ----
  1 JAY LA   25 STAR
  2 JON LA   45 HOOD
  3 JAN RO   93 DOCT
  4 SUE RO   16 PRGR

S=PURCHASE              JOIN INDEX
 si nam pur qty cost      ri  sj
 -- --- --- --- ----      --  --
  1 JON AK47  9  21       1   3
  2 SUE PS/2  3  68       2   1
  3 JAY SAAB  2 320       2   4
  4 JON BOMB  1 999       4   2


To produce MONTHLY billing statements we need the
JOIN-INDEX (JI) ordered on ri or we need an index on JI.ri


If JI is ordered on ri, then
   sequence thru R,
     for each ri,
     advance thru JI for matches.
   read S-record(s) by corresponding sj.


If JI is not ordered on ri, but there is an index on JI.ri,
   sequence thru R,
     for each ri,
     lookup it up in JI for corresponding sj's
   read S-record(s) by corresponding sj.



 R=CUSTOMER                     S=PURCHASE
 ri nam loc age job             sj nam pur qty cost
 -- --- --- --- ----            -- --- --- --- --- 
  1 JAY LA   25 STAR            1 JON AK47  9   21
  2 JON LA   45 HOOD            2 SUE PS/2  3   68
  3 JAN RO   93 DOCT            3 JAY SAAB  2  320
  4 SUE RO   16 PRGR            4 JON BOMB  1  999
                   
JI-index        JI   
   .1 3         ji-RRN  ri   sj
  6              1      2    1     
 / `2 1,4        2      4    2
5                3      1    3
 `4 2            4      2    4 




The semijoin of R.A to S.B, R.A-->S.B,
 results the subset of S consisting of
 all tuples which particpates in join R.A-JOIN-S.B.

It can be created many ways.
It can be created as projection
of R onto A followed by join of R[A].A to S.B

R.A-->S.B  =  R[A].A  JOIN  S.B


The semijoin R.nam-->S.nam query tree is
    (rotated on its side clockwise 90 degrees)

  R                            
 ri nam pur qty     R[nam]
 -- --- --- ---     ---
  1 JON AK47  9 --> JON
  2 SUE PS/2  3     SUE \           R.nam-->S.nam
  3 JAY SAAB  2     JAY  \          sj nam loc age job
  4 JON BOMB  1           \JOIN-->  -- --- --- --- ---
                          /          1 JAY LA   25 STAR
  S                      /           2 JON LA   45 HOOD
 sj nam loc age job     /            4 SUE RO   16 PRGR
  1 JAY LA   25 STAR   /
  2 JON LA   45 HOOD  /
  3 JAN RO   93 DOCT /
  4 SUE RO   16 PRGR