Database Design =  finding a "good" logical structure for the database.



   So far, design criteria has been just maintaining well-defined relations
                                     and choosing good primary keys.




Review of some definitions:


CANDIDATE KEY is a set of attributes, K from the schema which,
              at all times, satisfies:

     UNIQUENESS: no 2 distinct tuples have same K-value,
     MINIMALITY: None of Ai..Ak can be discarded and still maintain uniqueness.




PRIMARY KEY  = one candidate key designated primary


ALTERNATE KEY= non-primary candidate key


SUPERKEY     = attribute superset of a candidate key


NORMALIZATION= DB design activities to preclude operational anomalies,
                      and provide good key properties for projections

     Normalization classes are (more exist):

        1st Normal Form (1NF)

        2nd Normal Form (2NF)

        3rd Normal Form (3NF)

        BCNF Boyce Codd Normal Form (BCNF)

        4th Normal Form (4NF) ...





The defining conditions get stronger going down this list and therefore
the sets of  qualifying relations gets smaller going down this list. 



By defining conditions     4NF ⇒ BCNF ⇒ 3NF ⇒ 2NF ⇒ 1NF

By classes of relations    4NF ⊆ BCNF ⊆ 3NF ⊆ 2NF ⊆ 1NF







Given a relation R with attributes X and Y  (possibly composite)



R.Y is FUNCTIONALLY DEPENDENT (FD) on R.X    equivalently,

   R.X FUNCTIONALLY DETERMINES R.Y           written   R.X → R.Y

       iff  (x,y1) and (x,y2) ∈ R[X,Y]  implies y1=y2.





NOTES:

All attributes are FD on any attribute with the uniqueness property.


Functional Dependencies are like integrity constraints.

  They are stipulated to hold (all tuples for all times) by designers

  They can't be determined simply by observing the data state at a particular time.

  They are quite different from Association Rules.

       (ARs are approximate dependencies that hold at a given confidence level
            over a given subset at a particular time






For composite attribute, X            (composed of more than 1 attribute)

R.Y is FULLY FUNCTIONALLY DEPENDENT (FFD) on R.X

   R.X FULLY FUNCTIONALLY DETERMINES R.Y            written R.X ⇒ R.Y

    iff R.X → R.Y  but  R.Z not→ R.Y  for any proper subset Z ⊂ X






Said another way;

R.X is a candidate key (has uniqueness and minimality)  for R[X,Y]




These are SEMANTIC notions.  One needs to know how data is used or what is
                             intended (ask people who created/use data!)







NORMAL FORMS (initially assuming only one candidate key)




First Normal Form (1NF) = no repeating groups 
                                 (all attribute values are atomic)


        Allowing repeating groups in an attribute creates a situation in which
        the key does not determine a value in that attribute
        (but possibly many values)


We can use a memory technique, that 1NF means:


 note: THIS IS A MEMORY TECHNIQUE, NOT A DEFINITION
       (i.e., don't use it on the comprehensive!)

 every nonkey attribute is functionally dependent on key




This EDUCATION1 file is not in First Normal Form (1NF)

S# SNAME  CHILDREN LCODE  LSTATUS C# CNAME SITE GR
32|THAISZ|Ed,Jo,Ty|NJ5102|  1    | 8| DSDE| ND |89
25|CLAY  |Ann     |NJ5101|  1    | 7| CUS | ND |68
32|THAISZ|Ed,Jo,Ty|NJ5102|  1    | 7| CUS | ND |91
25|CLAY  |Ann     |NJ5101|  1    | 6| 3UA | NJ |76
32|THAISZ|Ed,Jo,Ty|NJ5102|  1    | 6| 3UA | NJ |62





This EDUCATION2 file is in First Normal Form (1NF)

S# SNAME  LCODE  LSTATUS C# CNAME SITE GR
32|THAISZ|NJ5102|  1    | 8| DSDE| ND |89
25|CLAY  |NJ5101|  1    | 7| CUS | ND |68
32|THAISZ|NJ5102|  1    | 7| CUS | ND |91
25|CLAY  |NJ5101|  1    | 6| 3UA | NJ |76
32|THAISZ|NJ5102|  1    | 6| 3UA | NJ |62





How do you get EDUCATION1 into 1NF?

1. Create a separate CHILDREN file:

    CHILD S#
   |Ed   |32|
   |Jo   |32|
   |Ty   |32|
   |Ann  |25|


or

2. Assuming a maximum number of children, say 3:

S# SNAME  CHILD1 CHILD2 CHILD3 LCODE  LSTATUS C# CNAME SITE GR
32|THAISZ|Ed    |Jo    |Ty    |NJ5102|  1    | 8| DSDE| ND |89
25|CLAY  |Ann   |      |      |NJ5101|  1    | 7| CUS | ND |68
32|THAISZ|Ed    |Jo    |Ty    |NJ5102|  1    | 7| CUS | ND |91
25|CLAY  |Ann   |      |      |NJ5101|  1    | 6| 3UA | NJ |76
32|THAISZ|Ed    |Jo    |Ty    |NJ5102|  1    | 6| 3UA | NJ |62






Second Normal Form (2NF) = 1NF and every nonkey attribute is
                    fully functionally dependent on the primary key.

 every nonkey attribute is functionally dependent on whole key




Why do we need (want) relations to be in 2NF?

    1NF relations which are not 2NF present PROBLEMS (anomalies):



S# SNAME  LCODE  LSTATUS C# CNAME SITE GR
32|THAISZ|NJ5102|  1    | 8| DSDE| ND |89
25|CLAY  |NJ5101|  1    | 7| CUS | ND |68
32|THAISZ|NJ5102|  1    | 7| CUS | ND |91
25|CLAY  |NJ5101|  1    | 6| 3UA | NJ |76
32|THAISZ|NJ5102|  1    | 6| 3UA | NJ |62


INSERT ANOMALY:
       Can't record JONES' LCODE until he takes a course.


DELETE ANOMALY:
       Delete 1st record (e.g., THAISZ drops DSDE), loose C#=8 is DSDE in ND


UPDATE ANOMALY:
       Change SITE of C# from NJ to ND search sequentially for all C#=6






E.g., In the EDUCATION1 file, with key (S#,C#), FD's which are not FFD are:

  (S#,C#) → SNAME
  (S#,C#) → LCODE
  (S#,C#) → LSTATUS
  (S#,C#) → CNAME
  (S#,C#) → SITE




We make these FD's into FFD's by breaking (projecting)
   the file into 3 files. (puts relations in 2NF)


STUDENTS = EDUCATION1[S#,SNAME,LCODE,LSTATUS]
ENROLL   = EDUCATION1[S#,C#,GRADE]
COURSE   = EDUCATION1[S#,CNAME,SITE]



STUDENTS________________.   .ENROLL_____.  .COURSE_______.
|S#|SNAME |LCODE |LSTATUS|  |S#|C#|GRADE|  |C#|CNAME|SITE|
|==|======|======|=======|  |==|==|=====|  |==|=====|====|
|25|CLAY  |NJ5101|  1    |  |32|8 | 89  |  |8 |DSDE |ND  |
|23|THAISZ|NJ5102|  1    |  |32|7 | 91  |  |7 |CUS  |ND  |
|38|GOOD  |FL6321|  4    |  |25|7 | 68  |  |6 |3UA  |NJ  |
|17|BAID  |NY2091|  3    |  |25|6 | 76  |  |5 |3UA  |ND  |
|57|BROWN |NY2092|  3    |  |32|6 | 62  |  


STUDENTS is in 2NF since it has a single attribute as key (S#)

COURSE is 2NF since it has a single attribute as key (C#)

ENROLL is 2NF since GRADE is FFD on (S#,C#)




Still, we have problems:

INSERT ANOMALY: Can't record LSTATUS of LOC=ND2987 until a student from ND2987 registers.

DELETE ANOMALY: Deleting THAISZ, loose LCODE=NJ5102 has LSTATUS=1

UPDATE ANOMOLY: Change LSTATUS=3 to 2 and LSTATUS=4 to 3 
                must search STUDENTS sequentially.

                (so LSTATUS=2 isn't skipped!)




The problem is: we have a transitive dependency: S# → LCODE → LSTATUS 

 i.e., FD which doesn't involve key (or part of key): LCODE → LSTATUS



A more common transitive dependency is "City-State":

COURSE
 C# CNAME CTY ST 
|8 |DSDE |Mot|ND|
|7 |CUS  |Mot|ND|
|6 |3UA  |Bay|NJ|
|5 |3UA  |Mot|ND|


Delete 3rd record (cancel course C#=6) loose fact that Bay is in NJ, etc.





Third Normal Form (3NF): 2NF and every nonkey attr is non-transitively dependent on primary key.

  (there are no transitive dependencies).

  every nonkey attribute is functionally dependent on nothing but key



We need to project

    STUDENT ≡ STUDENTS[S#,SNAME,LCODE]  and
   LOCATION ≡ STUDENTS[LCODE,LSTATUS]

 STUDENT              LOCATION           ENROLL          COURSE
 S# SNAME  LCODE      LCODE  LSTATUS     S# C# GRADE     C# CNAME SITE
|25|CLAY  |NJ5101    |NJ5101| 1         |32|8 | 89      |8 |DSDE |ND  
|23|THAISZ|NJ5102    |NJ5102| 1         |32|7 | 91      |7 |CUS  |ND  
|38|GOOD  |FL6321    |FL6321| 4         |25|7 | 68      |6 |3UA  |NJ  
|17|BAID  |NY2091    |NY2091| 3         |25|6 | 76      |5 |3UA  |ND  
|57|BROWN |NY2092    |NY2092| 3         |32|6 | 62   
 


In summary, the memory scheme to remember 3NF:

(not a definition!  Just a memory scheme)


every nonkey attribute is functionally dependent upon
 the key                    1NF
 the whole key and          2NF
 nothing but the key        3NF
 so help me Codd" (E.F. Codd invented relational model and normal forms)


This is an analogy based on the way in which witnesses are sworn into legal proceedings in the US:
 Do you swear to tell the truth, the whole truth
  and nothing but the truth, so help you God?"




DETERMINANT = any attribute on which some other attribute is functionally dependent.

The "The key"-part implies there is only one candidate key.





A more general normal form is more explicit with respect to alternate keys:


BCNF (Boyce/Codd Normal Form): The only determinants are superkeys

                              (superset of candidate key)




Example of a 3NF relation which is not BCNF.

 ENROLL
 S# C# GRADE TUTOR 
|32|8 | 89  |Xin  |
|32|7 | 91  |Sally|
|25|7 | 68  |Ahmed|
|25|6 | 76  |Ben  |
|32|6 | 62  |Amit |


Primary key = (S#,C#) and each tutor is assigned to only 1 course.
  (Course to which a tutor is assigned is determined, so TUTOR → C# )

FDs:
(S#,C#)→GRADE    (S#,C#)→TUTOR    (S#,C#)→(GRADE,TUTOR)   TUTOR→C#


This isn't BCNF (since Tutor is not superkey),

                but it is in 3NF  (Strictly speaking, since C# is not a non-key attr.






4th normal form (4NF):  BCNF and all Multivalue Dependencies (MVDs) are FDs



What are Multivalue Dependencies?

For R(A,X,Y), where A,X,Y are distinct attributes (possibly composite)

R[A,X], R[A,Y]  is a LOSSLESS decomposition of R  iff  R[A,X] A-JOIN R[A,Y] = R(A,X,Y)





A set of projections of a relation with at a one common attribute and such that
 every attribute is in at least one projection is called a DECOMPOSITION

The join of a decomposition is always a superset of the original relation.

Sometimes it is a proper superset
   (i.e., it includes SPURIOUS tuples that weren't in the original relation).




 R is always a subset of R[A,B] A-JOIN R[A,C]

PROOF:

 ∀ a,b,c ∈ R   (a,b)∈R[A,B] and a,c∈R[A,C].

 Thus,  a,b,c∈ R[A,B] A-JOIN R[A,C].





Heath's theorem says when the join is exactly the original relation
                (a lossless decomposition).
 



HEATH's THEOREM:

Given R(A,B,C), if A→B or A→C then  R = R[A,B] A-JOIN R[A,C]

  ie, if A→B or A→C then {R[A,B], R[A,C]} is a lossless decomposition



Again: the join of a decomposition always contains the original relation
       but it may be larger (contain spurious tuples)







Why call it a loss   when there are actually more tuples
                         (extra spurious ones) and

    call it lossless when there is no gain in size?




Proof of Heath's Theorem by contrapositive (Prove      P imples Q            true
                                            by showing NOT(Q) implies NOT(P) true)



ie, show NOT(R=R[A,B] A-JOIN R[A,C]) implies NOT(A→B or A→C ) = NOTA→B and NOTA→C

 (i.e., spurious tuples destroy at least one functional relationships)




NOT(R = R[A,B] A-JOIN R[A,C]) means

    ∃ (a,b,c) ∈ ( R[A,B] A-JOIN R[A,C] )  not in R (spurious)



(a,b,c) ∈ (R[A,B] JOIN R[A,C]) but ∉ R  implies (a,b) ∈ R[A,B] and
                                                 (a,c) ∈ R[A,C] implies

        ∃ (a,b,c'), (a,b',c) ∈ R∋: 

                c NOT= c'    and
                b NOT= b'    implies

                R.A does not functionally determine R.B   and
                R.A does not functionally determine R.C   QED.






Does Heath's Theorem say:  If R = R[A,B] A-JOIN R[A,C] then A→B
     ( which would tell us that A→B implies A→C )

No!     It is not an "if and only if".




Counter example:


  R:      R[A,B]:   R[A,C]:
a b  c      a b       a c
a b' c      a b'




R[A,B] JOIN R[A,C] = R     But A NOT→ B.

               a b  c
               a b' c


So the FD  A→B does not characterize lossless decomposition





Is there a condition which does characterize lossless decomposition?

   (i.e., an IF AND ONLY IF (IFF) condition for lossless decomposition)




Yes,   it is "Multivalued Dependency" or MVD:



Given R(A,B,C),   B is Multivalued Dependent on A

                  (or A multi-determines B), written: A→→B,

iff the set of B-values matching a given (a,c) pair in R
    depends only on the A-value, a

    (the same B-set is associated with a for all c's).





Another way: A→→B iff ∀ a∈R.A, RR.A=a[B,C] is a product.





If RR.A=a[B,C] is a product then clearly the set of B-values matching
 any pair, (a,c) is just the projection of that product onto the B-attribute
 and therefore A→→B





To prove A→→B implies RR.A=a[B,C] is a product,
 use the contrapositive arguement:



If there is an A-value, a, ∋: RR.A=a[B,C] is not a product then
       R contains tuples:

     a b1 c1
     a b1 c2
     a b2 c1              (but R does not contain  a b2 c2).




But then, for c1,  a→→b1,b2 
      but for c2,  a→→b1      not the same set of B-values!



MVD is a symmetric condition:

Theorem:  Given R(A,B,C),  A→→B  iff  A→→C

The proof follows directly from the previous Lemma
    (the condition, RR.A=a[B,C] is a product is symmetric in B and C).


Fagin's thm: {R[A,B],R[A,C]} is a lossless decompostion of R(A,B,C) iff A→→B

Proof:

To prove A→→B implies the decomposition is nonloss, prove the contrapositive:
 the decomposition is lossy implies A NOT→→ B.
   


   If the decomp is lossy, then there exists at least one (a,b) ∈ R[A,B]
   and (a,c) ∈ R[A,C] ∋: (a,b,c) ∉ R.




   Therefore, there is
   b' not= b in B such that (a,b',c) is in R and a
   c' not= c in C such that (a,b,c') is in R.

   Therefore pairs a,c and a,c' do not determine the same B-sets
   in R (since, b is in the B-set determined by a,c' while
           it is not in the B-set determined by a,c ).
       

To prove R=R[A,B]joinR[A,C] implies A→→B, prove the contrapositive:

         A NOT→→ B implies the decomp is lossy.

         A NOT→→ B means
                   there are distinct pairs a,c and a,c' ∈ R
         (and therefore in R[A,C]) which determine different
         B-sets in R, say b is in the B-set determined
         by a,c in R (and therefore a,b,c is in R)
         but b is not in the B-set determined by a,c' in R
         (and therefore a,b,c' is not in R),
         then a,b,c' is not in R.

But since a,b,c is in R, a,b is in R[A,B] and a,c' is in R[A,C]
so a,b,c' is in R[A,B] JOIN R[A,C]
but not in R and the decomposition is lossy.  




4th normal form (4NF)= BCNF and all MVDs are FDs

           (only dependencies are functional dependencies from superkeys)





OVERALL NORMALIZATION PROCESS:

0. Project off repeating groups (each as separate files with repeating group
   attribute as key and the original key as foreign key)
1. Take projections of 1NF to eliminate nonfull FDs - produce 2NF.
2. Take projections of 2NF to eliminate transitive FDs - produce 3NF.
3. Take projections of 3NF to eliminate remaining FDs where determinant
       is not candidate key - produce BCNF.
4. Take projections of BCNF to eliminate any MVDs not FDs - produce 4NF.






This discussion will stop with 4NF, however, there are 5NF, 6NF...18NF...

Normalization has become an art form.

It's not always completely clear what use will ever be made
 of some of these higher normal forms
 (that is; the anomolies being precluded are often somewhat
  involved and obscure).

some powerpoint notes on normalisation