The goals of this course include
 to initiate graduate student's into data and database systems research and
 to enhance graduate student's presentation skills of their research.



The GRADUATE SCHOOL, COLLEGE, TECHNICAL SCHOOL RELATIONSHIP

Universities, by definition, integrate research, teaching and service.


The Graduate school at a University has the primary responsibility for research.

A College has the primary responsibility for teaching.

A Vocational, Technical and Professional School has primary responsibility
for training in the use of specific existing tools of a trade, area or profession.



Don't be frightened away from research.

Even though 765 may be in your first graduate course, you have probably already
been doing a lot of research, so it won't be entirely new to you.


Research is just another word for active learning.
There is really little difference between active learning and research,

      sometimes with the slight difference that,

      early on, most concepts that you research
      have been pre-researched by others, while,

      later on, most concepts that you research
      have not been pre-research by others
   



      In both cases,


      the instructor

          advisor or guides,


      the student

           masters context, background and language of the area, and

           developes new or improved solutions to questions and problems.





A good researcher always takes the point of view that

          there is almost always a better way to do anything.



A good researcher questions the prevailing methods and challenge the

          current practices in an attempt to find a better way.



I like to call it finding a new, killer idea -and then

          taking the responsibility to prove that it is killer.





Additional reference material on most topics in this course can be found
 on the web by doing a google (or Yahoo or Ask) search on the
 appropriate keyword(s) and also  by using the NDSU library.





The topics covered in this course will include:

Horizontal Data
Vertical Data
Query Processing
Data Mining
Transaction Processing
Recovery
Normalization
Distributed Database Systems 
Security





INTRODUCTION



 CENTRALITY OF DATA


     Data are central to every computer program.

     If no data, there is
        no input,
        no output,
        no constant,
        no variable.



     It is hard to imagine a program in which there is no data?

     Therefore, virtually all programs are data management programs and

     therefore, virtually all computing involves data management.





However, not the all data in computer programs is RESIDUALIZED



    RESIDUALIZED data is data stored and managed after the

       termination of the program that generated it (for reuse later).



Database Management Systems (DBMSs) store and manage residualized data.







                   WHAT ARE SOME OF THE MAIN PROBLEMS WITH DATA



1. HUGE VOLUME    EVERYONE HAS LOTS OF DATA TODAY!


Data are collected much faster than it can be process or managed.



NASA's Earth Observation System (EOS), alone, has collected over
    15 petabytes of data already (15,000,000,000,000,000 bytes).

       Most of it will never be use  - ever!
       Most of it will never be seen - ever!
       Why not?

       There's so much volume,
       usefulness of some of it will never be discovered




SOLUTION: Reduce the volume and/or raise the information content thru
          structuring, querying, filtering, mining, summarizing, aggregating...


That is the main task of Data and Database workers today!



Claude Shannon's info theory principle comes into play here:

       More volume means less information. 



  ^                       . decision (1 bit? y/n)  
  |                      / \       
  |                     /   \ decision support
  VALUE                /     \   
   OF                 /       \ knowledge
  DATA               /         \   
 (INFO              /           \ information
 LEVEL)            /             \
  |               /               \ structured data
  |              /                 \
                /___________________\ raw data
                <---VOLUME of DATA--->




This is a simple example to illustrate the tradeoff

 between information level and volume level:




For example, the following 2 phone books are constructed from the same

             2 names (Steve, Dan) and 2 numbers (3479, 5683).



             Which phone book is the more useful?


             Which provides the more information?



   BOOK-1                       BOOK-2
   -------------                -------------
   Name    Phone                Name    Phone

   Steve   3479                 Steve   3479
   Dan     5683                 Dan     5683
                                Steve   5683
                                Dan     3479


The 2nd one has more volume but provides zero information.




REDUCING VOLUME WHILE
INCREASING INFORMATION CONTENT

may be the most important task in data and database system work today!






Another problem which needs attention is:


2. LACK OF STRUCTURE   THERE'S TOO MUCH RAW, UNSTRUCTURED DATA


   Previous example revisited:


        S4tD59a3ve7sen68


        is the previous phone data in totally raw, unstructured,
        non-redundant form and it is useless.  It needs structure.




How can data be structured?


HORIZONTAL STRUCTURING OF DATA:

 Form it into HORIZONTAL "records" of "fields".

 e.g., Employee records consisting of the horizontal concatenation of data fields


 SSN, LastName, MiddleIntital, FirstName, Address, Gender, OfficeNumber, Salary



 These horizontally structure records are then stored in a file (EMPLOYEE file?)

 One can order the records in those files by sorting the records (clustering)

 and/or provide meta-structures for quick access (pointers, indexes, hashs,..)




A Very important principle:

"Choose the sort attribute wisely!"

You have one chance to speed up access by sorting (clustering)


Seems obvious but it isn't to everyone.  E.g.,



Casey Stengel (former manager of the New York Yankees) once said

  "OK, you guys!  Line up alphabetically according to your height!".


  That cannot be done, in general

  (unless the alphabetic and height orderings happens to be identical,
   in which case the height and name are called "fully correlated" attributes.
   Such correlations constitute important information!)




DATA MINING can be thought of as the process of finding correlations
   relationships and other patterns in data.



Since you get just one physical ordering,
   other ordering requires a "meta"-structure (e.g., an index).




   Previous example re-revisited:

   The phone book can be physically order by

                   PHONE NUMBER            or by        NAME 

                   Name    Phone                        Name    Phone
                   ----    -----                        ----    -----
                   Steve   3479                         Dan     5683
                   Dan     5683                         Steve   3479  




Most phone books are ordered on NAME,

If you need fast access by PHONE NUMBER,
 you can build an index for it (which is an auxilliary file ordered on Phone#):



NAME ORDERED PHONE BOOK    ( RID = Record IDentifier )

   RID    Name    Phone
   ---    -----   ----
    1     Dan     5683
    2     Steve   3479



   
PHONE NUMBER INDEX

   Phone   RID
   ----    ---
   3479    2
   5679    1



STRUCTURING DATA

 may be the second most important task in data and database system work today!







Another important task to be addressed in data systems work today is:




3. RESIDUALIZATION OF DATA 


    MUCH WELL-STRUCTURED DATA IS DISCARDED PREMATURELY

    Databases are about storing data persistently, for later use.



age of infinite storage




RESIDUALIZING DATA

 may be the third most important task in data and database system work today!










WHAT IS A DATABASE?



An integrated shared repository of operational data of interest to an enterprise



INTEGRATED:        it must be the unification of several distinct files

SHARED:            same data can be used by more than 1 user (concurrently)

REPOSITORY:        implies "persistence".

OPERATIONAL DATA:  data on accounts, parts, patients, students,
                          employees, genes, stock, pixels,...
                   Non-operational: I/O data, transient data in buffers,
                   queues...

ENTERPRISE:        bank, warehouse, hospital, school, corp, gov agency, person..







WHAT IS A DATABASE MANAGEMENT SYSTEM (DBMS)




A program which organizes and manages access to residual data



Databases also contains METADATA  (data on the data).

Metadata is non-user data which contains the descriptive information
     about the data and database organization (Catalog data)







WHY USE A DATABASE?



COMPACTNESS           (saves space - no paper files necessary)


EASE OF USE           (less drudgery, more of the organizational and search
                       work done by the system; user specifies what, not how)


CENTRALIZED CONTROL   (by DB Administrator (DBA) and by the CEO)


REDUCES REDUNDANCY    (1 copy is enough, but then
                            concurrent use must be controlled)

NO INCONSISTENCIES    (again, since there is only 1 copy necessary)


ENFORCE STANDARDS     (corporate, dept, industry, national, international)


INTEGRITY CONSTRAINTS (automatically maintained)
                        (e.g., GENDER=male => MAIDEN_NAME=null)

BALANCE REQUIREMENTS  (even conflicting requirements?  DataBase Administrator 
                        (DBA) can optimize for the whole company)

DATA INDEPENDENCE     (occurs because applications are immune to
                         storage structure and access strategy changes.
                       Can change the storage structure without changing
                         the access programs and vice versa)






Almost all commerical databases today are HORIZONTAL

    Data is formed into files of horizontal records of a common type.

    So we will start by considering Horizontal databases.







HORIZONTAL DATA TERMINOLOGY



                       | stored   (physical, on disk)  |
FIELDS, RECORDS, FILES | logical  (as viewed by user)  |



                       | type         (e.g., datatype) |
FIELDS, RECORDS, FILES | occurrences  (instances)      |




"TYPE" and "OCCURRENCE" of files, records and fields (stored and logical).


TYPE:       defines structure and expected contents
            (time-independent - changes only upon DB reorganization)


OCCURRENCE: actual data instances at a given time (time-dependent - changes
            with every insert/delete/update)




STORED FIELD = smallest unit of stored data

                      .-----.
                e.g., |Jones|  is a "name" field occurrence
                      `-----'                    ==========
                Name:  Char 25 might be the metadata type of that occurrence.
                                                     ====
                   


STORED HORIZONTAL RECORD = named collection of related stored fields.

          .-----.----.----------.-----.--.-----.
    e.g., |Jones|John|412 Elm St|Fargo|ND|58102|
          `-----'----'----------'-----'--'-----'

            


 Employee: Lname(char 25), Fname(char 15), Address(char 20),

           City(char 15), St(char 2),     Zip(char 5)

           might be the metadata type of that record.




STORED FILE = named collection of all occurrences of 1 type of stored record


     |LNAME|FNAME| ADDRESS  |CITY |ST| ZIP |  < --- "employee file" type
     |=====|=====|==========|=====|==|=====|

     |Jones|John |412 Elm St|Fargo|ND|58102|  < ---."employee file" occurrences
     |-----|-----|----------|-----|--|-----|       | 
     |Smith|James|415 Oak St|Mhd  |MN|56560|  < ---|
     |-----|-----|----------|-----|--|-----|       |
     |Thom |Bob  |12 Main St|Mhd  |MN|56560|  < ---|
     |-----|-----|----------|-----|--|-----|       |
     |Trath|Phil |345 12thSt|Fargo|ND|58105|  < ---'
     `-----'-----'----------'-----'--'-----'


The "employee file" type is the "employee record" type

    (+ possibly, some other type characteristics such as max-#-records..)




In todays storage device world, there is still really only linear storage space.


Therefore the 2-dimensional picture of a stored file is just a picture and
strictly speaking, not possible in physical storage media.

The more accurately description of the above store file
(as a linear sequence of stored records) is:

|Jones|John |412 Elm St|Fargo|ND|58102|Smith|James|415 Oak St|Mhd  |MN|56560|Thom |Bob  |12 Main St|Mhd  |MN|56560|Trath|Phil |345 12thSt|Fargo|ND|58105|




Some day there may be truly 2-dimensional storage devices (or even 3-D devices)
such as holographic storage devices.  At that time, the 2-D picture of a stored
file, above, may be an accurate depiction.




The way these entities are stored is not necessarily the same
  as the way in which they are view or known to users.



They may be known to the users in various "logical" variations.

A simple example of a logical record coming from the physical employee record:

     .-----.-.-----.--.
     |Jones|J|Fargo|ND|
     `-----'-'-----'--'


So we have 

LOGICAL FIELD = smallest unit of logical data

LOGICAL RECORD= named collection of related logical fields.

LOGICAL FILE  = named collection of occurrences of 1 type of logical record.

           which may or may not correspond to the physical entities.





Unfortunately there is a lot of variation in terminology.
  It will suffice to "equate" terms as follows in this course:



TERMINOLOGY
COMMON USAGE	RELATIONAL MODEL 	TABULAR USAGE    
------------	----------------	-------------
File		Relation		Table
Record		tuple			row
field		attribute		column



When we need to be more careful we will use:



    relation is a "set" of tuples

              whereas a

    table    is a "sequence" of rows or records (has order)





    tuple is a "set" of fields

              whereas a

    row or record   is a "sequence" of fields (has order)









PHYSICAL DATA MODELS



         (for conceptualizing (logically)
          and storing (physically) data in a database)





HORIZONTAL MODELS for files of horizontal records

     - processing is thru vertical "scans" (or "access-path-enhanced-scans")




RELATIONAL    (simple flat unordered files or relations of records
                     of tuples of unordered field values)


TABULAR       (ordered files of ordered fields)



INVERTED LIST (Tabular with an access paths (index?) on every field)



HIERARCHICAL  (files with hierarchical links)



NETWORK       (files with record chains)



OBJECT-RELATIONAL  (Relational with  "Large OBject" (LOBs) fields)

                     (attributes which point to or contain complex objects)





VERTICAL MODELS (for vertical vectors or trees of attribute values
                      processing is typically thru horizontal AND/OR programs)


BINARY STORAGE MODEL (Copeland ~1986)  (uses vertical value and bit vectors)


BIT TRANSPOSE FILES  (Wang ~1988)  (used vertical bit files)


VIPER STRUCTURES    (~1998)  (used vertical bit vectors for data mining)


PREDICATE-TREEES or PTREES (vertical bit trees) (~1997)







A REVIEW OF HORIZONTAL DATA MODELS






RELATIONAL DATA MODEL



(only construct allowed is the simple flat "relation")



  STUDENT-FILE       COURSE FILE
|S#|SNAME |LCODE | |C#|CNAME|SITE|
|==|======|======| |==|=====|====|
|25|CLAY  |NJ5101| |8 |DSDE |ND  |
|32|THAISZ|NJ5102| |7 |CUS  |ND  |
|38|GOOD  |FL6321| |6 |3UA  |NJ  |
|17|BAID  |NY2091| |5 |3UA  |ND  |
|57|BROWN |NY2092| `-------------'

 ENROLL FILE       STATUS FILE
|S#|C#|GRADE|    |LCODE |STATUS|
|==|==|=====|    |======|======|
|32|8 | 89  |    |NJ5101|  1   |
|32|7 | 91  |    |NJ5102|  1   |
|25|7 | 68  |    |FL6321|  4   |
|25|6 | 76  |    |NY2091|  3   |
|32|6 | 62  |    |NY2092|  3   |
|38|6 | 98  |    `-------------'
|17|5 | 96  |

* STUDENT and COURSE represent entities.
* ENROLL file represents relationship between Students & Courses (many-many)
* STATUS represents a relationship between LCODE and STATUS (1-to-many)








HIERARCHICAL MODEL




                    entities      = records,
                    relationships = links of records forming trees


           (terminology in COMMON USAGE: files, records, fields)



EDUCATION example:


 root type is             STUDENT    (with attributes S#, NAME, LOCATION),

 dependent type is        COURSE     (with attributes C#, CNAME),

 2nd-level dependent type ENROLLMENT (with attrs, GRADE, LOC)



____________  _____________  ___________
25|CLAY|OTBK  32|THAISZ|KNB  38|GOOD|GTR     STUDENTS
__|____|____  __|______|___  __|____|___
       |                 |          |
    ___|_             ___|__      __|__
   |7|CUS|           |8|DSDE|    |6|3UA|
___|_|___|       .___|_|____|    |_|___|     COURSES
|6|3UA| |        |7|CUS |  |        |
|__|__| |    .___|_|____|  |        |
  |     |    |6|3UA|   |   |        |
  |     |    |_|___|   |   |        |
  |   _____       |    |  _|___   __|__
  |  |ND|68|      |    | |ND|89   NJ|98|
  |  |__|__|      |    | |_|___   __|__|     ENROLLMENTS
 _|___       _____|  __|__
|NJ|76|     |ND|62| |ND|91|
|__|__|     |__|__| |__|__|


   If the typical workload involves producing class lists
      for students, this organization is optimal.


   If the typical workload is producing course enrollment
      lists for professors, it is very poor.


   The point is, Hierarchical Data Model almost always favors
      a particular workload category (at the expense of the others).







NETWORK MODEL



           entities = records,

           relationships and by owner-member chains (sets)
           (many-to-many relationships easily represented)





EDUCATION example 2:


        owner-member chains STUDENT-ENROLLMENT COURSE-ENROLLMENT

______________       ________________
25|CLAY|NJ5101|<. .<|32|THAISZ|NJ5102|      STUDENTS
.-------------  : :  ---------------- 
.    __     __  : :  __     __    __ ^
`-->|68|-->|76|'  `>|89|-->|91|->|62|'      ENROLLMENTS
  .>`--.  >`--.  .->`--'  >`--:  >--'v
  :    :  :   : :      `. :   :  :   :
  :    `--:- -:-:- - - -:-'   :  :   :
  :       :   : :       `.    :  :   :
  :       :   : :        : .<-'  :   :
  :       :   `-:- - - --:-:- - -'   :
  :       `--.  :        : :         :
.-: - - - - -:-'         : :         :
| :_ _ _ _   `- - - - - -:-:-.       :
^_______  `-.  ________  : : ^ ______v_
8DSDE|ND|-. `<|7|CUS|ND|-:-' `|6|3UA|NJ|      COURSES
--------  .    --------  :     --------
          ` - - - - - - -'


simple to insert  (create new record and reset pointers),
          delete  (reset pointers),
          update  (always just 1 copy to worry about, NO REDUNDANCY)



network approach: fast processing,
                  complicated structure (usually requires data processing shop)


                  Again, it favors one workload type over others.






INVERTED LIST MODEL (TABULAR):




Flat Ordered Files

 (like relational except there's intrinsic order visible to
                user programs on both tuples and attributes)


    Order is usually "arrival order", means each record is given a
    unique "Relative Record Number" or RRN when it is inserted.
 
       - RRNs never change (unless there is a database reorganization).
       - Programs can access records by RRN.



   Physical placement of records on pages is in RRN order

    ("clustered on RRN" so that application programs can
      efficiently retrieve in RRN order.




Indexes, etc can be provided for other access paths (and orderings).

  1st page-.
 __________V_
|RID | S# |ST|           STATE-INDEX
|====|====|==|           .___._____.
| 1  | 25 |NJ|-._        |RID|STATE|
|----|----|--|   \       |===|=====|
| 2  | 32 |NJ|--. \   .--|_3_|_FL__|
|----|----|--|   `-\-/---|_2_|_NJ__|
| 3  | 38 |FL|------'----|_1_|_NJ__|
|----|----|--|        __-|_4_|_NY__|
| 4  | 47 |NY|-------' _-|_5_|_NY__|
`------------'        '
                     |
 ____________        |
| 5  | 57 |NY|-------'
|----|----|--|
|----|----|--|
|----|----|--|
|----|----|--|
|----|----|--|
|----|----|--|
`------------'
            ^
   2nd page-'









OBJECT RELATIONAL MODEL



Object Relational Model (OR model) is like relational model except

       repeating groups are allowed (many levels of repeating groups -
                                     even nested repeating groups)

       and

       pointers to very complex structures are allowed
       (LOBs for Large OBjects,


          BLOBs for Binary Large OBjects, etc.












VERTICAL DATA MODELS

 

POWERPOINT SLIDES INTRODUCING THE CURRENT WORLD OF DATA AND
AN INTRODUCTION TO VERTICAL DATA (P-TREES): vertical data














WHAT ARE SOME OF THE IMPORTANT
NEW CATEGORIES OF DATA TODAY?









Business Intelligence Data



The Business Intelligence Problem can be described as the problem of determining

 who is most likely to want what product, when?
 


This is obviously a critically important question to be able to answer for
any business today - especially web-based businesses.

Today, businesses cannot advertise to everyone

 and even if they did, the fact that everyone gets the advertisement would
 mean it is less likely to be effective.

Therefore targeted advertisers must answers the WHO WHAT WHEN problem:




How does a business determine:

    who to advertise
    what products to,
    when?





That's not a matter of just querying a database?
It would be nice if we had database query languages that were high
enough level so we could just issue that query and get the answer,
but we are decades away from that

 (maybe someone in this room will become a billionare developing that language?)


I can tell you that SQL is NOT that language or anything close.

The language will be a new

DATA MINING LANGUAGE


Data Mining Language is one of the very hottest topics these days,
 but it is in it's infancy (those versed in language issues take note!).

Before a DM Language can be developed, Data Mining techniques need to
be much more highly developed, tested and understood.

We will concetrate a lot of this course on that issue.



How do we test data mining techniques?



Traditionally, BENCHMARK SUITES are developed by industry groups,
against which each new product is run to evaluate its performance.


Most Benchmarks Suites are created as more or less, toys
 in size and realisticness (my opinion),
 by the same people who will later run their products against those benchmarks
 (so bias is always a concern)
 and therefore the BENCHMARKS fail to be very useful to anyone



 (Witnesss: Has any new database system every failed to win
            on the TPC benchmarks for database queries?)



FORTUNATELY there is still hope for a legitimate and useful
Business intelligence Benchmark (my opinion).  

It may come out of the NETFLIX PRIZE dataset and software.


Netflix may be the smartest company in the world today!

I base that on the fact that they have set a $1,000,000.00 prize for anyone
(team) that can improve their current business intelligence algorithm by at
least 10%.

They give 5 years to do it!

Over 100,000 of the best data mining scientists (and students) in the world
are devoting most of the energies to winning this contest right now
(over 25,000 teams registered!).

That's $10 per scientist!  Try to hire a scientist for $10 ?

The current leader board shows an improvement of ~7-8 % already
so there is some success.

Each year (year 1 ends in October of 2007) they give $50,000 to the
LEADER BOARD leader.



If someone post a winning solution (bettering the Netflix algorithm by at least 10%,

ALL OTHER CONTESTANTS GET AN ADDITIONAL 30 DAYS TO BEAT THAT SCORE!

Can you imagine how much additional, intense effort they will get for free
during those 30 days?!?


That is so smart!!


The point really is though, that this dataset is a real life Business Intelligence
data set (massive and real, not toy!) and
the problem is THE Business intelligence problem!

It WILL develop into a Benchmark which will be used by everyone for decades!


You might consider developing your research paper/presentation problem
around the Netflix problem and data set

(Just a portion of the problem?
or one simple step toward the solution?
If you should end up solving the full problem,
you won't have to care about this course any more!)

I'd be happy to talk about the problem more and talk about paper topics that
 might come out of the problem.   Let me know if you are interested in that.















Medical Imaging Data



There are many technologies for medical imaging these days, including

XRay

Computed Tomography (CAT scans or CTA scans)

Magnetic Resonance Imaging (MRI scans)

Positron Emmision Tomography (PET scans)

Electromyography (EMG)

Electrocardiography (ECG or EKG)

Electroencephalography (EEG)

Magnetoencephalography (MEG)


just to name a few.



All of these produce an image which is processed into a data set with
many derived attributes (by software developed by experts in the field
to isolate the important characteristics of spots on the image and the
area surrounding spots on the image).

These derived data set must be data mined to determine evidence of desease
(Computer Aided Detection or CAD).  This is a very hot topic these days.


A good benchmark is developing in this CAD data area which developed
 out of the Association of Computing Machinery
 Knowledge Discovery and Datamining Cup  or  KDD CUP
 contest in 2006 ACM KDD Cup 2006



The NDSU team won the NPV Task of that contest, which was advertised as
THE HOLY GRAIL OF CAD

Here are some notes describing the work: PE CAD

Here you will find a paper describing each winner's work  ACM SIGKDD Explorations, Dec, 2006


















Market Basket Research (MBR) Data


(Which can be view horizontally or vertically)





Some interesting real-life FACTs:
---------------------------------


 - Walmart lead the revolution (but in parallel with published results -
   they keep all their discoveries secret - still do)

        - they used MBR data mining for SUPPLY MANAGEMENT

        - they used MBR data mining to know precisely
             What quantity of each item to have at
             Where (which stores),
             When.

 



Proof that MBR data mining is important
---------------------------------------

     - in 1990 Kmart was much larger than Walmart

     - Kmart decided not to do data mining and
       only started getting into it in the last few years

     - Recently, two events occured
         Kmart filed for bancrupcy protection
         Walmart became the largest corporation in the world


So data management and mining is extremely useful/important!










Market basket data (MBR), the Set Model:


   Cash Register Data at a retail store

        A "transaction" or tuple in a MBR file is a customer at a
           particular time, going through checkout with a "market basket"
           or shopping cart full of items.



       Data is organized as a table with 2 attributes




"Set Transaction Table":  STT(Tid, Itemset)


     Tid     is transaction identifier (a particular market basket, and
     Itemset is the set of items associated with Tid (the items in the basket).



     Note: tuples are not "flat" (each itemset is a "set") (not RELATIONAL)


           That's a problem for efficient mgmt and analysis, so typically
           a change-over is made to a "dual model", the binary model:






"Binary Transaction Table": BTT(Tid, Item-1, Item-2,... Item-n)


    Each Row    is a customer transaction (1 cust thru checkout 1 time)

    Each column is a particular Item (1 column for each item)

                The value is either a 1-bit ( if item is in the basket)
                          or a 0-bit (item is not in the basket)


    Items = precisely the items available for purchase

            (in this analysis we don't care how many of an item the customer
             has in his/her cart, just whether he/she has that item)





The SetTransTable, STT(Tid, ItemSet) can be

    "transformed" to the binary model, BTT, but also it can be

    "rotated" to the SetItemTable SIT(Iid, TransSet)

      - SIT can be transformed to the binary model, BIT, also.
            but we note it is just the transpose of BTT.




 - This simple tranformation of the structure from a two
     column non-relational model to the binary model set the stage for
     a whole revolution in MBR data mining from 92-present!








STT(Tid,ItemSet) and BTT(Tid,Item1..Item-n) don't deal with item hierarchy


      eg, milk is an item at 1 level, but breaks down into

          skim, whole, 2%..  at the next level down the concept hierarchy...


   Work on hierarchical Market Basket Research has just begun.

   It would make a great research topic for this course.






Thus, in the MBR models, these "Binary Transaction Tables" are
    - extremely wide         (many many feature attributes)

    - extremely sparse       (mostly 0's - most customers buy just a few
                                           items in one shopping trip!)








Bioinformatics/genetics data is remarkably similar to MBD

Microarray Data Analysis (MDA) is the analysis of the gene expression levels
 of thousands of genes spotted on glass slides called a Microarrays and
 subjected to various "treatments" or experiments



MDA is often stored as an Excel spreadsheet,

RealGeneTbl:    RGT(Gid,E1...En)

       - row    = gene
       - column = experiment (plus other columns)
       - value = expression value or log ratio of expression values

           




BinaryGeneTbl:  BGT(Gid,E1...En)

   is the table you get by setting a threshold expression ratio
   and recording 1 iff it is exceeded:



SetGeneTbl: SGT(Gid, ExpSet) is its tanslation to the set model.



SetExpTbl: SET(Eid, GeneSet) is obtained by rotating SGT
   row = experiment,   
   col = Genes whose expr level exceeds threshold for that exp

 

BinaryExpTbl: BET(Eid, Gene1,...,GeneN) is similar to BTT in MBR


Microarray methodology animation Microarray methodology animation





A fairly good benchmark is developing in this MBR data area also which developed
 out of the Association of Computing Machinery
 Knowledge Discovery and Datamining Cup  or  KDD CUP
 contest in 2002 ACM KDD Cup 2002

The NDSU team tie for second on Task 3 of that contest (actually
got the highest score on the "broad class problem" of that task).
Here is a paper describing each winner's and placer's work
 ACM SIGKDD Explorations, Dec, 2002





Formally in MBR, the BTT is defined as follows:


I={i1..im} is the set of items.

     - eg, an item for purchase in a store

     - Each item in a store is an attribute, Ai,

       - with Boolean values (1 = in a customer's "market basket
                             or shopping cart" and 0=not in it).



A is an itemsets (or Iset) if it is a subset of I,

    - eg, a set of items for purchase from a store



D = {ti..tn} is the set of transactions

    - each ti consists of an identifier and an itemset:
           ti = ( t-id, t-itemset )

    - eg, a customer going thru checkout,
          the set of items in her/his cart




A transaction,t, SUPPORTS an itemset,A,
                          if A IS CONTAINED IN t-itemset.






Stock market analysis data


It is not even clear what data should be considered
 "Stock Market Analysis Data" since it is not clear what are
 the best "signals" for future stock rise or fall?

This is an exciting area in which to work (but difficult).

Those who have strong backgrounds in financial data and the
 stock market are invited to do a project in this area.






Signal data (scientific)


A signal can be thought of as any function

      f:{0,1,2,3...} -- > Real or Complex numbers
 or   f:Real_numbers -- > Real or Complex numbers


The first is referred to as a discrete signal and the second
 as a continuous signal (domain is often taken to be time axis).


Signals can also be defined on the Complex number space:

      f:Complex_numbers -- > Real or Complex numbers


Signals can certainly be defined on a Cartesian product of any
 or all of the domains above, as well.


Signal data comes from many many sources (sensors), including

 satellite platforms,
 aerial platforms,
 radar facilities,
 fixed position ground sensors and many more,
 nano-sensors,
 etc.



Signal processing is usually 1st step in making this data useful

But typically it is not enough.



Alien Technologies is building a 50 Million dollar plant 3 blocks from here
(in the NDSU Research Technology Park - just west 18th ST and south of 19 Ave.)
in which they will produce billions of Radio Frequency ID tags (RFid tags) for
Walmart, Gillette and etc.

These nano-scale passive RF tags will be embedded in every pallete that
Walmart and Gillette etc. ship. That way an active RF sensor can be passed
within 10 feet of the pallete and read the information (like bar code info).

The building just east and north of Phoenix Engineering
houses the Center for Nano-Scale Science and Engineering or CNSE
CNSE
where research on nano-scale products is being conduted for the defense
department (in consort with Alien Corp.).






VLSI design data


The VLSI component testing process involves massive arrays of
 binary data (hundreds of rows and billions of columns)
 and a hierarchical structure of subcomponents.







Data Warehouse (multidimensional) data


data warehouse slides (DW)



The following is a text based treatment of the topic.  It will not be
lectured over but is included for an additional treatment of the subject
and for review and practice.


                      MULTIDIMENSIONAL DATA


(roughly, it's data with more than 1 natural "keys" or look up attributes)

Multidimnesional data comes up in the context of Datawarehouses

 - A datawarehouse (DW) is a database which is typically:
    - static (no updates, only inserts of new data)
    - read only

 - In a datawarehouse, at the end of each business cylce
     (end of the day - after the data is no longer changing
      due to intra-day activity) a copy of the relevant data
      is appended to the DW with the proper timestamp.

 - Knowledge workers can analyze the data without having to wait
   for updates to be installed (no concurrency control required)


Multidimensional data is often organized into a Data Cube.

      - A data cube allows data to be modeled and viewed in multiple dimensions.
      - It is defined by dimensions and facts.

  - The Fact is the central theme of the organization and
    instances of the fact are contained in a FACT TABLE

   - the Fact table contains key attributes (one for each dimension
     that defines the fact) and descriptive attributes of the
     facts (measurements).

  - Dimensions are perspectives from which to view facts
     - combinations of dimension keys are keys for the facts
     - Dimensions records are kept in a dimension files.






Example:  AllElectronics Sales:

A fact is a sale
Dimensions are time    (Q1,Q2,Q3,Q4),
               item    (home-ent,computer,phone,security) and
               location(Chicago,New_York,Toronto,Vancouver)


Relational Model (4 relations, 1 for each location, with fixed
records, Q1,Q2,Q3,Q4  and identical attributes, ent,com,ph,sec
 
     Loc="Chicago"      Loc="New York"      Loc="Toronto"     Loc="Vancouver"
     ent  com ph sec    ent  com ph  sec    ent  com ph sec   ent  com ph sec
Q1   854  882 89 623   1087  968 38  872    818  746 43 591   605  825 14 400
Q2   943  890 64 698   1130 1024 41  925    894  769 52 682   680  952 31 512
Q3  1032  924 59 789   1034 1048 45 1002    940  795 58 728   812 1023 30 501
Q4  1129  992 63 870   1142 1091 54  984    978  864 59 784   927 1038 38 580

Note that ent,com,ph,sec are not really "descriptive attributes" of quarters
and  quarters are not really tuple instances since they are always the same 4.

A better model is the

Star Model (3-D Fact Cube of sales by time/item/location:
                                 ____________________________                   
                    y           /      /      /      /      /|
                   t    Chicago/      /      /      /      / |
                  c           /  854 /  882 /  89  /  623 /  |
                             /______/______/______/______/   |
                N           /      /      /      /      /|   |                 
               O   New York/      /      /      /      / |  /|                 
              I           / 1087 /  968 /   38 /  872 /  | / |                 
             T           /______/______/______/______/   |/ 8|
            A           /      /      /      /      /|   / 9 |             
           C    Toronto/      /      /      /      / |  /|6 /|             
          O           /  818 /  746 /   43 /  591 /  | / | / |            
         L           /______/______/______/______/   |/ 5|/ 9|
                    /      /      /      /      /|   | 2 | 8 |
          Vancouver/      /      /      /      / |  /|9 /|7 /|
                  /      /      /      /      /  | / | / 2 / | 
                 /______/______/______/______/   |/ 2|/ 0|/ 0|
                 |      |      |      |      |   | 8 | 0 | 7 /               
       T      Q1 |  605 |  825 |  14  | 400  |  /|6 /|1 /|8 /                
       I         |      |      |      |      | / | / | / | /                 
       M         |______|______|______|______|/  |/ 8|/ 4|/                  
       E         |      |      |      |      |   | 2 | 8 /                   
              Q2 |  680 |  952 |  31  | 512  |  /|7 /|9 /                    
       q         |      |      |      |      | / | / | /                     
       t         |______|______|______|______|/  |/ 4|/                      
       r         |      |      |      |      |   | 8 /                       
       s      Q3 |  812 | 1023 |  30  | 501  |  /|7 /                        
                 |      |      |      |      | / | /                  
                 |______|______|______|______|/  |/                  
                 |      |      |      |      |   /                  
              Q4 |  927 | 1038 |  38  | 580  |  /                  
                 |      |      |      |      | /                  
                 |______|______|______|______|/                  
                home-ent  comp  phone  security                 
                             ITEM                           
                                                             
Then each dimension (ITEM, TIME, LOCATION) is fully described in
a separate table (dimension table.


If there are 4 dimensions (add supplier), the
4-D Fact Cube of sales by time/item/location/supplier
can be shown as a series of 3-D cubes:
                                                               
                supplier= "SUP1"
            ___________________    
  C Chicago/    /    /    /   /|
 O        /____/____/____/___/ |
L      NY/    /    /    /   /| |
        /____/____/____/___/ |/|
Toronto/    /    /    /   /| / |
      /____/____/____/___/ |/|/|
Vancover  /    /    /   /| | | |
    /____/____/____/___/ |/|/|/|
  Q1|605 |825 | 14 |400| | | | /
T   |____|____|____|___|/|/|/|/ 
I Q2|    |    |    |   | | | /   supplier= "SUP2"
M   |____|____|____|___|/|/|/  ________________ 
E Q3|    |    |    |   | | /  /   /   /   /   /|
    |____|____|____|___|/|/  /___/___/___/___/ |
  Q4|    |    |    |   | /  /   /   /   /   /| |
    |____|____|____|___|/  /___/___/___/___/ |/|
      ent comp ph   sec   /   /   /   /   /| / |
         I T E M         /___/___/___/___/ |/|/|
                        /   /   /   /   /| | | |
                       /___/___/___/___/ |/|/|/|
                       |   |   |   |   | | | | /
                       |___|___|___|___|/|/|/|/ 
                       |   |   |   |   | | | /  supplier="SUP3"
                       |___|___|___|___|/|/|/  ________________ 
                       |   |   |   |   | | /  /   /   /   /   /|
                       |___|___|___|___|/|/  /___/___/___/___/ |
                       |   |   |   |   | /  /   /   /   /   /| |
                       |___|___|___|___|/  /___/___/___/___/ |/|
                                          /   /   /   /   /| / |
                                         /___/___/___/___/ |/|/|
                                        /   /   /   /   /| | | |
                                       /___/___/___/___/ |/|/|/|
                                       |   |   |   |   | | | | /
                                       |___|___|___|___|/|/|/|/ 
                                       |   |   |   |   | | | /  
                                       |___|___|___|___|/|/|/
                                       |   |   |   |   | | /  
                                       |___|___|___|___|/|/  
                                       |   |   |   |   | / 
                                       |___|___|___|___|/
                                                                       
                                                                          

Cubes are simply a visual representation to show
 how the data can be understood.


Multidimensional data is usually stored
  along with precomputed aggregates of high interest
  (sums, averages,...) to the knowledge worker using the
   data warehouse, such as sums, averages, etc.

These aggregates are "rolled up" and stored as part of the cube
  as follows:


                                                                          
                           Adding   Total Sales by location-time
                         ("rolling up" by summing along item)
                                                        /      :
           y         .======.======.======.=======.-------.   /:
          t  Chicago/      /      /      /      //       / | / :
         c         /  854 /  882 /  89  /  623 //  2448 /  |   :
                  /______/______/______/______//_______/   |   :
       N         /      /      /      /      //       /|   |  /:
      O New York/      /      /      /      //       / |  /| / :
     I         / 1087 /  968 /   38 /  872 //  2965 /  | / |   :
    T         /______/______/______/______//_______/   |/  |   :
   A         /      /      /      /      //       /|   /25 |  /:
  C  Toronto/      /      /      /      //       / |  /|95/| / :
 O         /  818 /  746 /   43 /  591 //  2198 /  | / | / |   :
L         /______/______/______/______//_______/   |/  |/  |   :
         /      /      /      /      //       /|   |31 |28 |  /
Vancouver      /      /      /      //       / |  /|20/|04/| /
       /      /      /      /      //       /  | / | / | / | 
  ITEM/_ent__/_comp_/_phone/_sec__//_______/   |/  |/  |/  |
      |      |      |      |      ||       |   |23 |31 |30 /
   Q1 |  605 |  825 |  14  | 400  || 1844  |  /|97/|29/|54/
      |      |      |      |      ||       | / | / | / | /
T     |______|______|______|______||_______|/  |/  |/  |/
I     |      |      |      |      ||       |   |25 |32 /
M  Q2 |  680 |  952 |  31  | 512  || 2175  |  /|21/|71/
E     |      |      |      |      ||       | / | / | /
      |______|______|______|______||_______|/  |/  |/
q     |      |      |      |      ||       |   |26 / 
t  Q3 |  812 | 1023 |  30  | 501  || 2366  |  /|85/
r     |      |      |      |      ||       | / | /
s     |______|______|______|______||_______|/  |/  
      |      |      |      |      ||       |   / 
   Q4 |  927 | 1038 |  38  | 580  || 2583  |  /   
      |      |      |      |      ||       | /
      |______|______|______|______||_______|/                                
                                                              






                                                                                 
                                  Total Sales by location-time
                                                       /         
           y           .======.======.======.=======.-------.
          t    Chicago/      /      /      /      //       / |
         c           /  854 /  882 /  89  /  623 //  2448 /  |
                    /______/______/______/______//_______/   |
       N           /      /      /      /      //       /|   |
      O   New York/      /      /      /      //       / |  /|
     I           / 1087 /  968 /   38 /  872 //  2965 /  | / |
    T           /______/______/______/______//_______/   |/  |
   A           /      /      /      /      //       /|   /25 |
  C    Toronto/      /      /      /      //       / |  /|95/|
 O           /  818 /  746 /   43 /  591 //  2198 /  | / | / |
L           /______/______/______/______//_______/   |/  |/  |
           /      /      /      /      //       /|   |31 |28 |
 Vancouver/      /      /      /      //       / |  /|20/|04/|
         /      /      /      /      //       /  | / | / | / |
ITEM->  /_ent__/_comp_/_phone/_sec__//_______/   |/  |/  |/  |
        |      |      |      |      ||       |   |23 |31 |30 /
     Q1 |  605 |  825 |  14  | 400  || 1844  |  /|97/|29/|54/
        |      |      |      |      ||       | / | / | / | /
        |______|______|______|______||_______|/  |/  |/  |/
        |      |      |      |      ||       |   |25 |32 /
     Q2 |  680 |  952 |  31  | 512  || 2175  |  /|21/|71/
        |      |      |      |      ||       | / | / | /
        |______|______|______|______||_______|/  |/  |/
        |      |      |      |      ||       |   |26 /
     Q3 |  812 | 1023 |  30  | 501  || 2366  |  /|85/
        |      |      |      |      ||       | / | /   
        |______|______|______|______||_______|/  |/
        |      |      |      |      ||       |   / 
     Q4 |  927 | 1038 |  38  | 580  || 2583  |  /
        |      |      |      |      ||       | /
        |======|======|======|======='-------' 
        |      |      |      |      |   /  
        | 3024 | 3838 |  113 | 1993 |  /    
        |      |      |      |      | /      
        |______|______|______|______|/        

Adding  Total Sales by location-item
("rolling up" by summing along time)






                                  Total Sales by location-time
                                                       /         
           y           .======.======.======.=======.-------.
          t    Chicago/      /      /      /      //       / |
         c           /  854 /  882 /  89  /  623 //  2448 /  |
                    /______/______/______/______//_______/   |
       N           /      /      /      /      //       /|   |
      O   New York/      /      /      /      //       / |  /|
     I           / 1087 /  968 /   38 /  872 //  2965 /  | / |
    T           /______/______/______/______//_______/   |/  |
   A           /      /      /      /      //       /|   /25 |
  C    Toronto/      /      /      /      //       / |  /|95/|
 O           /  818 /  746 /   43 /  591 //  2198 /  | / | / |
L           /______/______/______/______//_______/   |/  |/  |
           /      /      /      /      //       /|   |31 |28 |
 Vancouver/      /      /      /      //       / |  /|20/|04/|
         /      /      /      /      //       /  | / | / | / |
 ---->  /_ent__/_comp_/_phone/_sec__//_______/   |/  |/  |/  |
        |      |      |      |      ||       |   |23 |22 |30/|
     Q1 |  605 |  825 |  14  | 400  || 1834  |  /|67/|29/|54/|
        |      |      |      |      ||       | / | / | / |// |
        |______|______|______|______||_______|/  |/  |/  |/  |
        |      |      |      |      ||       |   |25 |32//218/
     Q2 |  680 |  952 |  31  | 512  || 2175  |  /|21/|71/|02/
        |      |      |      |      ||       | / | / |// | /
        |______|______|______|______||_______|/  |/  |/  |/
        |      |      |      |      ||       |   |26//312/
     Q3 |  812 | 1023 |  30  | 501  || 2366  |  /|85/|54/
        |      |      |      |      ||       | / |// | / 
        |______|______|______|______||_______|/  |/  |/
        |      |      |      |      ||       |  //187/
     Q4 |  927 | 1038 |  38  | 580  || 2583  | //|69/
        |      |      |      |      ||_______|// | /
        |======|======|======|======||=======|/  |/
        |      |      |      |      |||      |   /    
----->  | 3024 | 3838 | 113  | 1993 ||| 8968 |  /      
        |      |      |      |      |||      | /        
        |______|______|______|______|||__/___|/         
                                        /
Total Sales by location-item     Adding Tot-sales by location
                                ("rolling up" by summing time) 
                                (equiv, rolling right on item)









                    Adding   Total Sales by item-time
              ("rolling up" by summing along location)
                            __|______|______|______|____
                           /  |   /  |   /  |   /  |   /|
                          /      /      /      /      / | 
                         / 3364 / 3421 /  184 / 2486 /  |  
                        /      /      /      /      /   |   
           y           /======/======/======/=======.--------
          t    Chicago/      /      /      /      //       / |
         c           /  854 /  882 /  89  /  623 //  2448 /  |
                    /______/______/______/______//_______/ e |
       N           /      /      /      /      //       /|m  |
      O   New York/      /      /      /      //       / i  /|
     I           / 1087 /  968 /   38 /  872 //  2965 / t| / |
    T           /______/______/______/______//_______/   |/  |
   A           /      /      /      /      //       /|c  /25 |
  C    Toronto/      /      /      /      //       / o  /|95/|
 O           /  818 /  746 /   43 /  591 //  2198 / l| / | / |
L           /______/______/______/______//_______/   |/  |/  |
           /      /      /      /      //       /|s  |31 |28 |
 Vancouver/      /      /      /      //       / e  /|20/|04/|
         /      /      /      /      //       / l| / | / | / |
 ---->  /_ent__/_comp_/_phone/_sec__//_______/ a |/  |/  |/  |
        |      |      |      |      ||       |s  |23 |31 |30 |
     Q1 |  605 |  825 |  14  | 400  || 1844  |  /|97/|29/|54/|
        |      |      |      |      ||       | / | / | / | / |
        |______|______|______|______||_______|/  |/  |/  |/  |
        |      |      |      |      ||       |   |25 |32 /218/
     Q2 |  680 |  952 |  31  | 512  || 2175  |  /|21/|71/|02/
        |      |      |      |      ||       | / | / | / | /
        |______|______|______|______||_______|/  |/  |/  |/
        |      |      |      |      ||       |   |26//312/
     Q3 |  812 | 1023 |  30  | 501  || 2366  |  /|85/|54/
        |      |      |      |      ||       | / | / | /
        |______|______|______|______||_______|/  |/  |/
        |      |      |      |      ||       |   /187/
     Q4 |  927 | 1038 |  38  | 580  || 2583  |  /|69/
        |      |      |      |      ||       | / | /
        |======|======|======|======||=======|/  |/
        |      |      |      |      ||       |   /  
        | 3024 | 3838 | 113  | 1993 ||  8968 |  /    
        |      |      |      |      ||       | /      
        |___/__|_____/|___/__|_/____||___/___|/        
           /        /    /    /         /
Total Sales by location/item-'   Tot-sales by location









                                             Total sales by time
                                             (roll up by item)
                          ________________________________|____
                         /  Total sales/by item-time //   |   /|
                        /      /      /      /      //       / |
                       / 3364 / 3421 /  184 / 2486 //  9455 /  |
                      /      /      /      /      //       /   |
           y         /======/======/======/======//=======/|   |
          t  Chicago/      /      /      /      //       /||  /|
         c         /  854 /  882 /  89  /  623 //  2448 / || / |
                  /______/______/______/______//_______/ e||/  |
       N         /      /      /      /      //       /|m ||102|
      O New York/      /      /      /      //       / i  ||87/|
     I         / 1087 /  968 /   38 /  872 //  2965 / t| /|| / |
    T         /______/______/______/______//_______/   |/ ||/  |
   A         /      /      /      /      //       /|c  /25||108|
  C  Toronto/      /      /      /      //       / o  /|95||20/|
 O         /  818 /  746 /   43 /  591 //  2198 / l| / | /|| / |
L         /______/______/______/______//_______/   |/  |/ ||/  |
         /      /      /      /      //       /|s  |31 |28||115|
Vancouver      /      /      /      //       / e  /|20/|04||93/ 
       /      /      /      /      //       / l| / | / | /|| /  
      /_ent__/_comp_/_phone/_sec__//_______/ a |/  |/  |/ ||/   
      |      |      |      |      ||       |s  |23 |31 |30||    
   Q1 |  605 |  825 |  14  | 400  || 1844  |  /|97/|29/|54||   
      |      |      |      |      ||       | / | / | / | /||  
      |______|______|______|______||_______|/  |/  |/  |/ || 
      |      |      |      |      ||       |   |25 |32 /218/
   Q2 |  680 |  952 |  31  | 512  || 2175  |  /|21/|71/|02/
      |      |      |      |      ||       | / | / | / | /
      |______|______|______|______||_______|/  |/  |/  |/
      |      |      |      |      ||       |   |26//312/
   Q3 |  812 | 1023 |  30  | 501  || 2366  |  /|85/|54/
      |      |      |      |      ||       | / | / | /        
      |______|______|______|______||_______|/  |/  |/
      |      |      |      |      ||       |   /187/
   Q4 |  927 | 1038 |  38  | 580  || 2583  |  /|69/
      |      |      |      |      ||       | / | /
      |======|======|======|======||=======|/  |/
      |      |      |      |      ||       |   /             
      | 3024 | 3838 | 113  | 1993 ||  8968 |  /             
      |      |      |      |      ||       | /             
      |___/__|_____/|___/__|_/____||___/___|/         
         /        /    /    /         /
Total Sales by location/item   Tot-sales by location










                        ____________________________   .-------.
                       /  Total sales/by item-time /| /       /|
                      /      /      /      /      / |/       / |
                     / 3364 / 3421 /  184 / 2486 /  /       /  |
                    /      /      /      /      /  /_______/   |
            y      |------|------|------|------/   |       |   |
           t       ____________________________  /=======. |   |
          i Chicago      /      /      /      /|/       /| |  /|
         c       /  854 /  882 /  89  /  623 / /  2448 / | | / |
                /______/______/______/______/ /_______/  | |/  |
       N       /      /      /      /      / /       /|  | |102|
      O New York     /      /      /      / /       / |  | |87/|
     I       / 1087 /  968 /   38 /  872 / /  2965 /  | /| | / |
    T       /______/______/______/______/ /_______/   |/ | |/  |
   A       /      /      /      /      / /       /|   /25| |108|
  C Toronto      /      /      /      / /       / |  /|95| |20/|
 O       /  818 /  746 /   43 /  591 / /  2198 /  | / | /| | / |
L       /______/______/______/______/ /_______/   |/  |/ | |/  |  __
       /      /      /      /      / /       /|  /|31 |28| |115|   /|
 Vancouver   /      /      /      / /       / |  /|20/|04| |93/   / |
    /       /      /      /      / /       /  | / | / | /| | /   /  |
   /_ent__ /_comp_/_phone/_sec__/ /_______/   |/  |/  |/ |_|/   /   |
   |      |      |      |      | |       |   /|23 |31 |305   __/421 |
Q1 |  605 |  825 |  14  | 400  | | 1844  |  / |97/|29/|4| -.  | 55  /
   |      |      |      |      | |       | /  | / | / |/  /|  |    /
   |______|______|______|______| |_______|/   |/  |/  /  / |  |   /^
   |      |      |      |      | |       |   /|25 |327  /  |  |  / |
Q2 |  680 |  952 |  31  | 512  | | 2175  |  / |21/|1/ _/   |  | /  Grand
   |      |      |      |      | |       | /  | / |/  /|218/ _|/  Total
   |______|______|______|______| |_______|/   |/  /  / |02/      Sales
   |      |      |      |      | |       |   /|268  /  | /        rollup time
Q3 |  812 | 1023 |  30  | 501  | | 2366  |  / |5/ _/   |/         rollup loc
   |      |      |      |      | |       | /  |/  /|312/          rollup item
   |______|______|______|______| |_______|/   /  / |54/
   |      |      |      |      | |       |   /  /  | /
Q4 |  927 | 1038 |  38  | 580  | | 2583  |  / _/   |/
   |      |      |      |      |/|       | /  /|187/
   `------'------'------'------' `--------'  / |69/
    /                           /  /        /  | /    
   /______/______/______/______/  /_______ /   |/
   |      |      |      |      |  |       |    /             
   | 3024 | 3838 | 113  | 1993 |  |  8968 |   /
   |      |      |      |      |  |       |  /             
   |______|______|______|______|/ |       | /         
                                  `-------- 
      /        /    /    /         /
                                            Total sales by time

Total Sales by location/item   Tot-sales by location






  - Of course other aggregates can be added as well.

  - Every aggregate added can increase the processing time when
     new data arrives
    (in this case, new quarter, new location, new items)






                        _____________________________________
                       /  Total sales/by item-time //by time/|
                      /   v  /      /      /  v   //  v v  / |
                     / 3364 / 3421 /  184 / 2486 //  9455 /  | 
                    /      /      /      /      //       /   |
        y          /======/======/======/======//=======/|   |
       t   Chicago/      /      /      /      //       /||  /|
      c          /  854 /  882 /  89  /  623 //  2448 / || / |
                /______/______/______/______//_______/ e||/  |
    N          /      /      /      /      //       /|m ||102|
   O  New York/      /      /      /      //       / i  ||87/|
  I          / 1087 /  968 /   38 /  872 //  2965 / t| /|| / |
 T          /______/______/______/______//_______/   |/ ||/  |
A          /      /      /      /      //       /|c  /25||108|
C  Toronto/      /      /      /      //       / o  /|95||20/|
O        /  818 /  746 /   43 /  591 //  2198 / l| / | /|| / |
L       /______/______/______/______//_______/   |/  |/ ||/  |_
       /      /      /      /      //       /|s  |31 |28||115|/|
Vancouver    /      /      /      //       / e  /|20/|04||93// |
     /      /      /      /      //       / l| / | / | /|| //  |
    /_ent__/_comp_/_phone/_sec__//_______/ a |/  |/  |/ ||//   |
    |      |      |      |      ||       |s  |23 |31 |30||/421/
 Q1 |  605 |  825 |  14  | 400  || 1844  |  /|97/|29/|54|| 55/l
    |      |      |      |      ||       | / | / | / | /||  /a
    |______|______|______|______||_______|/  |/  |/  |/ || /t 
    |      |      |      |      ||       |   |25 |32 /218|/o 
 Q2 |  680 |  952 |  31  | 512  || 2175  |  /|21/|71/|02/ T 
    |      |      |      |      ||       | / | / | / | /   
    |______|______|______|______||_______|/  |/  |/  |/   
    |      |      |      |      ||       |   |26//312/    
 Q3 |  812 | 1023 |  30  | 501  || 2366  |  /|85/|54/    
    |      |      |      |      ||       | / | / | /    
    |______|______|______|______||_______|/  |/  |/    
    |      |      |      |      ||       |   /187/    
 Q4 |  927 | 1038 |  38  | 580  || 2583  |  /|69/    
    |      |      |      |      ||       | / | /
    |======|======|======|======||=======|/  |/
    |      |      |      |      ||       |   /               
    | 3024 | 3838 | 113  | 1993 ||  8968 |  /               
    |  total sales by loc-item  ||by loc | /               
    |______|______|______|______||_______|/         
    |======|======|======|======||=======      
    |      |      |      |      ||  /                     
    | 756  | 960  |  28  | 498  || /                     
    |______|______|______|______||/         
       /      /      /      / 
    Average Sales by location-item
     ("rolling up" by averaging along time)



                           Maximum Sales by location-time
                    ("rolling up" by taking Maximum along item)
                                                        /       
           y         .======.======.======.=======.-------.     
          t  Chicago/      /      /      /      //       / |    
         c         /  854 /  882 /  89  /  623 //  882  /  |    
                  /______/______/______/______//_______/   |    
       N         /      /      /      /      //       /|   |    
      O New York/      /      /      /      //       / |  /|    
     I         / 1087 /  968 /   38 /  872 //  1087 /  | / |    
    T         /______/______/______/______//_______/   |/  |    
   A         /      /      /      /      //       /|   /943|    
  C  Toronto/      /      /      /      //       / |  /|  /|    
 O         /  818 /  746 /   43 /  591 //  818  /  | / | / |    
L         /______/______/______/______//_______/   |/  |/  |    
         /      /      /      /      //       /|   |113|103|   
Vancouver      /      /      /      //       / |  /| 0/| 2/|  
       /      /      /      /      //       /  | / | / | / | 
  ITEM/_ent__/_comp_/_phone/_sec__//_______/   |/  |/  |/  |
      |      |      |      |      ||       |   |894|104|112/
   Q1 |  605 |  825 |  14  | 400  ||  825  |  /|  /| 8/| 9/
      |      |      |      |      ||       | / | / | / | /
T     |______|______|______|______||_______|/  |/  |/  |/
I     |      |      |      |      ||       |   |940|114/
M  Q2 |  680 |  952 |  31  | 512  ||  952  |  /|  /| 2/
E     |      |      |      |      ||       | / | / | /
      |______|______|______|______||_______|/  |/  |/
q     |      |      |      |      ||       |   |978/ 
t  Q3 |  812 | 1023 |  30  | 501  || 1023  |  /|  /
r     |      |      |      |      ||       | / | /
s     |______|______|______|______||_______|/  |/  
      |      |      |      |      ||       |   / 
   Q4 |  927 | 1038 |  38  | 580  || 1038  |  /   
      |      |      |      |      ||       | /
      |______|______|______|______||_______|/                                
                                                              



                           Minimum Sales by location-time
                  ("rolling up" by taking minimums along item)
                                                        /       
           y         .======.======.======.=======.-------.     
          t  Chicago/      /      /      /      //       / |    
         c         /  854 /  882 /  89  /  623 //   89  /  |    
                  /______/______/______/______//_______/   |    
       N         /      /      /      /      //       /|   |    
      O New York/      /      /      /      //       / |  /|    
     I         / 1087 /  968 /   38 /  872 //   38  /  | / |    
    T         /______/______/______/______//_______/   |/  |    
   A         /      /      /      /      //       /|   /64 |    
  C  Toronto/      /      /      /      //       / |  /|  /|    
 O         /  818 /  746 /   43 /  591 //   43  /  | / | / |    
L         /______/______/______/______//_______/   |/  |/  |    
         /      /      /      /      //       /|   |41 |59 |   
Vancouver      /      /      /      //       / |  /|  /|  /|  
       /      /      /      /      //       /  | / | / | / | 
  ITEM/_ent__/_comp_/_phone/_sec__//_______/   |/  |/  |/  |
      |      |      |      |      ||       |   |52 |45 |63 /
   Q1 |  605 |  825 |  14  | 400  ||   14  |  /|  /|  /|  /
      |      |      |      |      ||       | / | / | / | /
T     |______|______|______|______||_______|/  |/  |/  |/
I     |      |      |      |      ||       |   |58 |54 /
M  Q2 |  680 |  952 |  31  | 512  ||   31  |  /|  /|  /
E     |      |      |      |      ||       | / | / | /
      |______|______|______|______||_______|/  |/  |/
q     |      |      |      |      ||       |   |59 / 
t  Q3 |  812 | 1023 |  30  | 501  ||   30  |  /|  /
r     |      |      |      |      ||       | / | /
s     |______|______|______|______||_______|/  |/  
      |      |      |      |      ||       |   / 
   Q4 |  927 | 1038 |  38  | 580  ||   38  |  /   
      |      |      |      |      ||       | /
      |______|______|______|______||_______|/                                
                                                              



Etc.

So there are nearly an infinity of different rollups
that could be performed automatically (as preprocessing
so that the results are already computed when needed),
however it is not feasible to precompute all of them.

Therefore the choice of which rollups to precompute
is an important tuning parameter of a data warehouse.



                                                                                 


Stars, Snowflakes, and Constellations
   Model Schemas for multidimensional data

                                                                                 


Stars, Snowflakes, and Constellations
   Model Schemas for multidimensional data




Large central Fact table and a set of smaller dimension tables,

  - Sales example (fact=sale; dimensions=time,branch,item,location)           


                        STAR MODEL SCHEMA
                        =================

DIM_TBL: Time
.----.----.---.--.---.--.
|tkey| day|dow|mo|qtr|yr|
|----|----|---|--|---|--|
|   1|  24|mon| 7|  3|00|
|   2|  31|thu| 8|  3|00|
  .                                           DIM_TBL: Item  
  .                                          .--.---.---.---.----.
                                             |ik|inm|brd|typ|styp|
                                             |--|---|---|---|----|
            FACT TABLE: Sales                |41|jug|dow|gal|good|
           .----.----.----.----.----.----.   |67|rug|per|bth|good|
           |tkey|ikey|bkey|lkey|$sld|#sld|   .  
           |----|----|----|----|----|----|   .         
           |   2| d67| nw1| nd8| 349|   5|             
           |   1| a41| nw1| nd8|1426|  10|               
           |   1| a41| nw1| mn1|  98|   1|    DIM_TBL: Location
           |   1| d67| se3| mn1|   7| 276|   .---.---.---.--.---.
           |   2| d67| nw1| mn1| 692|   2|   |lk |str|cty|st|ctry
               .                             |---|---|---|--|---|
               .                             |mn1|elm|dl |MN|USA|
                                             |nd8|oak|mot|ND|USA|
                                               .           
DIM_TBL: Branch                                .
.----.---.----.
|bkey|bnm|btyp|
|nw1 |icy| whs|
|se3 |hot|stor|
  .
  .






                          SNOWFLAKE SCHEMA
                          ================

   - variant of star:

     - the dim tables are normalized
       (e.g., Location is split into Location and City)

   - reduces redundancy
   - better design


DIM_TBL: Time
.----.----.---.--.---.--.
|tkey| day|dow|mo|qtr|yr|
|----|----|---|--|---|--|
|   1|  24|mon| 7|  3|00|
|   2|  31|thu| 8|  3|00|
                                            DIM_TBL: Item  
                                           .--.---.---.---.----.
                                           |ik|inm|brd|typ|stky|
                                           |--|---|---|---|----|
            FACT TABLE: Sales              |41|jug|dow|gal|good|
           .----.----.----.----.----.----. |67|rug|per|bth|good|
           |tkey|ikey|bkey|lkey|$sld|#sld|      
           |----|----|----|----|----|----|             
           |   2| d67| nw1| nd8| 349|   5|             
           |   1| a41| nw1| nd8|1426|  10|               
           |   1| a41| nw1| mn1|  98|   1|  DIM_TBL: Location
           |   1| d67| se3| mn1|   7| 276| .---.---.---.
           |   2| d67| nw1| mn1| 692|   2| |lk |str|cky|
                                           |---|---|---|
                                           |mn1|elm|A  |
                                           |nd8|oak|B  |
                                                        
                                              
                                                    DTBL:Cty
                                               .--.---.--.----.
                                               |ck|cty|st|ctry|
                                               |--|---|--|----|
                                               | A|dl |MN|USA |
                                               | B|mot|ND|USA |
DIM_TBL: Branch
.----.---.----.
|bkey|bnm|btyp|
|----|---|----|
|nw1 |icy| whs|
|se3 |hot|stor|








                   GALAXY SCHEMA or FACT CONSTELLATION
                   ===================================


Multiple fact tbls share dimension tables (e.g., Sales and Shipments)


                                                      DTBL: supp
                                                       .--.----.
                                  DIM_TBL: Item        |sk|styp|
                                 .---.----.---.---.--. |--|----|
                                 |ik |inam|brd|typ|sk| | 6|good|
                                 |---|----|---|---|--| | 7|fair
                                 |a41|jug |dow|gal| 6|
                                 |d67|rug |per|bth| 6|
                              
DIM_TBL: Time                
.--.----.---.--.q.--.       
|tk| day|dow|mo|t|yr|      
|--|----|---|--|r|--|     
| 1|  24|mon| 7|3|00|  FACT TABLE: Sales
| 2|  31|thu| 8|3|00| .--.---.---.---.----.----.
  .                   |tk|ik |bk |lk |$sld|#sld|                       
  .                   |--|---|---|---|----|----|                      
                      | 2|d67|nw1|nd8| 349|   5|                 
                      | 1|a41|nw1|nd8|1426|  10|                 
DIM_TBL: Branch       | 1|a41|nw1|mn1|  98|   1| DTBL: Loc    
.--.---.----.         | 1|d67|se3|mn1|   7| 276| .---.---.--.
|bk|bnm|btyp|         | 2|d67|nw1|mn1| 692|   2| |lk |str|ck|
|--|---|----|                                    |---|---|--|
nw1|icy| whs|                                    |mn1|elm| A|
se3|hot|stor|                                    |nd8|oak| B|

              FACT_TBL: Shipment                   DTBL:Cty
             .---.--.--.--.--.-----.------.    .--.---.--.----.
             |ik |tk|sk|fl|tl|$cost|#shipd|    |ck|cty|st|ctry|
             |---|--|--|--|--|-----|------|    |--|---|--|----|
             |d67| 2| 6|dl|fg|   45|    31|    | A|dl |MN|USA |
             |d67| 1| 6|fg|mt|   87|     9|    | B|mot|ND|USA |
                .                          
                .                                                          
                                                                                        






Examples for Defining Star Schema:
                                                   __________
                                                  //by time /|
                                                 //        / |
                                                //  9455  /  |
                                               //________/   |
      y           .---------------------------//--------.|   |
     t    Chicago/      /      /      /      //       / ||  /|
    c           /  854 /  882 /  89  /  623 //  2448 /  || / |
               /______/______/______/______//_______/ e ||/  |
  C           /      /      /      /      //       /|m  ||102|
 O   New York/      /      /      /      //       / i  /||87/|
L           / 1087 /  968 /   38 /  872 //  2965 / t| / || / |
           /______/______/______/______//_______/   |/  ||/  |
          /      /      /      /      //       /|c  /25 ||108|
  Toronto/      /      /      /      //       / o  /|95 ||20/|
        /  818 /  746 /   43 /  591 //  2198 / l| / | / || / |
       /______/______/______/______//_______/   |/  |/  ||/  |_
      /      /      /      /      //       /|s  |31 |28 ||115|/|
Vancouver   /      /      /      //       / e  /|20/|04/||93// |
    /      /      /      /      //       / l| / | / | / || //  |
   /_ent__/_comp_/_phone/_sec__//_______/ a |/  |/  |/  ||//   |
   |      |      |      |      ||       |s  |23 |31 |30 |//421/l
Q1 |  605 |  825 |  14  | 400  || 1844  |  /|97/|29/|54//||55/a
   |      |      |      |      ||       | / | / | / | // || /t
   |______|______|______|______||_______|/  |/  |/  |//  ||/o 
   |      |      |      |      ||       |   |25 |32 //218||T 
Q2 |  680 |  952 |  31  | 512  || 2175  |  /|21/|71//|02/   
   |      |      |      |      ||       | / | / | // | /   
   |______|______|______|______||_______|/  |/  |//  |/   
   |      |      |      |      ||       |   |26 //312/    
Q3 |  812 | 1023 |  30  | 501  || 2366  |  /|85// 54/    
   |      |      |      |      ||       | / | //|  /    
   |______|______|______|______||_______|/  |// | /    
   |      |      |      |      ||       |   //187/    
Q4 |  927 | 1038 |  38  | 580  || 2583  |  /| 69/    
   |      |      |      |      ||       | //|  /
   |______|______|______|______||_______|// | /
                               ||-------|/  |/               
                               ||  8968 |   /               
                               ||by loc |  /               
                               ||       | /         
                               ||_______|/    






A star schema for this cube might be:

                                              DIM_TBL: Item        
                                        .--.----.---.---.--.  
                                        |ik|inam|brd|typ|st|     
                                        |--|----|---|---|--|   
                                        |e |ent |ACM|big|gd|    
                                        |c |comp|BRS|slo|gd|    
                                        |p |phon|ATT|bad|bd|  
                                        |s |sec |KIL|mid|gd| 

                                        DTBL: Loc    
                                        .--.---.---.--.-----.
                                        |lk|str|cty|st|total|
                                        |--|---|---|--|-----|
                                        |v |elm|van|BC| 8968|
                                        |t |oak|tor|OT|18769|
                                        |n | 45| NY|NY|31254|
                                        |c |112|chi|IL|21802|

DIM_TBL: Time         GRD_TOTAL
.--.---.--.------.    .-------.
|tk|day|mo| total|    |__42155|
|--|---|--|------|
| 1| 24|ma|  9455|    FACT_TBL: Sales
| 2| 31|ju| 10287|   .--.--.--.--.------------.
| 3| 15|au| 10820|   |tk|lk|ik|bk|Dollars_sold|
| 4| 04|de| 11593|   |--|--|--|--|------------|
                     | 1|v |e |w1|       605  |              
                     | 1|t |e |w1|       818  |              
DIM_TBL: Branch      | 1|n |e |w1|      1087  |
.--.---.----.        | 1|c |e |w1|       854  |
|bk|bnm|btyp|        | 2|v |e |w1|       680  |
|--|---|----|        | 2|t |e |w1|       894  |
|w1|WES| whs|        | 2|n |e |w1|      1130  |   
                     | 2|c |e |w1|       943  | 
                     | 3|v |e |w1|       812  |
                     | 3|t |e |w1|       940  | 
DIM: LOC_TIME        | 3|n |e |w1|      1034  |
.-------------.      | 3|c |e |w1|      1032  |
|tk|lk|Total  |      | 4|v |e |w1|       927  |
|--|--|-------|      | 4|t |e |w1|       978  |
| 1|v | 1844  |      | 4|n |e |w1|      1142  |
| 1|t | 2198  |      | 4|c |e |w1|      1129  |
| 1|n | 2965  |      | 1|v |c |w1|       825  |
| 1|c | 2448  |      | 1|t |c |w1|       746  |
| 2|v | 2175  |      | 1|n |c |w1|       968  |
| 2|t | 2397  |      | 1|c |c |w1|       882  |
| 2|n | 3120  |      | 2|v |c |w1|       952  |
| 2|c | 2595  |      | 2|t |c |w1|       769  |
| 3|v | 2366  |      | 2|n |c |w1|      1024  |
| 3|t | 2521  |      | 2|c |c |w1|       890  |
| 3|n | 3129  |      | 3|v |c |w1|      1023  |
| 3|c | 2804  |      | 3|t |c |w1|       795  |
| 4|v | 2583  |      | 3|n |c |w1|      1048  |
| 4|t | 2685  |      | 3|c |c |w1|       924  |
| 4|n | 3271  |      | 4|v |c |w1|      1038  |
| 4|c | 3054  |      | 4|t |c |w1|       864  |
                     | 4|n |c |w1|      1091  |
                     | 4|c |c |w1|       992  |
                     | 1|v |p |w1|        14  |
                     | 1|t |p |w1|        43  |
                     | 1|n |p |w1|        38  |
                     | 1|c |p |w1|        89  |
                       .
                       .
                       .

****************************************************************