The goals of this course include
to initiate graduate student's into data and database systems research and
to enhance graduate student's presentation skills of their research.
The GRADUATE SCHOOL, COLLEGE, TECHNICAL SCHOOL RELATIONSHIP
Universities, by definition, integrate research, teaching and service.
The Graduate school at a University has the primary responsibility for research.
A College has the primary responsibility for teaching.
A Vocational, Technical and Professional School has primary responsibility
for training in the use of specific existing tools of a trade, area or profession.
Don't be frightened away from research.
Even though 765 may be in your first graduate course, you have probably already
been doing a lot of research, so it won't be entirely new to you.
Research is just another word for active learning.
There is really little difference between active learning and research,
sometimes with the slight difference that,
early on, most concepts that you research
have been pre-researched by others, while,
later on, most concepts that you research
have not been pre-research by others
In both cases,
the instructor
advisor or guides,
the student
masters context, background and language of the area, and
developes new or improved solutions to questions and problems.
A good researcher always takes the point of view that
there is almost always a better way to do anything.
A good researcher questions the prevailing methods and challenge the
current practices in an attempt to find a better way.
I like to call it finding a new, killer idea -and then
taking the responsibility to prove that it is killer.
Additional reference material on most topics in this course can be found
on the web by doing a google (or Yahoo or Ask) search on the
appropriate keyword(s) and also by using the NDSU library.
The topics covered in this course will include:
Horizontal Data
Vertical Data
Query Processing
Data Mining
Transaction Processing
Recovery
Normalization
Distributed Database Systems
Security
INTRODUCTION
CENTRALITY OF DATA
Data are central to every computer program.
If no data, there is
no input,
no output,
no constant,
no variable.
It is hard to imagine a program in which there is no data?
Therefore, virtually all programs are data management programs and
therefore, virtually all computing involves data management.
However, not the all data in computer programs is RESIDUALIZED
RESIDUALIZED data is data stored and managed after the
termination of the program that generated it (for reuse later).
Database Management Systems (DBMSs) store and manage residualized data.
WHAT ARE SOME OF THE MAIN PROBLEMS WITH DATA
1. HUGE VOLUME EVERYONE HAS LOTS OF DATA TODAY!
Data are collected much faster than it can be process or managed.
NASA's Earth Observation System (EOS), alone, has collected over
15 petabytes of data already (15,000,000,000,000,000 bytes).
Most of it will never be use - ever!
Most of it will never be seen - ever!
Why not?
There's so much volume,
usefulness of some of it will never be discovered
SOLUTION: Reduce the volume and/or raise the information content thru
structuring, querying, filtering, mining, summarizing, aggregating...
That is the main task of Data and Database workers today!
Claude Shannon's info theory principle comes into play here:
More volume means less information.
^ . decision (1 bit? y/n)
| / \
| / \ decision support
VALUE / \
OF / \ knowledge
DATA / \
(INFO / \ information
LEVEL) / \
| / \ structured data
| / \
/___________________\ raw data
<---VOLUME of DATA--->
This is a simple example to illustrate the tradeoff
between information level and volume level:
For example, the following 2 phone books are constructed from the same
2 names (Steve, Dan) and 2 numbers (3479, 5683).
Which phone book is the more useful?
Which provides the more information?
BOOK-1 BOOK-2
------------- -------------
Name Phone Name Phone
Steve 3479 Steve 3479
Dan 5683 Dan 5683
Steve 5683
Dan 3479
The 2nd one has more volume but provides zero information.
REDUCING VOLUME WHILE
INCREASING INFORMATION CONTENT
may be the most important task in data and database system work today!
Another problem which needs attention is:
2. LACK OF STRUCTURE THERE'S TOO MUCH RAW, UNSTRUCTURED DATA
Previous example revisited:
S4tD59a3ve7sen68
is the previous phone data in totally raw, unstructured,
non-redundant form and it is useless. It needs structure.
How can data be structured?
HORIZONTAL STRUCTURING OF DATA:
Form it into HORIZONTAL "records" of "fields".
e.g., Employee records consisting of the horizontal concatenation of data fields
SSN, LastName, MiddleIntital, FirstName, Address, Gender, OfficeNumber, Salary
These horizontally structure records are then stored in a file (EMPLOYEE file?)
One can order the records in those files by sorting the records (clustering)
and/or provide meta-structures for quick access (pointers, indexes, hashs,..)
A Very important principle:
"Choose the sort attribute wisely!"
You have one chance to speed up access by sorting (clustering)
Seems obvious but it isn't to everyone. E.g.,
Casey Stengel (former manager of the New York Yankees) once said
"OK, you guys! Line up alphabetically according to your height!".
That cannot be done, in general
(unless the alphabetic and height orderings happens to be identical,
in which case the height and name are called "fully correlated" attributes.
Such correlations constitute important information!)
DATA MINING can be thought of as the process of finding correlations
relationships and other patterns in data.
Since you get just one physical ordering,
other ordering requires a "meta"-structure (e.g., an index).
Previous example re-revisited:
The phone book can be physically order by
PHONE NUMBER or by NAME
Name Phone Name Phone
---- ----- ---- -----
Steve 3479 Dan 5683
Dan 5683 Steve 3479
Most phone books are ordered on NAME,
If you need fast access by PHONE NUMBER,
you can build an index for it (which is an auxilliary file ordered on Phone#):
NAME ORDERED PHONE BOOK ( RID = Record IDentifier )
RID Name Phone
--- ----- ----
1 Dan 5683
2 Steve 3479
PHONE NUMBER INDEX
Phone RID
---- ---
3479 2
5679 1
STRUCTURING DATA
may be the second most important task in data and database system work today!
Another important task to be addressed in data systems work today is:
3. RESIDUALIZATION OF DATA
MUCH WELL-STRUCTURED DATA IS DISCARDED PREMATURELY
Databases are about storing data persistently, for later use.
age of infinite storage
RESIDUALIZING DATA
may be the third most important task in data and database system work today!
WHAT IS A DATABASE?
An integrated shared repository of operational data of interest to an enterprise
INTEGRATED: it must be the unification of several distinct files
SHARED: same data can be used by more than 1 user (concurrently)
REPOSITORY: implies "persistence".
OPERATIONAL DATA: data on accounts, parts, patients, students,
employees, genes, stock, pixels,...
Non-operational: I/O data, transient data in buffers,
queues...
ENTERPRISE: bank, warehouse, hospital, school, corp, gov agency, person..
WHAT IS A DATABASE MANAGEMENT SYSTEM (DBMS)
A program which organizes and manages access to residual data
Databases also contains METADATA (data on the data).
Metadata is non-user data which contains the descriptive information
about the data and database organization (Catalog data)
WHY USE A DATABASE?
COMPACTNESS (saves space - no paper files necessary)
EASE OF USE (less drudgery, more of the organizational and search
work done by the system; user specifies what, not how)
CENTRALIZED CONTROL (by DB Administrator (DBA) and by the CEO)
REDUCES REDUNDANCY (1 copy is enough, but then
concurrent use must be controlled)
NO INCONSISTENCIES (again, since there is only 1 copy necessary)
ENFORCE STANDARDS (corporate, dept, industry, national, international)
INTEGRITY CONSTRAINTS (automatically maintained)
(e.g., GENDER=male => MAIDEN_NAME=null)
BALANCE REQUIREMENTS (even conflicting requirements? DataBase Administrator
(DBA) can optimize for the whole company)
DATA INDEPENDENCE (occurs because applications are immune to
storage structure and access strategy changes.
Can change the storage structure without changing
the access programs and vice versa)
Almost all commerical databases today are HORIZONTAL
Data is formed into files of horizontal records of a common type.
So we will start by considering Horizontal databases.
HORIZONTAL DATA TERMINOLOGY
| stored (physical, on disk) |
FIELDS, RECORDS, FILES | logical (as viewed by user) |
| type (e.g., datatype) |
FIELDS, RECORDS, FILES | occurrences (instances) |
"TYPE" and "OCCURRENCE" of files, records and fields (stored and logical).
TYPE: defines structure and expected contents
(time-independent - changes only upon DB reorganization)
OCCURRENCE: actual data instances at a given time (time-dependent - changes
with every insert/delete/update)
STORED FIELD = smallest unit of stored data
.-----.
e.g., |Jones| is a "name" field occurrence
`-----' ==========
Name: Char 25 might be the metadata type of that occurrence.
====
STORED HORIZONTAL RECORD = named collection of related stored fields.
.-----.----.----------.-----.--.-----.
e.g., |Jones|John|412 Elm St|Fargo|ND|58102|
`-----'----'----------'-----'--'-----'
Employee: Lname(char 25), Fname(char 15), Address(char 20),
City(char 15), St(char 2), Zip(char 5)
might be the metadata type of that record.
STORED FILE = named collection of all occurrences of 1 type of stored record
|LNAME|FNAME| ADDRESS |CITY |ST| ZIP | < --- "employee file" type
|=====|=====|==========|=====|==|=====|
|Jones|John |412 Elm St|Fargo|ND|58102| < ---."employee file" occurrences
|-----|-----|----------|-----|--|-----| |
|Smith|James|415 Oak St|Mhd |MN|56560| < ---|
|-----|-----|----------|-----|--|-----| |
|Thom |Bob |12 Main St|Mhd |MN|56560| < ---|
|-----|-----|----------|-----|--|-----| |
|Trath|Phil |345 12thSt|Fargo|ND|58105| < ---'
`-----'-----'----------'-----'--'-----'
The "employee file" type is the "employee record" type
(+ possibly, some other type characteristics such as max-#-records..)
In todays storage device world, there is still really only linear storage space.
Therefore the 2-dimensional picture of a stored file is just a picture and
strictly speaking, not possible in physical storage media.
The more accurately description of the above store file
(as a linear sequence of stored records) is:
|Jones|John |412 Elm St|Fargo|ND|58102|Smith|James|415 Oak St|Mhd |MN|56560|Thom |Bob |12 Main St|Mhd |MN|56560|Trath|Phil |345 12thSt|Fargo|ND|58105|
Some day there may be truly 2-dimensional storage devices (or even 3-D devices)
such as holographic storage devices. At that time, the 2-D picture of a stored
file, above, may be an accurate depiction.
The way these entities are stored is not necessarily the same
as the way in which they are view or known to users.
They may be known to the users in various "logical" variations.
A simple example of a logical record coming from the physical employee record:
.-----.-.-----.--.
|Jones|J|Fargo|ND|
`-----'-'-----'--'
So we have
LOGICAL FIELD = smallest unit of logical data
LOGICAL RECORD= named collection of related logical fields.
LOGICAL FILE = named collection of occurrences of 1 type of logical record.
which may or may not correspond to the physical entities.
Unfortunately there is a lot of variation in terminology.
It will suffice to "equate" terms as follows in this course:
TERMINOLOGY
COMMON USAGE RELATIONAL MODEL TABULAR USAGE
------------ ---------------- -------------
File Relation Table
Record tuple row
field attribute column
When we need to be more careful we will use:
relation is a "set" of tuples
whereas a
table is a "sequence" of rows or records (has order)
tuple is a "set" of fields
whereas a
row or record is a "sequence" of fields (has order)
PHYSICAL DATA MODELS
(for conceptualizing (logically)
and storing (physically) data in a database)
HORIZONTAL MODELS for files of horizontal records
- processing is thru vertical "scans" (or "access-path-enhanced-scans")
RELATIONAL (simple flat unordered files or relations of records
of tuples of unordered field values)
TABULAR (ordered files of ordered fields)
INVERTED LIST (Tabular with an access paths (index?) on every field)
HIERARCHICAL (files with hierarchical links)
NETWORK (files with record chains)
OBJECT-RELATIONAL (Relational with "Large OBject" (LOBs) fields)
(attributes which point to or contain complex objects)
VERTICAL MODELS (for vertical vectors or trees of attribute values
processing is typically thru horizontal AND/OR programs)
BINARY STORAGE MODEL (Copeland ~1986) (uses vertical value and bit vectors)
BIT TRANSPOSE FILES (Wang ~1988) (used vertical bit files)
VIPER STRUCTURES (~1998) (used vertical bit vectors for data mining)
PREDICATE-TREEES or PTREES (vertical bit trees) (~1997)
A REVIEW OF HORIZONTAL DATA MODELS
RELATIONAL DATA MODEL
(only construct allowed is the simple flat "relation")
STUDENT-FILE COURSE FILE
|S#|SNAME |LCODE | |C#|CNAME|SITE|
|==|======|======| |==|=====|====|
|25|CLAY |NJ5101| |8 |DSDE |ND |
|32|THAISZ|NJ5102| |7 |CUS |ND |
|38|GOOD |FL6321| |6 |3UA |NJ |
|17|BAID |NY2091| |5 |3UA |ND |
|57|BROWN |NY2092| `-------------'
ENROLL FILE STATUS FILE
|S#|C#|GRADE| |LCODE |STATUS|
|==|==|=====| |======|======|
|32|8 | 89 | |NJ5101| 1 |
|32|7 | 91 | |NJ5102| 1 |
|25|7 | 68 | |FL6321| 4 |
|25|6 | 76 | |NY2091| 3 |
|32|6 | 62 | |NY2092| 3 |
|38|6 | 98 | `-------------'
|17|5 | 96 |
* STUDENT and COURSE represent entities.
* ENROLL file represents relationship between Students & Courses (many-many)
* STATUS represents a relationship between LCODE and STATUS (1-to-many)
HIERARCHICAL MODEL
entities = records,
relationships = links of records forming trees
(terminology in COMMON USAGE: files, records, fields)
EDUCATION example:
root type is STUDENT (with attributes S#, NAME, LOCATION),
dependent type is COURSE (with attributes C#, CNAME),
2nd-level dependent type ENROLLMENT (with attrs, GRADE, LOC)
____________ _____________ ___________
25|CLAY|OTBK 32|THAISZ|KNB 38|GOOD|GTR STUDENTS
__|____|____ __|______|___ __|____|___
| | |
___|_ ___|__ __|__
|7|CUS| |8|DSDE| |6|3UA|
___|_|___| .___|_|____| |_|___| COURSES
|6|3UA| | |7|CUS | | |
|__|__| | .___|_|____| | |
| | |6|3UA| | | |
| | |_|___| | | |
| _____ | | _|___ __|__
| |ND|68| | | |ND|89 NJ|98|
| |__|__| | | |_|___ __|__| ENROLLMENTS
_|___ _____| __|__
|NJ|76| |ND|62| |ND|91|
|__|__| |__|__| |__|__|
If the typical workload involves producing class lists
for students, this organization is optimal.
If the typical workload is producing course enrollment
lists for professors, it is very poor.
The point is, Hierarchical Data Model almost always favors
a particular workload category (at the expense of the others).
NETWORK MODEL
entities = records,
relationships and by owner-member chains (sets)
(many-to-many relationships easily represented)
EDUCATION example 2:
owner-member chains STUDENT-ENROLLMENT COURSE-ENROLLMENT
______________ ________________
25|CLAY|NJ5101|<. .<|32|THAISZ|NJ5102| STUDENTS
.------------- : : ----------------
. __ __ : : __ __ __ ^
`-->|68|-->|76|' `>|89|-->|91|->|62|' ENROLLMENTS
.>`--. >`--. .->`--' >`--: >--'v
: : : : : `. : : : :
: `--:- -:-:- - - -:-' : : :
: : : : `. : : :
: : : : : .<-' : :
: : `-:- - - --:-:- - -' :
: `--. : : : :
.-: - - - - -:-' : : :
| :_ _ _ _ `- - - - - -:-:-. :
^_______ `-. ________ : : ^ ______v_
8DSDE|ND|-. `<|7|CUS|ND|-:-' `|6|3UA|NJ| COURSES
-------- . -------- : --------
` - - - - - - -'
simple to insert (create new record and reset pointers),
delete (reset pointers),
update (always just 1 copy to worry about, NO REDUNDANCY)
network approach: fast processing,
complicated structure (usually requires data processing shop)
Again, it favors one workload type over others.
INVERTED LIST MODEL (TABULAR):
Flat Ordered Files
(like relational except there's intrinsic order visible to
user programs on both tuples and attributes)
Order is usually "arrival order", means each record is given a
unique "Relative Record Number" or RRN when it is inserted.
- RRNs never change (unless there is a database reorganization).
- Programs can access records by RRN.
Physical placement of records on pages is in RRN order
("clustered on RRN" so that application programs can
efficiently retrieve in RRN order.
Indexes, etc can be provided for other access paths (and orderings).
1st page-.
__________V_
|RID | S# |ST| STATE-INDEX
|====|====|==| .___._____.
| 1 | 25 |NJ|-._ |RID|STATE|
|----|----|--| \ |===|=====|
| 2 | 32 |NJ|--. \ .--|_3_|_FL__|
|----|----|--| `-\-/---|_2_|_NJ__|
| 3 | 38 |FL|------'----|_1_|_NJ__|
|----|----|--| __-|_4_|_NY__|
| 4 | 47 |NY|-------' _-|_5_|_NY__|
`------------' '
|
____________ |
| 5 | 57 |NY|-------'
|----|----|--|
|----|----|--|
|----|----|--|
|----|----|--|
|----|----|--|
|----|----|--|
`------------'
^
2nd page-'
OBJECT RELATIONAL MODEL
Object Relational Model (OR model) is like relational model except
repeating groups are allowed (many levels of repeating groups -
even nested repeating groups)
and
pointers to very complex structures are allowed
(LOBs for Large OBjects,
BLOBs for Binary Large OBjects, etc.
VERTICAL DATA MODELS
POWERPOINT SLIDES INTRODUCING THE CURRENT WORLD OF DATA AND
AN INTRODUCTION TO VERTICAL DATA (P-TREES): vertical data
WHAT ARE SOME OF THE IMPORTANT
NEW CATEGORIES OF DATA TODAY?
Business Intelligence Data
The Business Intelligence Problem can be described as the problem of determining
who is most likely to want what product, when?
This is obviously a critically important question to be able to answer for
any business today - especially web-based businesses.
Today, businesses cannot advertise to everyone
and even if they did, the fact that everyone gets the advertisement would
mean it is less likely to be effective.
Therefore targeted advertisers must answers the WHO WHAT WHEN problem:
How does a business determine:
who to advertise
what products to,
when?
That's not a matter of just querying a database?
It would be nice if we had database query languages that were high
enough level so we could just issue that query and get the answer,
but we are decades away from that
(maybe someone in this room will become a billionare developing that language?)
I can tell you that SQL is NOT that language or anything close.
The language will be a new
DATA MINING LANGUAGE
Data Mining Language is one of the very hottest topics these days,
but it is in it's infancy (those versed in language issues take note!).
Before a DM Language can be developed, Data Mining techniques need to
be much more highly developed, tested and understood.
We will concetrate a lot of this course on that issue.
How do we test data mining techniques?
Traditionally, BENCHMARK SUITES are developed by industry groups,
against which each new product is run to evaluate its performance.
Most Benchmarks Suites are created as more or less, toys
in size and realisticness (my opinion),
by the same people who will later run their products against those benchmarks
(so bias is always a concern)
and therefore the BENCHMARKS fail to be very useful to anyone
(Witnesss: Has any new database system every failed to win
on the TPC benchmarks for database queries?)
FORTUNATELY there is still hope for a legitimate and useful
Business intelligence Benchmark (my opinion).
It may come out of the NETFLIX PRIZE dataset and software.
Netflix may be the smartest company in the world today!
I base that on the fact that they have set a $1,000,000.00 prize for anyone
(team) that can improve their current business intelligence algorithm by at
least 10%.
They give 5 years to do it!
Over 100,000 of the best data mining scientists (and students) in the world
are devoting most of the energies to winning this contest right now
(over 25,000 teams registered!).
That's $10 per scientist! Try to hire a scientist for $10 ?
The current leader board shows an improvement of ~7-8 % already
so there is some success.
Each year (year 1 ends in October of 2007) they give $50,000 to the
LEADER BOARD leader.
If someone post a winning solution (bettering the Netflix algorithm by at least 10%,
ALL OTHER CONTESTANTS GET AN ADDITIONAL 30 DAYS TO BEAT THAT SCORE!
Can you imagine how much additional, intense effort they will get for free
during those 30 days?!?
That is so smart!!
The point really is though, that this dataset is a real life Business Intelligence
data set (massive and real, not toy!) and
the problem is THE Business intelligence problem!
It WILL develop into a Benchmark which will be used by everyone for decades!
You might consider developing your research paper/presentation problem
around the Netflix problem and data set
(Just a portion of the problem?
or one simple step toward the solution?
If you should end up solving the full problem,
you won't have to care about this course any more!)
I'd be happy to talk about the problem more and talk about paper topics that
might come out of the problem. Let me know if you are interested in that.
Medical Imaging Data
There are many technologies for medical imaging these days, including
XRay
Computed Tomography (CAT scans or CTA scans)
Magnetic Resonance Imaging (MRI scans)
Positron Emmision Tomography (PET scans)
Electromyography (EMG)
Electrocardiography (ECG or EKG)
Electroencephalography (EEG)
Magnetoencephalography (MEG)
just to name a few.
All of these produce an image which is processed into a data set with
many derived attributes (by software developed by experts in the field
to isolate the important characteristics of spots on the image and the
area surrounding spots on the image).
These derived data set must be data mined to determine evidence of desease
(Computer Aided Detection or CAD). This is a very hot topic these days.
A good benchmark is developing in this CAD data area which developed
out of the Association of Computing Machinery
Knowledge Discovery and Datamining Cup or KDD CUP
contest in 2006 ACM KDD Cup 2006
The NDSU team won the NPV Task of that contest, which was advertised as
THE HOLY GRAIL OF CAD
Here are some notes describing the work: PE CAD
Here you will find a paper describing each winner's work ACM SIGKDD Explorations, Dec, 2006
Market Basket Research (MBR) Data
(Which can be view horizontally or vertically)
Some interesting real-life FACTs:
---------------------------------
- Walmart lead the revolution (but in parallel with published results -
they keep all their discoveries secret - still do)
- they used MBR data mining for SUPPLY MANAGEMENT
- they used MBR data mining to know precisely
What quantity of each item to have at
Where (which stores),
When.
Proof that MBR data mining is important
---------------------------------------
- in 1990 Kmart was much larger than Walmart
- Kmart decided not to do data mining and
only started getting into it in the last few years
- Recently, two events occured
Kmart filed for bancrupcy protection
Walmart became the largest corporation in the world
So data management and mining is extremely useful/important!
Market basket data (MBR), the Set Model:
Cash Register Data at a retail store
A "transaction" or tuple in a MBR file is a customer at a
particular time, going through checkout with a "market basket"
or shopping cart full of items.
Data is organized as a table with 2 attributes
"Set Transaction Table": STT(Tid, Itemset)
Tid is transaction identifier (a particular market basket, and
Itemset is the set of items associated with Tid (the items in the basket).
Note: tuples are not "flat" (each itemset is a "set") (not RELATIONAL)
That's a problem for efficient mgmt and analysis, so typically
a change-over is made to a "dual model", the binary model:
"Binary Transaction Table": BTT(Tid, Item-1, Item-2,... Item-n)
Each Row is a customer transaction (1 cust thru checkout 1 time)
Each column is a particular Item (1 column for each item)
The value is either a 1-bit ( if item is in the basket)
or a 0-bit (item is not in the basket)
Items = precisely the items available for purchase
(in this analysis we don't care how many of an item the customer
has in his/her cart, just whether he/she has that item)
The SetTransTable, STT(Tid, ItemSet) can be
"transformed" to the binary model, BTT, but also it can be
"rotated" to the SetItemTable SIT(Iid, TransSet)
- SIT can be transformed to the binary model, BIT, also.
but we note it is just the transpose of BTT.
- This simple tranformation of the structure from a two
column non-relational model to the binary model set the stage for
a whole revolution in MBR data mining from 92-present!
STT(Tid,ItemSet) and BTT(Tid,Item1..Item-n) don't deal with item hierarchy
eg, milk is an item at 1 level, but breaks down into
skim, whole, 2%.. at the next level down the concept hierarchy...
Work on hierarchical Market Basket Research has just begun.
It would make a great research topic for this course.
Thus, in the MBR models, these "Binary Transaction Tables" are
- extremely wide (many many feature attributes)
- extremely sparse (mostly 0's - most customers buy just a few
items in one shopping trip!)
Bioinformatics/genetics data is remarkably similar to MBD
Microarray Data Analysis (MDA) is the analysis of the gene expression levels
of thousands of genes spotted on glass slides called a Microarrays and
subjected to various "treatments" or experiments
MDA is often stored as an Excel spreadsheet,
RealGeneTbl: RGT(Gid,E1...En)
- row = gene
- column = experiment (plus other columns)
- value = expression value or log ratio of expression values
BinaryGeneTbl: BGT(Gid,E1...En)
is the table you get by setting a threshold expression ratio
and recording 1 iff it is exceeded:
SetGeneTbl: SGT(Gid, ExpSet) is its tanslation to the set model.
SetExpTbl: SET(Eid, GeneSet) is obtained by rotating SGT
row = experiment,
col = Genes whose expr level exceeds threshold for that exp
BinaryExpTbl: BET(Eid, Gene1,...,GeneN) is similar to BTT in MBR
Microarray methodology animation Microarray methodology animation
A fairly good benchmark is developing in this MBR data area also which developed
out of the Association of Computing Machinery
Knowledge Discovery and Datamining Cup or KDD CUP
contest in 2002 ACM KDD Cup 2002
The NDSU team tie for second on Task 3 of that contest (actually
got the highest score on the "broad class problem" of that task).
Here is a paper describing each winner's and placer's work
ACM SIGKDD Explorations, Dec, 2002
Formally in MBR, the BTT is defined as follows:
I={i1..im} is the set of items.
- eg, an item for purchase in a store
- Each item in a store is an attribute, Ai,
- with Boolean values (1 = in a customer's "market basket
or shopping cart" and 0=not in it).
A is an itemsets (or Iset) if it is a subset of I,
- eg, a set of items for purchase from a store
D = {ti..tn} is the set of transactions
- each ti consists of an identifier and an itemset:
ti = ( t-id, t-itemset )
- eg, a customer going thru checkout,
the set of items in her/his cart
A transaction,t, SUPPORTS an itemset,A,
if A IS CONTAINED IN t-itemset.
Stock market analysis data
It is not even clear what data should be considered
"Stock Market Analysis Data" since it is not clear what are
the best "signals" for future stock rise or fall?
This is an exciting area in which to work (but difficult).
Those who have strong backgrounds in financial data and the
stock market are invited to do a project in this area.
Signal data (scientific)
A signal can be thought of as any function
f:{0,1,2,3...} -- > Real or Complex numbers
or f:Real_numbers -- > Real or Complex numbers
The first is referred to as a discrete signal and the second
as a continuous signal (domain is often taken to be time axis).
Signals can also be defined on the Complex number space:
f:Complex_numbers -- > Real or Complex numbers
Signals can certainly be defined on a Cartesian product of any
or all of the domains above, as well.
Signal data comes from many many sources (sensors), including
satellite platforms,
aerial platforms,
radar facilities,
fixed position ground sensors and many more,
nano-sensors,
etc.
Signal processing is usually 1st step in making this data useful
But typically it is not enough.
Alien Technologies is building a 50 Million dollar plant 3 blocks from here
(in the NDSU Research Technology Park - just west 18th ST and south of 19 Ave.)
in which they will produce billions of Radio Frequency ID tags (RFid tags) for
Walmart, Gillette and etc.
These nano-scale passive RF tags will be embedded in every pallete that
Walmart and Gillette etc. ship. That way an active RF sensor can be passed
within 10 feet of the pallete and read the information (like bar code info).
The building just east and north of Phoenix Engineering
houses the Center for Nano-Scale Science and Engineering or CNSE
CNSE
where research on nano-scale products is being conduted for the defense
department (in consort with Alien Corp.).
VLSI design data
The VLSI component testing process involves massive arrays of
binary data (hundreds of rows and billions of columns)
and a hierarchical structure of subcomponents.
Data Warehouse (multidimensional) data
data warehouse slides (DW)
The following is a text based treatment of the topic. It will not be
lectured over but is included for an additional treatment of the subject
and for review and practice.
MULTIDIMENSIONAL DATA
(roughly, it's data with more than 1 natural "keys" or look up attributes)
Multidimnesional data comes up in the context of Datawarehouses
- A datawarehouse (DW) is a database which is typically:
- static (no updates, only inserts of new data)
- read only
- In a datawarehouse, at the end of each business cylce
(end of the day - after the data is no longer changing
due to intra-day activity) a copy of the relevant data
is appended to the DW with the proper timestamp.
- Knowledge workers can analyze the data without having to wait
for updates to be installed (no concurrency control required)
Multidimensional data is often organized into a Data Cube.
- A data cube allows data to be modeled and viewed in multiple dimensions.
- It is defined by dimensions and facts.
- The Fact is the central theme of the organization and
instances of the fact are contained in a FACT TABLE
- the Fact table contains key attributes (one for each dimension
that defines the fact) and descriptive attributes of the
facts (measurements).
- Dimensions are perspectives from which to view facts
- combinations of dimension keys are keys for the facts
- Dimensions records are kept in a dimension files.
Example: AllElectronics Sales:
A fact is a sale
Dimensions are time (Q1,Q2,Q3,Q4),
item (home-ent,computer,phone,security) and
location(Chicago,New_York,Toronto,Vancouver)
Relational Model (4 relations, 1 for each location, with fixed
records, Q1,Q2,Q3,Q4 and identical attributes, ent,com,ph,sec
Loc="Chicago" Loc="New York" Loc="Toronto" Loc="Vancouver"
ent com ph sec ent com ph sec ent com ph sec ent com ph sec
Q1 854 882 89 623 1087 968 38 872 818 746 43 591 605 825 14 400
Q2 943 890 64 698 1130 1024 41 925 894 769 52 682 680 952 31 512
Q3 1032 924 59 789 1034 1048 45 1002 940 795 58 728 812 1023 30 501
Q4 1129 992 63 870 1142 1091 54 984 978 864 59 784 927 1038 38 580
Note that ent,com,ph,sec are not really "descriptive attributes" of quarters
and quarters are not really tuple instances since they are always the same 4.
A better model is the
Star Model (3-D Fact Cube of sales by time/item/location:
____________________________
y / / / / /|
t Chicago/ / / / / |
c / 854 / 882 / 89 / 623 / |
/______/______/______/______/ |
N / / / / /| |
O New York/ / / / / | /|
I / 1087 / 968 / 38 / 872 / | / |
T /______/______/______/______/ |/ 8|
A / / / / /| / 9 |
C Toronto/ / / / / | /|6 /|
O / 818 / 746 / 43 / 591 / | / | / |
L /______/______/______/______/ |/ 5|/ 9|
/ / / / /| | 2 | 8 |
Vancouver/ / / / / | /|9 /|7 /|
/ / / / / | / | / 2 / |
/______/______/______/______/ |/ 2|/ 0|/ 0|
| | | | | | 8 | 0 | 7 /
T Q1 | 605 | 825 | 14 | 400 | /|6 /|1 /|8 /
I | | | | | / | / | / | /
M |______|______|______|______|/ |/ 8|/ 4|/
E | | | | | | 2 | 8 /
Q2 | 680 | 952 | 31 | 512 | /|7 /|9 /
q | | | | | / | / | /
t |______|______|______|______|/ |/ 4|/
r | | | | | | 8 /
s Q3 | 812 | 1023 | 30 | 501 | /|7 /
| | | | | / | /
|______|______|______|______|/ |/
| | | | | /
Q4 | 927 | 1038 | 38 | 580 | /
| | | | | /
|______|______|______|______|/
home-ent comp phone security
ITEM
Then each dimension (ITEM, TIME, LOCATION) is fully described in
a separate table (dimension table.
If there are 4 dimensions (add supplier), the
4-D Fact Cube of sales by time/item/location/supplier
can be shown as a series of 3-D cubes:
supplier= "SUP1"
___________________
C Chicago/ / / / /|
O /____/____/____/___/ |
L NY/ / / / /| |
/____/____/____/___/ |/|
Toronto/ / / / /| / |
/____/____/____/___/ |/|/|
Vancover / / / /| | | |
/____/____/____/___/ |/|/|/|
Q1|605 |825 | 14 |400| | | | /
T |____|____|____|___|/|/|/|/
I Q2| | | | | | | / supplier= "SUP2"
M |____|____|____|___|/|/|/ ________________
E Q3| | | | | | / / / / / /|
|____|____|____|___|/|/ /___/___/___/___/ |
Q4| | | | | / / / / / /| |
|____|____|____|___|/ /___/___/___/___/ |/|
ent comp ph sec / / / / /| / |
I T E M /___/___/___/___/ |/|/|
/ / / / /| | | |
/___/___/___/___/ |/|/|/|
| | | | | | | | /
|___|___|___|___|/|/|/|/
| | | | | | | / supplier="SUP3"
|___|___|___|___|/|/|/ ________________
| | | | | | / / / / / /|
|___|___|___|___|/|/ /___/___/___/___/ |
| | | | | / / / / / /| |
|___|___|___|___|/ /___/___/___/___/ |/|
/ / / / /| / |
/___/___/___/___/ |/|/|
/ / / / /| | | |
/___/___/___/___/ |/|/|/|
| | | | | | | | /
|___|___|___|___|/|/|/|/
| | | | | | | /
|___|___|___|___|/|/|/
| | | | | | /
|___|___|___|___|/|/
| | | | | /
|___|___|___|___|/
Cubes are simply a visual representation to show
how the data can be understood.
Multidimensional data is usually stored
along with precomputed aggregates of high interest
(sums, averages,...) to the knowledge worker using the
data warehouse, such as sums, averages, etc.
These aggregates are "rolled up" and stored as part of the cube
as follows:
Adding Total Sales by location-time
("rolling up" by summing along item)
/ :
y .======.======.======.=======.-------. /:
t Chicago/ / / / // / | / :
c / 854 / 882 / 89 / 623 // 2448 / | :
/______/______/______/______//_______/ | :
N / / / / // /| | /:
O New York/ / / / // / | /| / :
I / 1087 / 968 / 38 / 872 // 2965 / | / | :
T /______/______/______/______//_______/ |/ | :
A / / / / // /| /25 | /:
C Toronto/ / / / // / | /|95/| / :
O / 818 / 746 / 43 / 591 // 2198 / | / | / | :
L /______/______/______/______//_______/ |/ |/ | :
/ / / / // /| |31 |28 | /
Vancouver / / / // / | /|20/|04/| /
/ / / / // / | / | / | / |
ITEM/_ent__/_comp_/_phone/_sec__//_______/ |/ |/ |/ |
| | | | || | |23 |31 |30 /
Q1 | 605 | 825 | 14 | 400 || 1844 | /|97/|29/|54/
| | | | || | / | / | / | /
T |______|______|______|______||_______|/ |/ |/ |/
I | | | | || | |25 |32 /
M Q2 | 680 | 952 | 31 | 512 || 2175 | /|21/|71/
E | | | | || | / | / | /
|______|______|______|______||_______|/ |/ |/
q | | | | || | |26 /
t Q3 | 812 | 1023 | 30 | 501 || 2366 | /|85/
r | | | | || | / | /
s |______|______|______|______||_______|/ |/
| | | | || | /
Q4 | 927 | 1038 | 38 | 580 || 2583 | /
| | | | || | /
|______|______|______|______||_______|/
Total Sales by location-time
/
y .======.======.======.=======.-------.
t Chicago/ / / / // / |
c / 854 / 882 / 89 / 623 // 2448 / |
/______/______/______/______//_______/ |
N / / / / // /| |
O New York/ / / / // / | /|
I / 1087 / 968 / 38 / 872 // 2965 / | / |
T /______/______/______/______//_______/ |/ |
A / / / / // /| /25 |
C Toronto/ / / / // / | /|95/|
O / 818 / 746 / 43 / 591 // 2198 / | / | / |
L /______/______/______/______//_______/ |/ |/ |
/ / / / // /| |31 |28 |
Vancouver/ / / / // / | /|20/|04/|
/ / / / // / | / | / | / |
ITEM-> /_ent__/_comp_/_phone/_sec__//_______/ |/ |/ |/ |
| | | | || | |23 |31 |30 /
Q1 | 605 | 825 | 14 | 400 || 1844 | /|97/|29/|54/
| | | | || | / | / | / | /
|______|______|______|______||_______|/ |/ |/ |/
| | | | || | |25 |32 /
Q2 | 680 | 952 | 31 | 512 || 2175 | /|21/|71/
| | | | || | / | / | /
|______|______|______|______||_______|/ |/ |/
| | | | || | |26 /
Q3 | 812 | 1023 | 30 | 501 || 2366 | /|85/
| | | | || | / | /
|______|______|______|______||_______|/ |/
| | | | || | /
Q4 | 927 | 1038 | 38 | 580 || 2583 | /
| | | | || | /
|======|======|======|======='-------'
| | | | | /
| 3024 | 3838 | 113 | 1993 | /
| | | | | /
|______|______|______|______|/
Adding Total Sales by location-item
("rolling up" by summing along time)
Total Sales by location-time
/
y .======.======.======.=======.-------.
t Chicago/ / / / // / |
c / 854 / 882 / 89 / 623 // 2448 / |
/______/______/______/______//_______/ |
N / / / / // /| |
O New York/ / / / // / | /|
I / 1087 / 968 / 38 / 872 // 2965 / | / |
T /______/______/______/______//_______/ |/ |
A / / / / // /| /25 |
C Toronto/ / / / // / | /|95/|
O / 818 / 746 / 43 / 591 // 2198 / | / | / |
L /______/______/______/______//_______/ |/ |/ |
/ / / / // /| |31 |28 |
Vancouver/ / / / // / | /|20/|04/|
/ / / / // / | / | / | / |
----> /_ent__/_comp_/_phone/_sec__//_______/ |/ |/ |/ |
| | | | || | |23 |22 |30/|
Q1 | 605 | 825 | 14 | 400 || 1834 | /|67/|29/|54/|
| | | | || | / | / | / |// |
|______|______|______|______||_______|/ |/ |/ |/ |
| | | | || | |25 |32//218/
Q2 | 680 | 952 | 31 | 512 || 2175 | /|21/|71/|02/
| | | | || | / | / |// | /
|______|______|______|______||_______|/ |/ |/ |/
| | | | || | |26//312/
Q3 | 812 | 1023 | 30 | 501 || 2366 | /|85/|54/
| | | | || | / |// | /
|______|______|______|______||_______|/ |/ |/
| | | | || | //187/
Q4 | 927 | 1038 | 38 | 580 || 2583 | //|69/
| | | | ||_______|// | /
|======|======|======|======||=======|/ |/
| | | | ||| | /
-----> | 3024 | 3838 | 113 | 1993 ||| 8968 | /
| | | | ||| | /
|______|______|______|______|||__/___|/
/
Total Sales by location-item Adding Tot-sales by location
("rolling up" by summing time)
(equiv, rolling right on item)
Adding Total Sales by item-time
("rolling up" by summing along location)
__|______|______|______|____
/ | / | / | / | /|
/ / / / / |
/ 3364 / 3421 / 184 / 2486 / |
/ / / / / |
y /======/======/======/=======.--------
t Chicago/ / / / // / |
c / 854 / 882 / 89 / 623 // 2448 / |
/______/______/______/______//_______/ e |
N / / / / // /|m |
O New York/ / / / // / i /|
I / 1087 / 968 / 38 / 872 // 2965 / t| / |
T /______/______/______/______//_______/ |/ |
A / / / / // /|c /25 |
C Toronto/ / / / // / o /|95/|
O / 818 / 746 / 43 / 591 // 2198 / l| / | / |
L /______/______/______/______//_______/ |/ |/ |
/ / / / // /|s |31 |28 |
Vancouver/ / / / // / e /|20/|04/|
/ / / / // / l| / | / | / |
----> /_ent__/_comp_/_phone/_sec__//_______/ a |/ |/ |/ |
| | | | || |s |23 |31 |30 |
Q1 | 605 | 825 | 14 | 400 || 1844 | /|97/|29/|54/|
| | | | || | / | / | / | / |
|______|______|______|______||_______|/ |/ |/ |/ |
| | | | || | |25 |32 /218/
Q2 | 680 | 952 | 31 | 512 || 2175 | /|21/|71/|02/
| | | | || | / | / | / | /
|______|______|______|______||_______|/ |/ |/ |/
| | | | || | |26//312/
Q3 | 812 | 1023 | 30 | 501 || 2366 | /|85/|54/
| | | | || | / | / | /
|______|______|______|______||_______|/ |/ |/
| | | | || | /187/
Q4 | 927 | 1038 | 38 | 580 || 2583 | /|69/
| | | | || | / | /
|======|======|======|======||=======|/ |/
| | | | || | /
| 3024 | 3838 | 113 | 1993 || 8968 | /
| | | | || | /
|___/__|_____/|___/__|_/____||___/___|/
/ / / / /
Total Sales by location/item-' Tot-sales by location
Total sales by time
(roll up by item)
________________________________|____
/ Total sales/by item-time // | /|
/ / / / // / |
/ 3364 / 3421 / 184 / 2486 // 9455 / |
/ / / / // / |
y /======/======/======/======//=======/| |
t Chicago/ / / / // /|| /|
c / 854 / 882 / 89 / 623 // 2448 / || / |
/______/______/______/______//_______/ e||/ |
N / / / / // /|m ||102|
O New York/ / / / // / i ||87/|
I / 1087 / 968 / 38 / 872 // 2965 / t| /|| / |
T /______/______/______/______//_______/ |/ ||/ |
A / / / / // /|c /25||108|
C Toronto/ / / / // / o /|95||20/|
O / 818 / 746 / 43 / 591 // 2198 / l| / | /|| / |
L /______/______/______/______//_______/ |/ |/ ||/ |
/ / / / // /|s |31 |28||115|
Vancouver / / / // / e /|20/|04||93/
/ / / / // / l| / | / | /|| /
/_ent__/_comp_/_phone/_sec__//_______/ a |/ |/ |/ ||/
| | | | || |s |23 |31 |30||
Q1 | 605 | 825 | 14 | 400 || 1844 | /|97/|29/|54||
| | | | || | / | / | / | /||
|______|______|______|______||_______|/ |/ |/ |/ ||
| | | | || | |25 |32 /218/
Q2 | 680 | 952 | 31 | 512 || 2175 | /|21/|71/|02/
| | | | || | / | / | / | /
|______|______|______|______||_______|/ |/ |/ |/
| | | | || | |26//312/
Q3 | 812 | 1023 | 30 | 501 || 2366 | /|85/|54/
| | | | || | / | / | /
|______|______|______|______||_______|/ |/ |/
| | | | || | /187/
Q4 | 927 | 1038 | 38 | 580 || 2583 | /|69/
| | | | || | / | /
|======|======|======|======||=======|/ |/
| | | | || | /
| 3024 | 3838 | 113 | 1993 || 8968 | /
| | | | || | /
|___/__|_____/|___/__|_/____||___/___|/
/ / / / /
Total Sales by location/item Tot-sales by location
____________________________ .-------.
/ Total sales/by item-time /| / /|
/ / / / / |/ / |
/ 3364 / 3421 / 184 / 2486 / / / |
/ / / / / /_______/ |
y |------|------|------|------/ | | |
t ____________________________ /=======. | |
i Chicago / / / /|/ /| | /|
c / 854 / 882 / 89 / 623 / / 2448 / | | / |
/______/______/______/______/ /_______/ | |/ |
N / / / / / / /| | |102|
O New York / / / / / / | | |87/|
I / 1087 / 968 / 38 / 872 / / 2965 / | /| | / |
T /______/______/______/______/ /_______/ |/ | |/ |
A / / / / / / /| /25| |108|
C Toronto / / / / / / | /|95| |20/|
O / 818 / 746 / 43 / 591 / / 2198 / | / | /| | / |
L /______/______/______/______/ /_______/ |/ |/ | |/ | __
/ / / / / / /| /|31 |28| |115| /|
Vancouver / / / / / / | /|20/|04| |93/ / |
/ / / / / / / | / | / | /| | / / |
/_ent__ /_comp_/_phone/_sec__/ /_______/ |/ |/ |/ |_|/ / |
| | | | | | | /|23 |31 |305 __/421 |
Q1 | 605 | 825 | 14 | 400 | | 1844 | / |97/|29/|4| -. | 55 /
| | | | | | | / | / | / |/ /| | /
|______|______|______|______| |_______|/ |/ |/ / / | | /^
| | | | | | | /|25 |327 / | | / |
Q2 | 680 | 952 | 31 | 512 | | 2175 | / |21/|1/ _/ | | / Grand
| | | | | | | / | / |/ /|218/ _|/ Total
|______|______|______|______| |_______|/ |/ / / |02/ Sales
| | | | | | | /|268 / | / rollup time
Q3 | 812 | 1023 | 30 | 501 | | 2366 | / |5/ _/ |/ rollup loc
| | | | | | | / |/ /|312/ rollup item
|______|______|______|______| |_______|/ / / |54/
| | | | | | | / / | /
Q4 | 927 | 1038 | 38 | 580 | | 2583 | / _/ |/
| | | | |/| | / /|187/
`------'------'------'------' `--------' / |69/
/ / / / | /
/______/______/______/______/ /_______ / |/
| | | | | | | /
| 3024 | 3838 | 113 | 1993 | | 8968 | /
| | | | | | | /
|______|______|______|______|/ | | /
`--------
/ / / / /
Total sales by time
Total Sales by location/item Tot-sales by location
- Of course other aggregates can be added as well.
- Every aggregate added can increase the processing time when
new data arrives
(in this case, new quarter, new location, new items)
_____________________________________
/ Total sales/by item-time //by time/|
/ v / / / v // v v / |
/ 3364 / 3421 / 184 / 2486 // 9455 / |
/ / / / // / |
y /======/======/======/======//=======/| |
t Chicago/ / / / // /|| /|
c / 854 / 882 / 89 / 623 // 2448 / || / |
/______/______/______/______//_______/ e||/ |
N / / / / // /|m ||102|
O New York/ / / / // / i ||87/|
I / 1087 / 968 / 38 / 872 // 2965 / t| /|| / |
T /______/______/______/______//_______/ |/ ||/ |
A / / / / // /|c /25||108|
C Toronto/ / / / // / o /|95||20/|
O / 818 / 746 / 43 / 591 // 2198 / l| / | /|| / |
L /______/______/______/______//_______/ |/ |/ ||/ |_
/ / / / // /|s |31 |28||115|/|
Vancouver / / / // / e /|20/|04||93// |
/ / / / // / l| / | / | /|| // |
/_ent__/_comp_/_phone/_sec__//_______/ a |/ |/ |/ ||// |
| | | | || |s |23 |31 |30||/421/
Q1 | 605 | 825 | 14 | 400 || 1844 | /|97/|29/|54|| 55/l
| | | | || | / | / | / | /|| /a
|______|______|______|______||_______|/ |/ |/ |/ || /t
| | | | || | |25 |32 /218|/o
Q2 | 680 | 952 | 31 | 512 || 2175 | /|21/|71/|02/ T
| | | | || | / | / | / | /
|______|______|______|______||_______|/ |/ |/ |/
| | | | || | |26//312/
Q3 | 812 | 1023 | 30 | 501 || 2366 | /|85/|54/
| | | | || | / | / | /
|______|______|______|______||_______|/ |/ |/
| | | | || | /187/
Q4 | 927 | 1038 | 38 | 580 || 2583 | /|69/
| | | | || | / | /
|======|======|======|======||=======|/ |/
| | | | || | /
| 3024 | 3838 | 113 | 1993 || 8968 | /
| total sales by loc-item ||by loc | /
|______|______|______|______||_______|/
|======|======|======|======||=======
| | | | || /
| 756 | 960 | 28 | 498 || /
|______|______|______|______||/
/ / / /
Average Sales by location-item
("rolling up" by averaging along time)
Maximum Sales by location-time
("rolling up" by taking Maximum along item)
/
y .======.======.======.=======.-------.
t Chicago/ / / / // / |
c / 854 / 882 / 89 / 623 // 882 / |
/______/______/______/______//_______/ |
N / / / / // /| |
O New York/ / / / // / | /|
I / 1087 / 968 / 38 / 872 // 1087 / | / |
T /______/______/______/______//_______/ |/ |
A / / / / // /| /943|
C Toronto/ / / / // / | /| /|
O / 818 / 746 / 43 / 591 // 818 / | / | / |
L /______/______/______/______//_______/ |/ |/ |
/ / / / // /| |113|103|
Vancouver / / / // / | /| 0/| 2/|
/ / / / // / | / | / | / |
ITEM/_ent__/_comp_/_phone/_sec__//_______/ |/ |/ |/ |
| | | | || | |894|104|112/
Q1 | 605 | 825 | 14 | 400 || 825 | /| /| 8/| 9/
| | | | || | / | / | / | /
T |______|______|______|______||_______|/ |/ |/ |/
I | | | | || | |940|114/
M Q2 | 680 | 952 | 31 | 512 || 952 | /| /| 2/
E | | | | || | / | / | /
|______|______|______|______||_______|/ |/ |/
q | | | | || | |978/
t Q3 | 812 | 1023 | 30 | 501 || 1023 | /| /
r | | | | || | / | /
s |______|______|______|______||_______|/ |/
| | | | || | /
Q4 | 927 | 1038 | 38 | 580 || 1038 | /
| | | | || | /
|______|______|______|______||_______|/
Minimum Sales by location-time
("rolling up" by taking minimums along item)
/
y .======.======.======.=======.-------.
t Chicago/ / / / // / |
c / 854 / 882 / 89 / 623 // 89 / |
/______/______/______/______//_______/ |
N / / / / // /| |
O New York/ / / / // / | /|
I / 1087 / 968 / 38 / 872 // 38 / | / |
T /______/______/______/______//_______/ |/ |
A / / / / // /| /64 |
C Toronto/ / / / // / | /| /|
O / 818 / 746 / 43 / 591 // 43 / | / | / |
L /______/______/______/______//_______/ |/ |/ |
/ / / / // /| |41 |59 |
Vancouver / / / // / | /| /| /|
/ / / / // / | / | / | / |
ITEM/_ent__/_comp_/_phone/_sec__//_______/ |/ |/ |/ |
| | | | || | |52 |45 |63 /
Q1 | 605 | 825 | 14 | 400 || 14 | /| /| /| /
| | | | || | / | / | / | /
T |______|______|______|______||_______|/ |/ |/ |/
I | | | | || | |58 |54 /
M Q2 | 680 | 952 | 31 | 512 || 31 | /| /| /
E | | | | || | / | / | /
|______|______|______|______||_______|/ |/ |/
q | | | | || | |59 /
t Q3 | 812 | 1023 | 30 | 501 || 30 | /| /
r | | | | || | / | /
s |______|______|______|______||_______|/ |/
| | | | || | /
Q4 | 927 | 1038 | 38 | 580 || 38 | /
| | | | || | /
|______|______|______|______||_______|/
Etc.
So there are nearly an infinity of different rollups
that could be performed automatically (as preprocessing
so that the results are already computed when needed),
however it is not feasible to precompute all of them.
Therefore the choice of which rollups to precompute
is an important tuning parameter of a data warehouse.
Stars, Snowflakes, and Constellations
Model Schemas for multidimensional data
Stars, Snowflakes, and Constellations
Model Schemas for multidimensional data
Large central Fact table and a set of smaller dimension tables,
- Sales example (fact=sale; dimensions=time,branch,item,location)
STAR MODEL SCHEMA
=================
DIM_TBL: Time
.----.----.---.--.---.--.
|tkey| day|dow|mo|qtr|yr|
|----|----|---|--|---|--|
| 1| 24|mon| 7| 3|00|
| 2| 31|thu| 8| 3|00|
. DIM_TBL: Item
. .--.---.---.---.----.
|ik|inm|brd|typ|styp|
|--|---|---|---|----|
FACT TABLE: Sales |41|jug|dow|gal|good|
.----.----.----.----.----.----. |67|rug|per|bth|good|
|tkey|ikey|bkey|lkey|$sld|#sld| .
|----|----|----|----|----|----| .
| 2| d67| nw1| nd8| 349| 5|
| 1| a41| nw1| nd8|1426| 10|
| 1| a41| nw1| mn1| 98| 1| DIM_TBL: Location
| 1| d67| se3| mn1| 7| 276| .---.---.---.--.---.
| 2| d67| nw1| mn1| 692| 2| |lk |str|cty|st|ctry
. |---|---|---|--|---|
. |mn1|elm|dl |MN|USA|
|nd8|oak|mot|ND|USA|
.
DIM_TBL: Branch .
.----.---.----.
|bkey|bnm|btyp|
|nw1 |icy| whs|
|se3 |hot|stor|
.
.
SNOWFLAKE SCHEMA
================
- variant of star:
- the dim tables are normalized
(e.g., Location is split into Location and City)
- reduces redundancy
- better design
DIM_TBL: Time
.----.----.---.--.---.--.
|tkey| day|dow|mo|qtr|yr|
|----|----|---|--|---|--|
| 1| 24|mon| 7| 3|00|
| 2| 31|thu| 8| 3|00|
DIM_TBL: Item
.--.---.---.---.----.
|ik|inm|brd|typ|stky|
|--|---|---|---|----|
FACT TABLE: Sales |41|jug|dow|gal|good|
.----.----.----.----.----.----. |67|rug|per|bth|good|
|tkey|ikey|bkey|lkey|$sld|#sld|
|----|----|----|----|----|----|
| 2| d67| nw1| nd8| 349| 5|
| 1| a41| nw1| nd8|1426| 10|
| 1| a41| nw1| mn1| 98| 1| DIM_TBL: Location
| 1| d67| se3| mn1| 7| 276| .---.---.---.
| 2| d67| nw1| mn1| 692| 2| |lk |str|cky|
|---|---|---|
|mn1|elm|A |
|nd8|oak|B |
DTBL:Cty
.--.---.--.----.
|ck|cty|st|ctry|
|--|---|--|----|
| A|dl |MN|USA |
| B|mot|ND|USA |
DIM_TBL: Branch
.----.---.----.
|bkey|bnm|btyp|
|----|---|----|
|nw1 |icy| whs|
|se3 |hot|stor|
GALAXY SCHEMA or FACT CONSTELLATION
===================================
Multiple fact tbls share dimension tables (e.g., Sales and Shipments)
DTBL: supp
.--.----.
DIM_TBL: Item |sk|styp|
.---.----.---.---.--. |--|----|
|ik |inam|brd|typ|sk| | 6|good|
|---|----|---|---|--| | 7|fair
|a41|jug |dow|gal| 6|
|d67|rug |per|bth| 6|
DIM_TBL: Time
.--.----.---.--.q.--.
|tk| day|dow|mo|t|yr|
|--|----|---|--|r|--|
| 1| 24|mon| 7|3|00| FACT TABLE: Sales
| 2| 31|thu| 8|3|00| .--.---.---.---.----.----.
. |tk|ik |bk |lk |$sld|#sld|
. |--|---|---|---|----|----|
| 2|d67|nw1|nd8| 349| 5|
| 1|a41|nw1|nd8|1426| 10|
DIM_TBL: Branch | 1|a41|nw1|mn1| 98| 1| DTBL: Loc
.--.---.----. | 1|d67|se3|mn1| 7| 276| .---.---.--.
|bk|bnm|btyp| | 2|d67|nw1|mn1| 692| 2| |lk |str|ck|
|--|---|----| |---|---|--|
nw1|icy| whs| |mn1|elm| A|
se3|hot|stor| |nd8|oak| B|
FACT_TBL: Shipment DTBL:Cty
.---.--.--.--.--.-----.------. .--.---.--.----.
|ik |tk|sk|fl|tl|$cost|#shipd| |ck|cty|st|ctry|
|---|--|--|--|--|-----|------| |--|---|--|----|
|d67| 2| 6|dl|fg| 45| 31| | A|dl |MN|USA |
|d67| 1| 6|fg|mt| 87| 9| | B|mot|ND|USA |
.
.
Examples for Defining Star Schema:
__________
//by time /|
// / |
// 9455 / |
//________/ |
y .---------------------------//--------.| |
t Chicago/ / / / // / || /|
c / 854 / 882 / 89 / 623 // 2448 / || / |
/______/______/______/______//_______/ e ||/ |
C / / / / // /|m ||102|
O New York/ / / / // / i /||87/|
L / 1087 / 968 / 38 / 872 // 2965 / t| / || / |
/______/______/______/______//_______/ |/ ||/ |
/ / / / // /|c /25 ||108|
Toronto/ / / / // / o /|95 ||20/|
/ 818 / 746 / 43 / 591 // 2198 / l| / | / || / |
/______/______/______/______//_______/ |/ |/ ||/ |_
/ / / / // /|s |31 |28 ||115|/|
Vancouver / / / // / e /|20/|04/||93// |
/ / / / // / l| / | / | / || // |
/_ent__/_comp_/_phone/_sec__//_______/ a |/ |/ |/ ||// |
| | | | || |s |23 |31 |30 |//421/l
Q1 | 605 | 825 | 14 | 400 || 1844 | /|97/|29/|54//||55/a
| | | | || | / | / | / | // || /t
|______|______|______|______||_______|/ |/ |/ |// ||/o
| | | | || | |25 |32 //218||T
Q2 | 680 | 952 | 31 | 512 || 2175 | /|21/|71//|02/
| | | | || | / | / | // | /
|______|______|______|______||_______|/ |/ |// |/
| | | | || | |26 //312/
Q3 | 812 | 1023 | 30 | 501 || 2366 | /|85// 54/
| | | | || | / | //| /
|______|______|______|______||_______|/ |// | /
| | | | || | //187/
Q4 | 927 | 1038 | 38 | 580 || 2583 | /| 69/
| | | | || | //| /
|______|______|______|______||_______|// | /
||-------|/ |/
|| 8968 | /
||by loc | /
|| | /
||_______|/
A star schema for this cube might be:
DIM_TBL: Item
.--.----.---.---.--.
|ik|inam|brd|typ|st|
|--|----|---|---|--|
|e |ent |ACM|big|gd|
|c |comp|BRS|slo|gd|
|p |phon|ATT|bad|bd|
|s |sec |KIL|mid|gd|
DTBL: Loc
.--.---.---.--.-----.
|lk|str|cty|st|total|
|--|---|---|--|-----|
|v |elm|van|BC| 8968|
|t |oak|tor|OT|18769|
|n | 45| NY|NY|31254|
|c |112|chi|IL|21802|
DIM_TBL: Time GRD_TOTAL
.--.---.--.------. .-------.
|tk|day|mo| total| |__42155|
|--|---|--|------|
| 1| 24|ma| 9455| FACT_TBL: Sales
| 2| 31|ju| 10287| .--.--.--.--.------------.
| 3| 15|au| 10820| |tk|lk|ik|bk|Dollars_sold|
| 4| 04|de| 11593| |--|--|--|--|------------|
| 1|v |e |w1| 605 |
| 1|t |e |w1| 818 |
DIM_TBL: Branch | 1|n |e |w1| 1087 |
.--.---.----. | 1|c |e |w1| 854 |
|bk|bnm|btyp| | 2|v |e |w1| 680 |
|--|---|----| | 2|t |e |w1| 894 |
|w1|WES| whs| | 2|n |e |w1| 1130 |
| 2|c |e |w1| 943 |
| 3|v |e |w1| 812 |
| 3|t |e |w1| 940 |
DIM: LOC_TIME | 3|n |e |w1| 1034 |
.-------------. | 3|c |e |w1| 1032 |
|tk|lk|Total | | 4|v |e |w1| 927 |
|--|--|-------| | 4|t |e |w1| 978 |
| 1|v | 1844 | | 4|n |e |w1| 1142 |
| 1|t | 2198 | | 4|c |e |w1| 1129 |
| 1|n | 2965 | | 1|v |c |w1| 825 |
| 1|c | 2448 | | 1|t |c |w1| 746 |
| 2|v | 2175 | | 1|n |c |w1| 968 |
| 2|t | 2397 | | 1|c |c |w1| 882 |
| 2|n | 3120 | | 2|v |c |w1| 952 |
| 2|c | 2595 | | 2|t |c |w1| 769 |
| 3|v | 2366 | | 2|n |c |w1| 1024 |
| 3|t | 2521 | | 2|c |c |w1| 890 |
| 3|n | 3129 | | 3|v |c |w1| 1023 |
| 3|c | 2804 | | 3|t |c |w1| 795 |
| 4|v | 2583 | | 3|n |c |w1| 1048 |
| 4|t | 2685 | | 3|c |c |w1| 924 |
| 4|n | 3271 | | 4|v |c |w1| 1038 |
| 4|c | 3054 | | 4|t |c |w1| 864 |
| 4|n |c |w1| 1091 |
| 4|c |c |w1| 992 |
| 1|v |p |w1| 14 |
| 1|t |p |w1| 43 |
| 1|n |p |w1| 38 |
| 1|c |p |w1| 89 |
.
.
.
****************************************************************