DISTRIBUTED DATABASE MANAGEMENT SYSTEM (DDBMS)

A DDBMS consists of multiple DBMSs connected over a network
  - users can access data at any site from any site.
  - we assume each site has its own CPU, OS, local DBMS, DBA, users, storage,
              and autonomy (it relies little on centralized control or services)

If all DBMSs are the same (same vendor) we call it a homogeneous DDBMS.

If there are different DBMSs at the different sites, we call it a heterogeneous
   or federated DDBMS.


OBJECTIVES OF DISTRIBUTED DBMS

LOCATION TRANSPARENCY is achieved if users need not know which site has the data
   -simplifies logic of programs
   -allows data movement as usage patterns change

Support for DATA FRAGMENTATION is achieved if logical object (eg file) can be
   divided into multiple physical pieces for storage purposes
   (possibly at different sites - eg, for an accounts file, Fargo customers
    accounts can be stored in Fargo, Grand Forks customer accounts can be stored in GF...)
   FRAGMENTATION TRANSPARENCY is achieved if users need not know an object is fragmented

Support for DATA REPLICATION is achieved if logical objects can have more than 1 physical copy
   -advantages include availability
   -disadvantages include increased update overhead
   REPLICATION TRANSPARENCY is achieved if users need not know that data is replicated
   (Location, fragmentation, replication transparency imply "user need
    not know system is distributed at all")

Additional desirable features include:

LOCAL AUTONOMY.   This is achieved if systems are distributed consistently with the logical
   and physical distribution of the enterprise.  It allows local control over local data,
   local accountability, and less dependency on remote Data Processing centers.

Support for INCREMENTAL GROWTH, AVAILABILITY and RELIABILITY.  Distributed systems can more
   easily allow for graceful (and unlimited) growth simply by adding additional sites.
   The DDBMS software should allow for the easy adding of sites.  Reliability can be provided
   by replicating data.  The DDBMS should allow for replication to enhance reliability and
   availability in the presence of failures of sites or links.



PROBLEMS WITH DISTRIBUTED SYSTEMS
---------------------------------

Long haul networks are usually slow
    (thus, there is a need to minimize number and volume of messages to achieve good overall
     performance - i.e., low response times)



DISTRIBUTED QUERIES

Suppose the query "Find all London suppliers of red parts" is satisfied
         by n records at site B and the user is at site A:

- In a relational system this requires 2 messages: The query must be sent
  from A to B (going) and the results must be sent from B to A (coming back).

- In a nonrelational "record-at-a-time" system 2n messages are required:
"Get first record" (going),"first" record returned (coming back), check condition,
"Get next  record" (going), "next" record returned (coming back), check condition,
   ...
"Get last  record" (going), "last" record returned (coming back), check condition,

Thus, in DDBMSs Query Optimization may be even more important than it is
      in centralized systems.


Query Optimization Methods can be

STATIC (strategy of transmissions and local processing activities is fully
        determined before execution begins - at compile time) or

DYNAMIC (Each step is decided only after seeing results of previous steps).


In long-haul networks, response time is dominated by transmission costs
(local processing times are negligible by comparison and are usually assumed to be zero).

RESPONSE time usually assumed linear in number of bytes, X, sent:    R(X) = AX + B
  where B is the fixed (setup?) cost of the transmission and AX is the variable cost.
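
For instance, a minimal sketch of this cost model in Python (the values A = 1 and B = 10
are the ones used in the distributed-query example further below):

    # Linear transmission-cost model R(X) = A*X + B.
    # A is the per-byte cost, B the fixed setup cost (assumed example values).
    def response_time(x_bytes, A=1, B=10):
        return A * x_bytes + B

    print(response_time(45))   # 55 time units to send 45 bytes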


STATIC QUERY PROCESSING ALGORITHM

usually takes as input: database statistics such as relation sizes, attribute sizes,
                projected sizes of attributes, etc., and

produces as output: a strategy for answering the query (a pattern of what transmissions
      to make, when, where; what local processing to  do, when and where)


Usually involves 4 phases:

LOCAL PROCESSING phase: do all processing (e.g., projections, selections and joins) that
     can be done initially at each site without need for data interchange between sites.
     The end result of this phase is that there will be one participating relation at each
     participating site.

REDUCTION phase: selected "semijoins" are done to reduce the size of the participating
          relations by eliminating tuples that are not needed in answering the query.

TRANSPORT phase: send that one relation from each participating site
          (result of the reduction phase) to the querying site.

COMPLETION phase: finishing up processing using those relations 
           to get final answer (e.g., final projections, selections and joins)


It is commonly assumed that all local processing has a response-time cost of zero
and that all transmissions have a linear response-time cost (sending X bytes causes a delay
of AX + B time units).

What is a semijoin? (in the reduction phase)


SEMIJOIN of R1(A,B) to R2(A,C) on A is: (written:   R1 A-semijoin R2 )

1.  projection R1[A]
2.  R1[A] A-join R2  (select those tuples of R2 that participate in join)

For example,

 STUDENT-FILE 
.________________. 
|S#|SNAME |LCODE |
|==|======|======|
|25|CLAY  |NJ5101| 
|32|THAISZ|NJ5102|
|38|GOOD  |FL6321| 
|17|BAID  |NY2091|
|57|BROWN |NY2092|
`----------------'

 ENROLL FILE
.___________.   
|S#|C#|GRADE|  
|==|==|=====|
|32|8 | 89  |    
|32|7 | 91  |   
|32|6 | 62  |  
|38|6 | 98  | 
 -----------

ENROLL S#-semijoined_to STUDENT

1. project ENROLL onto the S# attribute:

|S#|
|==|
|32|
|38|

2.   Join the two relations on S#

|S#|      |S#|SNAME |LCODE |
|==| join |==|======|======| 
|32|      |25|CLAY  |NJ5101| 
|38|      |32|THAISZ|NJ5102|
          |38|GOOD  |FL6321| 
          |17|BAID  |NY2091|
          |57|BROWN |NY2092|

resulting in:

|S#|SNAME |LCODE |
|==|======|======| 
|32|THAISZ|NJ5102|
|38|GOOD  |FL6321| 


Note that the net effect of a semijoin from R1 to R2 along A is to 
produce a subrelation of R2 consisting of only those tuples from R2
which will participate in the full join of R1 JOIN R2 on A.

That is, the semijoin eliminates from R2 all those tuples that are
unneeded in constructing the join.  A semijoin can also be viewed as a
special SELECTION operator, since it selects out those tuples of
R2 that have a matching A-value in R1.

Thus the semijoin is perfect for reducing the size of relations before they
are sent to the querying site.

Note that semijoins don't always end up reducing the size of a relation.
Consider STUDENT S#-semijoin_to ENROLL

Project STUDENT onto the S# attribute and join it with ENROLL:

|S#|      |S#|C#|GRADE|  
|==|      |==|==|=====|
|25| join |32|8 | 89  |    
|32|      |32|7 | 91  |   
|38|      |32|6 | 62  |  
|17|      |38|6 | 98  | 
|57|

resulting in:

|S#|C#|GRADE|  
|==|==|=====|
|32|8 | 89  |    
|32|7 | 91  |   
|32|6 | 62  |  
|38|6 | 98  | 
 -----------

which is identical to ENROLL.  Therefore this semijoin produced no reduction
at all.
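
A minimal Python sketch of the semijoin operator, reproducing both results above
(relations are modeled here as lists of dicts; the function name and representation
are illustrative only, not part of any DDBMS API):

    # Semijoin of r1 to r2 on attribute a:
    # keep only the tuples of r2 whose a-value appears in r1[a].
    def semijoin(r1, r2, a):
        present = {t[a] for t in r1}                    # step 1: projection r1[a]
        return [t for t in r2 if t[a] in present]       # step 2: select matching r2 tuples

    enroll  = [{"S#": 32, "C#": 8, "GRADE": 89},
               {"S#": 32, "C#": 7, "GRADE": 91},
               {"S#": 32, "C#": 6, "GRADE": 62},
               {"S#": 38, "C#": 6, "GRADE": 98}]
    student = [{"S#": 25, "SNAME": "CLAY",   "LCODE": "NJ5101"},
               {"S#": 32, "SNAME": "THAISZ", "LCODE": "NJ5102"},
               {"S#": 38, "SNAME": "GOOD",   "LCODE": "FL6321"},
               {"S#": 17, "SNAME": "BAID",   "LCODE": "NY2091"},
               {"S#": 57, "SNAME": "BROWN",  "LCODE": "NY2092"}]

    print(semijoin(enroll, student, "S#"))   # keeps only THAISZ and GOOD
    print(semijoin(student, enroll, "S#"))   # identical to ENROLL: no reduction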



Distributed semijoin of R1 at site 1 to R2 at site 2 along attribute A:
1.  projection R1[A]
2.  transmission of R1[A] to R2-site.
3.  R1[A] A-join R2  (select those tuples of R2 that participate in join)


Consider the following distributed query:

Assume  R1 is at site 1 and R2 is at site 2 and the Query arriving at site 3 is:
SELECT R1.A2,R2.A2 FROM R1,R2 WHERE R1.A1=R2.A1

At site 1:                             site 2
R1: A1 A2 A3 A4 A5 A6 A7 A8 A9       R2 A1 A2
    a  A  A  B  C  C  E  A  F           d   1
    a  C  D  D  E  A  A  B  B           e   2
    b  A  B  C  D  B  A  B  A           g   3
    c  D  D  B  B  A  C  A  C
    e  E  B  A  A  C  C  D  D

Assume the response time for transmission of X bytes between any 2 sites is
R(X) = X + 10      time units.


Strategy-1:    (No reduction phase).
1. Send R1 to site 3: 45 bytes sent, cost R(45)=45+10= 55.
2. Send R2 to site 3: 6 bytes sent, cost of R(6)=6+10= 16.
3. Final join (costs 0) gives eEBAACCDD2.    Response time= 71.

Strategy 1': If 1,2 done in parallel,             Resp time 55.


Strategy 2 (Reduction: R2 A1-semijoin R1)
1. Project:  R2[A1] = d e g                      COST=  0.
2. Send R2[A1] to site 1:                R(3) = 3+10 = 13.
3. Do R2 A1-semijoin R1 giving:  eEBAACCDD       COST=  0.
4. Send reduced R1 to site 3:            R(9) = 9+10 = 19.
5. Send R2 to site 3:                    R(6) = 6+10 = 16.
6. Final join gives eEBAACCDD2                    Resp time 48.

Strategy 2': If 2,5 done in parallel,             Resp time 32.
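
A small sketch of the response-time arithmetic behind these strategies, under the stated
assumptions (local processing costs zero, R(X) = X + 10, one byte per attribute value):

    def R(x):                                  # transmission cost for x bytes
        return x + 10

    # Strategy 1: ship both relations to the querying site (site 3).
    s1          = R(45) + R(6)                 # 55 + 16 = 71
    s1_parallel = max(R(45), R(6))             # 55

    # Strategy 2: semijoin-reduce R1 first (the reduced R1 is 9 bytes here).
    s2          = R(3) + R(9) + R(6)           # 13 + 19 + 16 = 48
    s2_parallel = max(R(3) + R(9), R(6))       # 13 + 19 = 32 (steps 2 and 5 overlap)

    print(s1, s1_parallel, s2, s2_parallel)    # 71 55 48 32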


So clearly the reduction phase can reduce response time of query.
For static algorithms, the hard part is to decide at site 3 what
strategy to use without knowing exactly what the data looks like
at the other two sites.

There is a need to estimate the results of the above,
since the actual results are not known in advance at site 3.
The estimation method is important, because the situation can
be very different from the above; for example:

Same query as above but different data:

At site 1:                             site 2
R1: A1 A2 A3 A4 A5 A6 A7 A8 A9       R2 A1 A2
    d  A  A  B  C  C  E  A  F           d   1
    d  C  D  D  E  A  A  B  B           e   2
    e  A  B  C  D  B  A  B  A           g   3
    g  D  D  B  B  A  C  A  C
    e  E  B  A  A  C  C  D  D

Then Strategy 2 would be:
1. Project:  R2[A1] = d,e,g           COST= 0.
2. Send R2[A1] to site 1:    R(3) = 3+10 = 13.
3. R2 A1-semijoin R1: dAABCCEAF   
                      dCDDEAABB   
                      eABCDBABA   
                      gDDBBACAC   
                      eEBAACCDD       COST= 0.
4. Send reduced R1 to site 3 R(45)=45+10 = 55.
5. Send R2 to site 3:        R(6) = 6+10 = 16.
6. Final join dAABCCEAF1
              dCDDEAABB1
              eABCDBABA2 
              gDDBBACAC3 
              eEBAACCDD2     Response time 84.

In fact, Strategy 1 would be better for this data situation.

The question is: "How should reduction phase results be estimated?"

SELECTIVITY THEORY (Hevner, Yao) assumes data values are uniformly distributed
                    and attribute-distributions are independent of each other.

Results estimated as follows: (assuming A1 has domain {a,b,...z})


The selectivity of an attribute is the ratio of the number of values present
(the size of the extant domain) to the number of values possible (the size of the full domain).
Therefore the selectivity of R2.A1 is 3/26.

Using selectivity theory, we estimate the size of the semijoin, R2 A-semijoin R1 as:

(Original size of R1) * (selectivity of incoming attribute, R2.A1)  or  45*3/26 = 5.2

Selectivity theory estimates 5.2 bytes of R1 will survive the semijoin.
This is close for the first example database state, and the algorithm
proposed by Hevner & Yao (ALGORITHM-GENERAL) would correctly select strategy 2.

However, the estimate is way off in the second database state, and ALGORITHM-GENERAL
would still select strategy 2 (not the best for this database state).
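
A sketch of this estimate and the decision it leads to, assuming one byte per attribute
value and the R(X) = X + 10 cost model from the example (the simple comparison below
stands in for ALGORITHM-GENERAL, which is more elaborate):

    full_domain    = 26                        # |{a, b, ..., z}|
    extant_values  = 3                         # d, e, g appear in R2.A1
    selectivity    = extant_values / full_domain          # 3/26, about 0.115

    size_R1        = 45                        # bytes
    est_reduced_R1 = size_R1 * selectivity     # about 5.2 bytes estimated to survive

    def R(x): return x + 10
    est_strategy2  = R(3) + R(est_reduced_R1) + R(6)      # about 44 time units
    strategy1      = R(45) + R(6)                          # 71 time units
    # The estimate is the same for both database states, so the same strategy
    # (strategy 2) is chosen both times - right for the first state, wrong for the second.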




UPDATE PROPAGATION: To update any replicated data item, the DDBMS
                    must propagate the new value consistently to all copies.

IMMEDIATE method: update all copies (the update fails if even 1 copy is unavailable)

PRIMARY   method: designate 1 copy as primary for each item.  Update is deemed
          complete (COMMITTED) when primary copy is updated.
          Primary copy site is responsible for broadcasting the update to the other sites.
          Broadcast can be done in parallel while the transaction is continuing; however, that
                    runs counter to the local autonomy theme.


CONCURRENCY CONTROL
 -in a DDBMS using locking, requests to test, set, release locks require messages
     from Transaction Managers to the Scheduler and back again.

 AUTONOMY method (each site is responsible for its own locks)

          If a transaction, T, updates an item with n replicas, it requires 5n messages:
             n lock requests   (from T to n Schedulers at the replica sites)
             n lock grants     (from the n Schedulers to T)
             n update messages (from T to the n replica sites)
             n acks            (from the n replica sites to T)
             n unlock requests (from T to the n Schedulers)


 PRIMARY copy method (a primary copy site is responsible for all locks)

 If a transaction, T, updates an item with n replicas, it requires only 2n+3 messages:
           1 lock requests     (from T to the primary site Scheduler)
           1 lock grants       (from Scheduler to T)
           n update messages   (from T to the n replica sites)
           n acks              (from the n replica sites to T)
           1 unlock requests   (from T to the Scheduler)
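
A tiny sketch of these message counts (n = number of replica sites):

    def autonomy_msgs(n):        # n lock requests + n grants + n updates + n acks + n unlocks
        return 5 * n

    def primary_copy_msgs(n):    # 1 lock request + 1 grant + n updates + n acks + 1 unlock
        return 2 * n + 3

    print(autonomy_msgs(4), primary_copy_msgs(4))   # 20 vs 11 messages for 4 replicas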




Deadlocks can be very difficult to deal with in DDBMSs.  For instance,

GLOBAL DEADLOCK (in which no one site can detect the deadlock) can happen, for example, as follows:
An Agent is a  representative of a transaction at a given site.
 An agent of T1 is waiting at site A for agent of T2 at site A to release a lock on x.
 An agent of T2 at site A is waiting for an agent of T2 at site B to complete
                          (assuming S2PL) before it can give up its locks.
 An agent of T2 is waiting at site B for an agent of T1 at site B to release a lock on y.
 An agent of T1 at site B is waiting for an agent of T1 at site A to complete before giving up locks.

                  x
SITE-A:       T1 --> T2
              ^      |
              |      |
              |      v
SITE-B:       T1 <-- T2
                  y

Further communication is required to handle global deadlock because
no local Wait-For-Graph, by itself, contains a cycle.
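
A minimal sketch of the problem: neither local Wait-For-Graph has a cycle, but their
union does (the edge representation and helper below are illustrative only):

    # Local wait-for edges: (waiting transaction, transaction being waited for)
    wfg_site_A = [("T1", "T2")]      # at site A, T1 waits for T2's lock on x
    wfg_site_B = [("T2", "T1")]      # at site B, T2 waits for T1's lock on y

    def has_cycle(edges):
        graph = {}
        for u, v in edges:
            graph.setdefault(u, []).append(v)
        def reachable(src, dst, seen=()):
            return any(w == dst or (w not in seen and reachable(w, dst, seen + (w,)))
                       for w in graph.get(src, []))
        return any(reachable(t, t) for t in graph)

    print(has_cycle(wfg_site_A))                  # False: no cycle visible at site A
    print(has_cycle(wfg_site_B))                  # False: no cycle visible at site B
    print(has_cycle(wfg_site_A + wfg_site_B))     # True:  the union reveals the deadlock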




RELAXING autonomy requirement

A Heterogeneous Distributed DBMS (HDDBMS) integrates networked local
DBMSs to support global transactions (those that need data at more than 1 site).

The HDDBMS design approach is based on the following assumptions.

1. The distributed database system evolved bottom-up.  That is, the user
enterprise already had many distinct and independent DBMSs in place which
must be integrated into one HDDBMS.  As a consequence, the internal algorithms
of local DBMSs can be known but cannot be altered, and a local DBMS may execute
transactions in an order other than submission or arrival order.

2. The HDDBMS is a general-purpose system (i.e., not customized for a specific
enterprise or specific data state).

3. Global transaction restarts should be avoided whenever possible (network &
local processing work is wasted).





Indirect conflict is a special Concurrency Control problem in
 Heterogeneous DDBMSs which does not come up in centralized or
 even in homogeneous DDBMSs.


Suppose T1 is a local transaction (not known to the
 heterogeneous DDBMS harness software) and that it is executing
concurrently with two global transactions, T2 and T4, as below

 (the only ordering constraint is that T2 issues r2(a) first, then w2(c)):


                         SITE2
                         T2: r2(a),w2(c)
                             ^       \
                            /         \
                           /           \
                          /             \
            .--------- > a               v
T1: r1(b),w1(a)     SITE1                c   SITE3
       ^---------------  b               ^
                          ^.            /
                            \          /
                             \        /
                         T4: w4(b),w4(c)
                         SITE4


An ordering of these concurrent transaction operations that is acceptable
to each local DBMS and to the global HDDBMS is the following:

The Global Concurrency Controller would see no conflict between T2 and T4
and therefore would not impose any restrictions on how they are executed,
except that r2(a) is executed before w2(c).

DBMS1 (at site 1) could decide on the interleaving:
       w4(b) first,
       r1(b) next,
       w1(a) next.
Producing an ordering:  T4 < T1 < T2

DBMS3 (at site 3) could decide on the interleaving:
       w2(c) first,
       w4(c) next.
Producing an ordering:  T2 < T4


Note that there is no global serializable ordering, since

 T4 comes before T2 at SITE1 and
 T2 comes before T4 at SITE3.

That is, the global ordering is not equivalent to any serial order
(DBMS1 requires that T4 < T2 and DBMS3 requires that T2 < T4).


So, the execution history is non-serializable but no heterogeneous
DDBMS could detect it, assuming each local system is "autonomous"
in that it is a binary licensed commercial off-the-shelf product.



One solution is to force all distributed transactions to "be in conflict"
at each site at which they have common data access needs.

This can be done simply by forcing each to write its id to a bogus data
 item called the "site ticket" (t1 at site-1 and t3 at site-3).

In this way (since they are both writing to a common data item),
 they are known to be in conflict at that site by the HDDBMS, and
 therefore the HDDBMS can enforce a particular order.

                         SITE2

         T2: w2(t1),w2(t3),r2(a),w2(c)
                  \      \    ^      \
                   \      \  /        \
                    \      \/          \
                     \_____/\________   \
                          /\         \   \
            .--------- > a  v         v   v
T1: r1(b),w1(a)      SITE1  t1        t3  c   SITE3
       ^---------------  b   ^        ^   ^
                         ^  /        /   /
                          \/  ______/   /
                          /\ /         /
                         /  X         /
                    ____/  / \       /
                   /      /   \     /
          T4: w4(t1),w4(t3),w4(b),w4(c)

                         SITE4


HDDBMS now sees a conflict between T2 and T4, so it decides
 upon a compatible order for T2's and T4's accesses to t1 and t3, e.g., T2 < T4.
 Thus, the HDDBMS forces w2(t1) to be acknowledged before w4(t1) is sent, and
       the HDDBMS forces w2(t3) to be acknowledged before w4(t3) is sent.


Then DBMS1 will have to order T2 before T4 at SITE1 and
     DBMS3 will have to order T2 before T4 at SITE3.










**************************************
APPENDIX: More on Concurrency Control.
-------------------------------------

Example program:

    PROCEDURE P begin
      Start;
      temp:=Read(x);
      temp:=temp+1;
      Write(x,temp);
      Commit
    END

    for the purposes of concurrency control theory this program
    can be represented as the "History":

H0       r1[x] -> w1[x] -> c1

where r1[x] stands for the read operation by transaction-1 on dataitem, x.
and   w1[x] stands for the write operation by transaction-1 on dataitem, x.
and   c1    stands for the commit operation of transaction-1.

This graph is called the Serialization Graph (SG).  The nodes of SG are operations
and each edge represents the execution order of two conflicting operations.

(Note we don't even care what value gets written.  The value written is an
"uninterpreted" feature or characteristic, as far as concurrency control
 is concerned)


More Examples:
--------------
Consider three transactions,

T1  r1[x]--> w1[x]--> c1   
         
T3  r3[x]--> w3[y]--> w3[x]--> c3
   
T4  r4[y]--> w4[x]--> w4[y]--> w4[z]--> c4


An example of a complete history (execution) for { T1,T3,T4 } as a SG is:


           r3[x]-> w3[y]-> w3[x]-> c3 
            ^        ^    
            |        |    
H1 r4[y]-> w4[x]-> w4[y]-> w4[z]-> c4 
            ^    
            |  
   r1[x]-> w1[x]-> c1  


We use simple left-to-right ordering to indicate the Von Neumann 
order of execution of operations when we wish to do so
(assuming a Von Neumann machine - one which executes one operation
at a time):



Four possible Von Neumann histories over the transactions,

	T1 = w1[x] w1[y] w1[z] c1
	T2 = r2[u] w2[x] r2[y] w2[y] c2    are:

H2   w1x w1y r2u w2x r2y w2y c2 w1z c1 
H3   w1x w1y r2u w2x r2y w2y w1z c1 c2 
H4   w1x w1y r2u w2x w1z c1 r2y w2y c2 
H5   w1x w1y r2u w1z c1 w2x r2y w2y c2 
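
A small sketch that checks such a history for serializability by building the
transaction-level precedence graph (two operations conflict when they come from different
transactions, touch the same item, and at least one is a write).  H2 below is the first
history above, with the commit operations omitted since they carry no conflicts:

    def conflicts(op1, op2):
        (a1, t1, x1), (a2, t2, x2) = op1, op2
        return t1 != t2 and x1 == x2 and "w" in (a1, a2)

    def serializable(history):          # history: list of (action, txn, item) in execution order
        edges = {(o1[1], o2[1]) for i, o1 in enumerate(history)
                                for o2 in history[i + 1:] if conflicts(o1, o2)}
        def reach(u, v, seen=()):       # is v reachable from u in the precedence graph?
            return any(b == v or (b not in seen and reach(b, v, seen + (b,)))
                       for a, b in edges if a == u)
        return not any(reach(t, t) for t, _ in edges)

    H2 = [("w", "1", "x"), ("w", "1", "y"), ("r", "2", "u"),
          ("w", "2", "x"), ("r", "2", "y"), ("w", "2", "y")]
    print(serializable(H2))             # True: every conflict orders T1 before T2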




APPENDIX 2:
----------

Another approach to ROLL:   ROLL is a queue object of RVs
Methods available to transactions:
1. QUEUE(RV)     {returns ROLL address of RV}
2. READ(a,b)    {returns RVs from address a to b}
   where a must be an address returned to the TM from a POST
     and b must be an address returned to the TM from a POST or "HEAD"
   ("address" could be simply "position number" and HEAD could be position 0)
3. SET(a,list)
     sets the bits in all positions in the list of the RV at address a to zero
     where a must be an address returned to the TM from a POST
     list can be any comma list of numbers and/or intervals

Then TMs are responsible to compute AV from the link list returned by READ(a,HEAD)
and  TMs are responsible to determine validity of a rePOST from the return of READ(a,b)

Inefficient due to massive return parameters from READ.
Background daemon does the garbage collection periodically.
Could have an AV-Daemon accepting FastREAD requests from transactions, maintaining AVs
and returning the appropriate AV immediately.  AV-Daemon has read access to all RVs.

May be a good model for DataCycleROLL.
How would that work in a Beowulf Cluster such as MiDAS?
  - one transaction manager per node and round robin assignment of arriving





ROLL in a HETEROGENEOUS DISTRIBUTED DATABASE ENVIRONMENT


HYDRO TRANSACTION MODEL: This global trans processing model is assumed.


     T1                   T2
     |                    | ...
     V                    V
s:  TM1                  TM2
   / | \                / | \
  /  |  \              /  |  \
 /   |   \            /   |   \
 1    2    n          1    2    n
T    T    T          T    T    T
 1    1    1          2    2    2
|    |    |          |    |    |
v    v    v          v    v    v
.-------------------------------
|                   HYDRO OBJECT|
`-------------------------------'
|    |    |          |    |    |
v    v    v          v    v    v
 1    1    1          2    2    2
T    T    T          T    T    T
 1    2    n          1    2    n
|    |    |          |    |    |
v    v    v          v    v    v
LOCAL HYDRO          LOCAL HYDRO
  DRIVER1              DRIVER2
     |                    |
     V                    V
   DBMS1                DBMS2

TMi's are the Global Trans Mgrs (query decomp & translation)

 i
T  subtrans or agent of Tj running at site i.
 j


         HYDRO protocol

At each site, there is a Local Trans ROLL (LROLL) and a Global Trans ROLL (GROLL)

      i
Each T  is submitted to its site in Serialization Partial Order by the HYDRO OBJECT.
      j                     

The LOCAL-HYDRO-DRIVER POSTs as below, then acknowledges to HYDRO
(guaranteeing POST order and serialization p.o. compatible).


POSTing:

Local Trans POST to the LROLL only, but
get a "position" in both ROLLs (read both tail-pointer values).

Global Trans POST to the GROLL only, but get a "position" in both ROLLs.


CHECKing:
--------
Local Trans CHECK (LCHECK) GROLL only (conflicts between pairs of local
trans are the responsibility of the local DBMS) If there are no
conflicts submit, else wait.

Global Trans CHECK (GCHECK) both LROLL and GROLL.
If there are no conflicts, submit else wait.

RELEASE: Same.
-------


VALIDATE: (For optimistic Trans)
---------
Submit following POSTs (as above).

Before COMMITing, validate the submission by restarting if CHECK
(LCHECK or GCHECK) show conflict.



Another Approach to HDDBMS Concurrency Control is the TICKET Method:

In the ticket method, each Local subtransaction must write to (write anything)
a special data item set up at that site (called the "site ticket") before submitting
any other operations.  The Global Scheduler simply makes sure that all
subtransactions submit their ticket write in a pre-established serialization
partial order (e.g., using a global ROLL or some other mechanism).

Then the overall global serialization partial order is going to be consistent with
that preestablished partial order, since each local DBMS enforces local serializability
(among local transactions and the global subtransactions that have written to the ticket).

Basically, this method puts all global transactions artificially in conflict with each
other (since they all write the same data item - the ticket) and therefore allows the
local Serializable schedulers to finish the job of overall serializable scheduling.
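
A minimal sketch of the ticket idea (the scheduler class below is illustrative, not any
real HDDBMS interface): the global scheduler only has to release the ticket writes in one
pre-established order; each local DBMS's own serializable scheduler then does the rest.

    class GlobalScheduler:
        """Hands out ticket-write permission in one pre-established order
        (here: T2 before T4)."""
        def __init__(self, global_order):
            self.global_order = global_order        # e.g. ["T2", "T4"]
            self.done = {}                          # site -> global txns that wrote its ticket

        def request_ticket_write(self, txn, site):
            written = self.done.setdefault(site, [])
            earlier = self.global_order[: self.global_order.index(txn)]
            # allow w_txn(ticket) only after every earlier global transaction
            # has had its ticket write at this site acknowledged
            if all(t in written for t in earlier):
                written.append(txn)
                return True
            return False                            # caller must wait and retry

    gs = GlobalScheduler(["T2", "T4"])
    print(gs.request_ticket_write("T4", "SITE1"))   # False: T2 has not written t1 yet
    print(gs.request_ticket_write("T2", "SITE1"))   # True
    print(gs.request_ticket_write("T4", "SITE1"))   # True: T4 may now write t1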

**************************************************************************************

COMMITMENT:


The COMMIT of a distributed transaction: (recall that before the DDBMS can acknowledge
    the COMMIT of a distributed transaction, it must be sure that all sites can successfully
    make all changes permanent, so as to guarantee the durability requirement.
    If even one site cannot, the COMMIT must be rejected.)


Marriage vows are an example of a distributed transaction commit protocol

Unless all parties are ready to commit, there is no marriage!

There are two phases, a VOTE phase and a DECISION phase:

VOTE PHASE.

There is a COORDINATOR (usually a minister, priest, justice_of_the_peace,...) who
issues VOTE_REQUESTS to the PARTICIPANTS, who vote yes or no.

Implicit in this request is the assumption that a yes vote means you are ready to get married
no matter what happens in the meantime (such as the other party hesitating or ...?) and
that a no vote means you are not going to go any further with it.  So participants who
vote YES must be in the "READY" state - ready either to get married or not, depending upon
the decision made by the coordinator:


Coord: "Lou, do you take Pat to be your lawfully wedded spouse?"   Lou: "yes!"
Coord: "Pat, do you take Lou to be your lawfully wedded spouse?"   Pat: "yes!"
Coord:  Does anyone here present know of any reason why these two people should not be joined in matrimony?
              (ie, Is everyone else ready to commit this transaction?   "yes",
                                                                        "yes",
                                                                         ...
                                                                        "yes".)

DECISION PHASE:
The Coordinator counts up the votes and makes the DECISION.

If all the votes are YES, coordinator decides COMMIT and broadcasts the decision to the participants:
  "I pronounce you married!".

If any vote is NO, the coordinator decides ABORT and broadcasts the decision to the participants:
  "Sorry folks!"



In DDBMSs we use a similar protocol for distributed transactions called "Two-Phase-Commit" (2PC).

The Coordinator (the system component managing commit protocol within the Transaction Manager)
             does the following upon receiving a COMMIT request from the transaction:


1. Coordinator requests all participant agents to VOTE yes/no (i.e., whether they are able to
   either commit or abort - the READY STATE - by force-writing to disk a log of all activities
   done locally).  If the local agent can complete its local activities and force-write a READY
   record to the log with enough detail to be able to either COMMIT or ABORT when told to by
   the Coordinator, it is successful and it votes YES; else it votes NO and aborts.

2. Coordinator forcewrites "decision" entry to the log with COMMIT
   if all replies are YES, else ROLLBACK; then informs each agent
   of the decision and each agent COMMITS or ROLLSBACK accordingly.

If the system fails, the restart procedure looks for a decision record.
  If one is on disk, the 2-phase commit process can be picked up from
  where it left off.  Otherwise the restart process must assume ROLLBACK and
  proceed that way.
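
A compact sketch of this vote/decision structure (message passing and force-writes are
simulated with ordinary calls and an in-memory log; this illustrates the protocol's shape,
not a real DDBMS implementation):

    class Participant:
        def __init__(self, name, can_commit=True):
            self.name, self.can_commit, self.log = name, can_commit, []

        def vote(self):                       # phase 1: force-write READY, then vote
            if self.can_commit:
                self.log.append("READY")      # enough detail to later commit or abort
                return "YES"
            self.log.append("ABORT")
            return "NO"

        def decide(self, decision):           # phase 2: act on the coordinator's decision
            self.log.append(decision)

    def coordinator(participants):
        votes = [p.vote() for p in participants]                  # VOTE phase
        decision = "COMMIT" if all(v == "YES" for v in votes) else "ROLLBACK"
        for p in participants:                                    # DECISION phase
            p.decide(decision)
        return decision

    sites = [Participant("S1"), Participant("S2"), Participant("S3", can_commit=False)]
    print(coordinator(sites))          # ROLLBACK: one NO vote aborts the whole commit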


TWO PHASE COMMIT (2PC):

C  = Coordinator
Si = Subordinate-i
RP = Recovery Process

0. SEND "   5 ALL (NOT ALL) YES, FORCE COMMIT (ABORT) LOG, Move to COMMIT (ABORT) STATE
"PREPARE"   6 SEND COMMIT (ABORT to all NON-NO voters)
       
   C        C----------------------------------->C       C gets ACK
   v        ^                                    v       ^    11 WRITE "END-TRANS" LOCALLY
   |        |                                    |       |    12 FORGETS TRANSACTION
PREPARE  NO/YES                                COMMIT/   ACK
            |                                  ABORT     |
 /   \      |                                  /   \     | 
S1    S2    |                                S1    S2    |
 `----`---->'                                 `----`---->'

Si READY TO COMMIT (ABORT),                     Si->COMMIT (ABORT) STATE

1 FORCE "READY"(ABORT) LOG & ENTER READY ST     7 FORCE "COMMIT" (ABORT) LOGREC
2 SEND YES (NO)                                 8 SEND ACK (NON-NO's SEND ACK)
3 (ABORT LOCALLY)                               9 COMMIT(ABORT) LOCALLY
4 (FORGET TRANS)                               10 FORGET TRANS


FAILURE TYPES:
-------------
A. Si WRITES NO LOG, Si FAILS:                 C. C WRITES NO LOG, C FAILS:
   RP must UNDO TRANS, FORCE ABORT LOGREC,        RP UNDO TRANS, FORCE ABORT LOGREC, answers
   FORGET TRANS.                                  all inquiries with ABORT, FORGET TRANS.

B. ALL Si in "READY" STATE when an Si FAILS:   D. R IN COMMIT (ABORT) STATE C FAILS:
   RP SENDS "YES" (As an inquiry)                 RP COMMIT(ABORT) Send COMMIT to Si's, wait for ACKs back,
                                                  writes "END", FORGETS TRANS.

E. R NOTES Site (OR LINK) FAILURE, BROADCASTS THE ABORT, When ALL ACKs come back,
                  R WRITES "END" RECORD and FORGETS the TRANSACTION.
F. Si NOTES C has FAILED BEFORE(AFTER) "READY", Si ABORTS (trans RP goes to B)
G. RP GETS INQUIRY FROM a "READY" Si & HAS NO STATE INFO ON TRANS, SENDS ABORT.
H. RP GETS INQUIRY FROM a "READY" Si & C IS IN THE COMMIT(ABORT) STATE, SENDS COMMIT(ABORT).



TWO PHASE PRESUMED ABORT (2PPA)
------------------------------
Note in A & C, if the RP finds no log records, the RP decides to ABORT.
Thus, it is safe for R to forget trans after ABORT decision (gets a NO vote) &
      writes (not forcewrite) ABORT logrec (no need for subord names or END rec)

For read-only trans,
2'.   SEND "READ-VOTE" (instead of YES) & forget the transaction (combining 1,2,7-10 into 2',10')
Coord need not send decision at all (eliminating 6,11,12)
For entirely read-only transactions, no second phase at all.
For partially read-only transactions, this eliminates the second phase for those subordinates.
------------------------------


TWO PHASE PRESUMED COMMIT (2PPC)
------------------------------
Since most transactions commit;
QUESTION: By requiring ACKs to ABORTS, can commits be made simpler?

2PPC: require ABORT ACKs, no COMMIT-ACKs
      force ABORT-LOGRECs, no force subord-COMMIT-LOGRECs
------------------------------


An interesting Thesis of a former 765 student related to Atomic Commitment:

ACP & DQP Over Active Networks

