REPLICATED DATA


Introduction
------------ 

A replicated database is a distributed database in which
multiple copies of some data items are stored at multiple sites.

 - The main reason for using replicated data is to increase data
   availability (with respect to partial system failures).

 - Another goal is improved performance, i.e., data is likely to be
   "closer" to the querying node.

 - The performance benefit is offset by the need to update all copies.
   Thus READS may run faster at the expense of slower WRITES.

The goal is to design a DBS that hides all aspects of data replication
from users' transactions. In other words, we need to design a DBS that
has replication transparency.


CORRECTNESS
------------  
 
A DBS managing a replicated database should behave like a DBS managing a
one-copy database insofar as users can tell.
  - Interleaved executions of transactions on a replicated database must be
    equivalent to a serial execution of those transactions on a one-copy
    database. Such executions are called one-copy serializable (or 1SR).

WRITE-ALL APPROACH
------------------ 

The DBS translates each Read(X) into Read(Xa), where Xa is any copy of data
item X (Xa denotes the copy of X at site A).

It translates each Write(X) into {Write(Xa1),...,Write(Xam)},
where {Xa1,...,Xam} are all copies of X.

And it uses any serializable concurrency control algorithm to synchronize
access to copies. This is the Write-All approach to replicated data.
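
The translation step can be sketched as follows. This is only an
illustration: the catalog COPIES, the function names, and the rule of
preferring a copy at the home site are assumptions, not part of the notes.

```python
# Hypothetical catalog: data item -> sites that hold a copy of it.
COPIES = {"X": ["A", "B", "C"], "Y": ["C"]}


def translate_read(item, home_site):
    """Read(X) -> Read(Xa) on one copy; prefer a copy at the home site."""
    sites = COPIES[item]
    site = home_site if home_site in sites else sites[0]
    return [("Read", item, site)]


def translate_write(item):
    """Write(X) -> Write(Xa1), ..., Write(Xam) on every copy of X."""
    return [("Write", item, site) for site in COPIES[item]]
```

For example, translate_write("X") yields one Write per site in COPIES["X"],
while translate_read("X", "B") yields a single Read on the copy at B.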

Unfortunately, sites can fail and recover, and the Write-All approach
requires the DBS to process each Write(X) by writing into all copies of X,
even if some have failed.

Since there will be times when some copies of X are down, the DBS will not
always be able to write into all copies of X.

In this situation it would have to delay processing Write(X) until it could
write into all copies of X. Such a delay is obviously bad for update
transactions.

More replication of data makes the system less available to update
transactions! So the Write-All approach may be unsatisfactory (poor
performance).

WRITE-ALL-AVAILABLE Approach.
----------------------------

Translate Write(X) into {Write(Xa1),...,Write(Xam)}, where {Xa1,...,Xam}
is the set of available copies of X.
       - Solves the availability problem but may lead to correctness problems.
       - Executions can be non-1SR. The following history H shows how this can happen:
W1[Xa] W1[Xb] W1[Yc] C1 R2[Yc] W2[Xa] C2 R3[Xb] W3[Yc] C3
                       ^                ^ 
                       |                | 
                    site B fails      site B recovers

      - T2 did not write all copies (it missed W2[Xb] because site B was down),
        so T3 reads from Xb the value written by T1, not T2's.
      - One might try to solve the problem by preventing transactions from
        reading copies at sites that have failed and recovered until these
        copies are brought up to date. Unfortunately, this isn't enough.
      - Algorithms that correctly handle failures and recoveries, and thereby
        avoid incorrect executions such as H, are the main topic of this chapter.
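
History H can be replayed with a small simulation. This is a sketch only:
each copy just records which transaction last wrote it, and the names
copies, up, and write_all_available are made up for the illustration.

```python
# Each copy records the name of the transaction that last wrote it.
copies = {("X", "A"): None, ("X", "B"): None, ("Y", "C"): None}
up = {"A": True, "B": True, "C": True}


def write_all_available(txn, item):
    """Write every copy of item whose site is up; return the sites missed."""
    missed = []
    for (name, site) in copies:
        if name != item:
            continue
        if up[site]:
            copies[(name, site)] = txn
        else:
            missed.append(site)
    return missed


write_all_available("T1", "X")           # W1[Xa] W1[Xb]
write_all_available("T1", "Y")           # W1[Yc] C1
up["B"] = False                          # site B fails
r2 = copies[("Y", "C")]                  # R2[Yc] reads T1's value
missed = write_all_available("T2", "X")  # W2[Xa] only; Xb is missed. C2
up["B"] = True                           # site B recovers
r3 = copies[("X", "B")]                  # R3[Xb] reads T1's value, not T2's
write_all_available("T3", "Y")           # W3[Yc] C3
```

In a one-copy database, T3 reading X from T1 would force T3 before T2, while
T2 reading Y from T1 would force T3 (which writes Y) after T2. No serial
order satisfies both, so H is not 1SR.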

SYSTEM ARCHITECTURE.
-------------------

Assume that the DBS is distributed. Each site has a DM and a TM that
manage data and transactions at that site.

DM
   - Centralized; satisfies the Redo rule (all of a transaction's writes are
     on stable storage before it commits)
   - Scheduler is recoverable ( Wi(X) < Rj(X) => Ci < Cj )
   - Scheduler is sensitive only to conflicts on the same copy

TM
  - translates users' r/w on data items into r/w on copies of those data items
  - sends them to the appropriate sites
  - uses an ACP (atomic commitment protocol)

FAILURE ASSUMPTIONS
-------------------

Xa of a data item at site A is available to site B if A correctly executes
each R/W on Xa issued by B and B receives A's ack of that execution.

Xa is unavailable to B if
  1. A doesn't receive the r/w issued by B (comm failure)
  2. A is unable to execute the r/w (A is down, or the media containing Xa failed)
  3. B doesn't receive A's ack (comm failure)

We say copy Xa is available (or unavailable) if it is available (or not 
available) to every site other than A.

DISTRIBUTING WRITES.
-------------------

The DBS can distribute writes to the copies either immediately or deferred:
   - Immediate: as soon as it receives Write(X) from the transaction.
     Uses more communication.

   - Deferred: at the end of the transaction.
       - must maintain an intentions list of deferred updates
       - can piggyback the intentions list on the Vote-Req message
       - aborts are easier to implement
       - slows the ACP
       - delays conflict detection
         Example: T1 and T2 execute concurrently and both write into X;
         T1 uses Xa and T2 uses Xb. Until the DBS distributes T1's replicated
         Write on Xb and T2's replicated Write on Xa, no scheduler will detect
         the conflicting Writes between T1 and T2. With deferred writing this
         happens at the end of T1's and T2's execution. This may be less
         desirable than immediate writing.
         - This disadvantage can be avoided by requiring the DBS to use the
           same copy of each data item, called the primary copy, to execute
           every transaction. For example, the DBS would use the same copy of
           X, Xa, to execute both T1 and T2.
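
A minimal sketch of deferred writing with an intentions list follows. The
class DeferredWriter, its fields, and the message tuples are hypothetical,
and the sketch leaves out the copy the transaction itself executes against.

```python
class DeferredWriter:
    """Buffers a transaction's writes and distributes them at commit time."""

    def __init__(self, copies):
        self.copies = copies      # data item -> list of sites holding a copy
        self.intentions = []      # deferred (item, value) updates
        self.sent = []            # messages actually sent to sites

    def write(self, item, value):
        # Nothing is sent yet; the update is only recorded locally.
        self.intentions.append((item, value))

    def abort(self):
        # Easy abort: no remote copy has been touched.
        self.intentions.clear()

    def commit(self):
        # Group the updates per site and piggyback them on Vote-Req.
        per_site = {}
        for item, value in self.intentions:
            for site in self.copies[item]:
                per_site.setdefault(site, []).append((item, value))
        for site, updates in sorted(per_site.items()):
            self.sent.append((site, "VOTE-REQ", updates))
        self.intentions.clear()
        return self.sent
```

Because no message leaves the home site before commit(), remote schedulers
see the writes only then, which is the delayed conflict detection described
above.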

SERIALIZABILITY
---------------

Replicated Data (RD) history = DBS view of actual execution

One-copy (1C) histories = interpretation of RD in user's single copy view of DB

AN AVAILABLE COPIES ALGORITHM
------------------------------
   - Assume strict 2PL (recall => rigorous)
   - Assume no comm failures
   - Fixed set of copies of each data item, known to all sites
   - Assume each copy is created and fails at most once

 READS   
------

When Ti issues Ri(X), the TM at Ti's home site selects a copy Xa (the
   closest?) and submits Ri(Xa) on behalf of Ti to site A.

Site A regards Xa as initialized if it has already
processed a Write(Xa), even if that write is uncommitted.
  - If A is operational and Xa is initialized, Read(Xa) is processed by the
    scheduler and DM at A (if the final Write(Xa) is not committed, wait).
  - If A rejects Read(Xa), it sends a nack to Ti.
  - If A is not operational or Xa is not initialized, the TM will time out;
    it can then abort Ti or resubmit the read to another copy, Read(Xb)
    (until no more copies remain).

 WRITE
-------

When Ti issues Wi(X), the TM sends Write(Xa) to every site A where a
 copy of X is supposed to be stored.
   - If a site A is down, the TM will time out waiting for its response.
   - If A is up:
          if Xa is initialized, Write(Xa) is processed by the scheduler and DM
          at site A, and a response (rejected or processed) is sent back.

          if Xa is not initialized, A
           1. could use Write(Xa) to initialize Xa, or
           2. could ignore Write(Xa) (acting as if it were down)

The TM waits for the responses (no response = missing write):
    if any Write is rejected, or all are missing writes,
     then  Ti is aborted
     else  Write(X) is successful
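
The TM's decision rule can be written down directly. In this sketch the
response encoding ("processed", "reject", or None for a timeout) and the
function names are assumptions made for the illustration.

```python
def write_outcome(responses):
    """Decide the fate of Write(X) from the per-site responses.

    responses maps site -> "processed", "reject", or None (timeout,
    i.e. a missing write).
    """
    answers = list(responses.values())
    if "reject" in answers:
        return "abort"
    if all(a is None for a in answers):   # every write is missing
        return "abort"
    return "success"


def missing_writes(responses):
    """Sites whose copy could not be written (needed later for validation)."""
    return sorted(site for site, a in responses.items() if a is None)
```

For example, {"A": "processed", "B": None} is a successful Write(X) with one
missing write at B, while {"A": None, "B": None} aborts Ti.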

 VALIDATION
------------

   A validation protocol (VP) is needed to ensure correctness.

  Ti's VP starts after all of Ti's R/W have been acked or have timed out.
 
  The validation protocol consists of 2 steps :

   1. missing writes validation
      -------------------------
      Ti makes sure that all copies it tried to write but couldn't are still
      unavailable.
      - sends an "unavailable(Xa)" message to site A for each such copy Xa
      - waits for responses; if there is no response, proceeds with
        validation, else aborts.

   2. Access validation
      ------------------
      Ti makes sure that all copies it read or wrote are still available.
      - If step 1 succeeds, Ti sends "available(Xa)" to site A for each copy
        Xa that Ti read or wrote. A acks this message if Xa is still available
        at the time A receives it. If all "available" messages are acked,
        access validation succeeds and Ti is allowed to commit; otherwise Ti
        must abort.
      - The Vote-Req message can be used as the "available(Xa)" message.
      - If there are no missing writes, step 1 is unnecessary.
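
The two validation steps can be sketched as one function. Message passing is
replaced here by a hypothetical callback is_available(copy), which returns
True when the copy's site would ack; all names are illustrative.

```python
def validate(missing, accessed, is_available):
    """Return "commit" or "abort" for transaction Ti.

    missing      -- copies Ti tried to write but could not
    accessed     -- copies Ti read or wrote
    is_available -- stand-in for sending a message and getting an ack
    """
    # Step 1: missing writes validation. Any ack means a copy that Ti
    # skipped has become available again, so Ti must abort.
    if any(is_available(copy) for copy in missing):
        return "abort"
    # Step 2: access validation. Every copy Ti used must still be available.
    if all(is_available(copy) for copy in accessed):
        return "commit"
    return "abort"
```

When missing is empty, step 1 passes trivially, which matches the remark
that it is unnecessary if there are no missing writes.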


    
COMMUNICATION FAILURES
----------------------

- not tolerated by available copies algorithms
   - may produce non-1SR executions (example Pg 294)
   - must prevent transactions that access the same data item from
     executing in different components
   - insist that only one component processes transactions
      - each component must be able to decide independently whether it is
        the component that can process transactions

 Site Quorums
 ------------
   - assign a non-negative weight to each site
   - each site knows the total weight of the network
   - Quorum = any set of sites with more than 1/2 of the total weight
   - possibly no component has a quorum
   - transactions in any component can still R/W
        1. non-replicated data
        2. data for which all replicas are in the same component
   - Read-only transactions can be allowed to proceed
   - The major problem is that if transactions have inconsistent views of the
     components, they can produce incorrect results.
    E.g., in a DBS containing (Xa, Xb, Xc), suppose site A's TM executes T1
    believing that only A and B are accessible, while site C's TM executes
    T2 believing that only B and C are accessible, thereby producing H8
    (see page 296).

    Both A and C were able to communicate with B without being able to
    communicate with each other.
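
A quorum test over site weights might look like the sketch below. The
weights and the function name has_quorum are made up for the example.

```python
def has_quorum(component, weights):
    """True if the sites in component hold more than half the total weight."""
    total = sum(weights.values())
    held = sum(weights[site] for site in component)
    return 2 * held > total


# Hypothetical network of three sites with equal weight.
WEIGHTS = {"A": 1, "B": 1, "C": 1}
view_of_a = {"A", "B"}   # what A's TM believes is accessible
view_of_c = {"B", "C"}   # what C's TM believes is accessible
both_pass = has_quorum(view_of_a, WEIGHTS) and has_quorum(view_of_c, WEIGHTS)
```

Here both_pass is True: each TM believes it is in a quorum component because
the two views overlap at B. The quorum test alone therefore does not rule
out the inconsistent-view problem described above.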

   QUORUM CONSENSUS ALGORITHM
   ---------------------------
   - assign a non-negative weight to each copy of X
   - define a read threshold (RT) and a write threshold (WT) for X such that
     both 2*WT and (RT + WT) are greater than the total weight of all copies of X

A read quorum (write quorum) of X is any set of copies of X with total
weight >= RT (WT).
   (each write quorum of X has >= 1 copy in common with every read quorum
    and every other write quorum of X)

  - TM translates r/w's on X to r/w's on copies

     - each Write(X) -> set of Write(Xa)'s on a write quorum of X

     - each Read(X) -> set of Read(Xa)'s on a read quorum of X
     - returns to the transaction the most up-to-date copy read

 - So each copy must be tagged with a version # (initially 0)
 - When the TM processes Write(X), it determines the max version # of the
   copies it's about to write, adds 1, and tags all the copies it writes
   with that new version #
 - Each Read(Xa) returns the copy's value + version #
   (TM always selects the copy with the largest version #)
    - recovery aided by version #

 Main drawback: the need to access multiple remote copies of X in order
 to process Read(X), even if there is a local copy available, thereby
 defeating the purpose of replication.
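
A small sketch of quorum consensus with version numbers follows. The class
QuorumItem and its methods are hypothetical, and locking and failures are
left out entirely.

```python
class QuorumItem:
    """One replicated data item X under quorum consensus (illustration only)."""

    def __init__(self, weights, rt, wt):
        total = sum(weights.values())
        # Threshold conditions from the notes: 2*WT > TOT and RT + WT > TOT.
        assert 2 * wt > total and rt + wt > total
        self.weights, self.rt, self.wt = weights, rt, wt
        # site -> (version #, value); every copy starts at version 0.
        self.copies = {site: (0, None) for site in weights}

    def _check(self, sites, threshold):
        if sum(self.weights[s] for s in sites) < threshold:
            raise ValueError("sites do not form a quorum")

    def write(self, sites, value):
        self._check(sites, self.wt)
        version = max(self.copies[s][0] for s in sites) + 1
        for s in sites:
            self.copies[s] = (version, value)

    def read(self, sites):
        self._check(sites, self.rt)
        # Return the (version #, value) pair with the largest version #.
        return max((self.copies[s] for s in sites), key=lambda c: c[0])
```

With three copies of weight 1 and RT = WT = 2, a write to {A, B} followed by
a write to {B, C} leaves A stale, yet any read quorum still contains at
least one copy carrying the latest version.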

VIRTUAL PARTITION ALGORITHM
----------------------------
    (Read(X) need only access one copy) 

- each copy has a non-negative weight
- each X has a read and a write threshold, RT and WT, with
   2*WT > TOT and RT + WT > TOT (TOT = total weight of all copies of X)
- each site A maintains a "view" V(A) (= the sites it believes it can comm with)
       V(T) = V( home(T) )
- T executes as if the network consisted only of the sites in V(T)
- for the DBS to process Read(X) (Write(X)) of T, V(T) must contain a
  read quorum (write quorum) of X (can be determined without comm); if not, abort T
- DBS uses write-all, read-one within V(T). Pg 305 eg
- during the ACP, T must check whether its view is still correct, and abort if not
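
The local view check and the translation within a view can be sketched as
below. The function names and the example weights are assumptions, and the
view-update protocol itself is not shown.

```python
def can_process(op, view, weights, rt, wt):
    """Check locally whether view holds enough copy weight of X for op.

    weights maps site -> weight of the copy of X stored at that site.
    """
    in_view = sum(w for site, w in weights.items() if site in view)
    return in_view >= (rt if op == "read" else wt)


def translate(op, view, weights, home):
    """Read-one / write-all, restricted to copies at sites in the view."""
    sites = sorted(site for site in weights if site in view)
    if op == "read":
        # Read a single copy; prefer the home site's copy if it has one.
        return [home] if home in sites else sites[:1]
    return sites
```

With copies of X at A, B, C (weight 1 each), RT = 2 and WT = 2, a
transaction whose view is {A, B} may read X at A alone and must write both
Xa and Xb, while a transaction whose view is {C} must abort.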