REPLICATED DATA
Introduction
------------
A replicated database is a distributed database in which
multiple copies of some data items are stored at multiple sites.
- The main reason for using replicated data is to increase data
availability (with respect to partial system failures).
- Another goal is improved performance, i.e., data is likely to be
"closer" to the querying node.
- The performance benefit is offset by the need to update all copies.
Thus READS may run faster at the expense of slower WRITES.
The goal is to design a DBS that hides all aspects of data replication
from user transactions.
In other words, we need to design a DBS that provides
replication transparency.
CORRECTNESS
------------
A DBS managing a replicated database should behave like a DBS managing a
one-copy database insofar as users can tell.
- Interleaved execution of transactions on a replicated database must be
equivalent to a serial execution of the transactions on a one-copy
database. Such executions are called one-copy serializable (or 1SR).
WRITE-ALL APPROACH
------------------
The DBS translates each Read(X) into Read(Xa), where Xa is any copy of data
item X (Xa denotes the copy of X at site A).
It translates each Write(X) into {Write(Xa1),...,Write(Xam)},
where {Xa1,...,Xam} are all copies of X.
It uses any serializable concurrency control algorithm to synchronize
access to copies. This is the Write-All approach to replicated data.
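The translation above can be sketched as follows. This is only an
illustration: the COPIES catalog, the tuple format of the operations, and
the "preferred site" rule for picking a copy to read are assumptions made
for the example, not part of the notes.

```python
# Hypothetical replica catalog: data item -> sites that hold a copy of it.
COPIES = {"X": ["A", "B", "C"], "Y": ["C"]}

def translate_read(item, preferred=None):
    """Read(X) -> Read(Xa) on any one copy; use the preferred site if it holds one."""
    sites = COPIES[item]
    site = preferred if preferred in sites else sites[0]
    return [("R", item, site)]

def translate_write(item):
    """Write(X) -> a Write on every copy of X (Write-All)."""
    return [("W", item, site) for site in COPIES[item]]
```

A read touches one copy while a write touches all of them, which is why
reads get cheaper and writes get more expensive as copies are added.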
Unfortunately, sites can fail and recover, and the Write-All approach
requires the DBS to process each Write(X) by writing into all copies of X,
even if some have failed.
Since there will be times when some copies of X are down, the DBS will not
always be able to write into all copies of X.
In this situation it would have to delay processing Write(X) until it could
write into all copies of X. Such a delay is obviously bad for update transactions.
More replication of data makes the system less available to update transactions!
So the Write-All approach may be unsatisfactory (poor performance).
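A rough way to see why more copies hurt updates: assume (this is an
assumption for illustration, not something the notes state) that each site
is up independently with probability p. Write-All needs all n copies up,
while a read needs only one.

```python
def write_all_availability(p, n):
    """Probability that all n copies are up, so Write(X) can proceed at once."""
    return p ** n

def read_one_availability(p, n):
    """Probability that at least one of the n copies is up, so Read(X) can proceed."""
    return 1 - (1 - p) ** n
```

With p = 0.9, going from 3 to 5 copies lowers write availability (about
0.73 to about 0.59) while read availability moves even closer to 1.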
WRITE-ALL-AVAILABLE Approach.
----------------------------
Translate Write(X) into {Write(Xa1),...,Write(Xam)}, where {Xa1,...,Xam} is
the set of available copies of X.
- Solves the availability problem but may lead to correctness problems.
- Executions can be non-1SR. The following history H shows how this can happen:
H = W1[Xa] W1[Xb] W1[Yc] C1 R2[Yc] W2[Xa] C2 R3[Xb] W3[Yc] C3
(site B fails after C1 and before W2[Xa]; site B recovers after C2 and
before R3[Xb])
- T2 did not write all copies (it missed W2[Xb]), so T3 reads a stale Xb.
- One might try to solve the problem by preventing transactions from reading
copies at sites that have failed and recovered until these copies are
brought up to date. Unfortunately, this isn't enough.
- Algorithms that correctly handle failures and recoveries, and thereby avoid
incorrect executions such as H, are the main topic of this chapter.
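History H can be replayed with a small simulation. This is a sketch under
simplifying assumptions: each copy just records which transaction last
wrote it, and the "down" set stands in for site failure (both are
illustrative constructs, not from the notes).

```python
# Each copy records the name of the transaction that last wrote it.
copies = {"Xa": None, "Xb": None, "Yc": None}
down = set()  # copies that are currently unavailable

def write_all_available(txn, item_copies):
    """Write only the copies that are up; return the copies that were missed."""
    missed = []
    for c in item_copies:
        if c in down:
            missed.append(c)
        else:
            copies[c] = txn
    return missed

write_all_available("T1", ["Xa", "Xb"])              # W1[Xa] W1[Xb]
write_all_available("T1", ["Yc"])                    # W1[Yc], then C1
down.add("Xb")                                       # site B fails
t2_read_y = copies["Yc"]                             # R2[Yc] reads from T1
t2_missed = write_all_available("T2", ["Xa", "Xb"])  # W2[Xa] only, then C2
down.discard("Xb")                                   # site B recovers, Xb still stale
t3_read_x = copies["Xb"]                             # R3[Xb] reads from T1, not T2
write_all_available("T3", ["Yc"])                    # W3[Yc], then C3
```

T3 reads X from T1 even though T2 wrote X later, so in a one-copy serial
order T3 would have to precede T2. But T2 reads Y from T1 and T3 overwrites
Y, so T2 would have to precede T3. No serial one-copy order satisfies both.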
SYSTEM ARCHITECTURE.
-------------------
Assume that the DBS is distributed. Each site has a data manager (DM) and a
transaction manager (TM) that manage data and transactions at that site.
DM
- behaves like a centralized DBS; satisfies the Redo rule (all of a
transaction's writes are on stable storage before it commits)
- Scheduler is recoverable ( Wi(X) < Rj(X) => Ci < Cj )
- Scheduler is only sensitive to conflicts on the same copy
TM
- translates users' r/w on data items into r/w on copies of those data items
- sends them to the appropriate sites
- uses an atomic commitment protocol (ACP)
FAILURE ASSUMPTIONS
-------------------
Copy Xa of data item X at site A is available to site B if A correctly executes
each R/W on Xa issued by B and B receives A's ack of that execution.
Xa is unavailable to B if
1. A doesn't get the r/w issued by B (comm failure)
2. A is unable to execute the r/w (A is down, or failure of the media containing Xa)
3. B doesn't get the ack (comm failure)
We say copy Xa is available (or unavailable) if it is available (or
unavailable) to every site other than A.
DISTRIBUTING WRITES.
-------------------
The DBS can distribute writes immediately or defer them.
- immediate: as it receives Write(X) from the transaction. More communication.
- deferred
- must maintain an intentions list of deferred updates
- can piggyback the deferred writes on the Vote-Req message
- aborts are easier to implement
- slows the ACP
- delays conflict detection
Example: T1 and T2 execute concurrently and both write into X.
T1 uses Xa and T2 uses Xb. Until the DBS distributes T1's replicated
Write on Xb and T2's replicated Write on Xa, no scheduler will detect
the conflicting Writes of T1 and T2. With deferred writing this
happens at the end of T1's and T2's execution, which may be less
desirable than immediate writing.
- This disadvantage can be avoided by requiring the DBS to use the same
copy of each data item, called the primary copy, to execute every
transaction. For example, the DBS would use the same copy of X, say Xa, to
execute both T1 and T2.
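Deferred writing with an intentions list might look roughly like this. The
class name, method names, and payload format are hypothetical; the sketch
only shows that nothing is shipped to other sites until commit time, so an
abort is just a matter of discarding the list.

```python
class DeferredWriter:
    """Per-transaction buffer of deferred updates (an intentions list)."""

    def __init__(self):
        self.intentions = {}  # data item -> latest value written by this transaction

    def write(self, item, value):
        # Record the update locally; no message is sent to any other site yet.
        self.intentions[item] = value

    def abort(self):
        # No remote copies were written, so there is nothing to undo remotely.
        self.intentions.clear()

    def vote_req_payload(self, copies_of):
        """Expand the list into per-copy writes to piggyback on Vote-Req."""
        return [(item, site, value)
                for item, value in self.intentions.items()
                for site in copies_of[item]]
```

Only the last value written to each item is shipped, which is where the
saving in communication comes from. The cost is that other sites' schedulers
see the writes, and any conflicts, only at commit time.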
SERIALIZABILITY
---------------
Replicated data (RD) history = the DBS's view of the actual execution
One-copy (1C) history = interpretation of the RD history in the user's single-copy view of the DB
AN AVAILABLE COPIES ALGORITHM
------------------------------
- Assume strict 2PL (recall => rigorous)
- Assume no comm failures
- Fixed set of copies of each data item, known to all sites
- Assume each copy is created and fails at most once
READS
------
Ti issues Ri(X); the TM at Ti's home site selects a copy Xa (the closest?) and
submits Ri(Xa) on behalf of Ti to site A.
Site A regards Xa as initialized if it has already
processed a Write(Xa), even if that Write is uncommitted.
- If A is operational and Xa is initialized, Read(Xa) is processed by the
scheduler and DM at A (if the final Write(Xa) is not committed, wait)
- If the scheduler rejects Read(Xa), a nack is returned to Ti
- If A is not operational or Xa is not initialized, the TM will time out; it
could abort Ti or resubmit the read as Read(Xb) on another copy (until no
more copies remain)
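The read rule can be sketched as a loop over candidate copies. Two
assumptions are made here that the notes leave open: a nack is treated as an
abort of Ti, and on a timeout the TM always resubmits to the next copy
rather than aborting right away. The status strings are illustrative.

```python
def read_item(copy_status, order):
    """Try the copies of X in the given order.

    copy_status maps site -> 'ok' | 'reject' | 'timeout', where 'timeout'
    stands for a site that is down or whose copy is not initialized.
    """
    for site in order:
        status = copy_status[site]
        if status == "ok":
            return ("read", site)
        if status == "reject":
            return ("abort", site)  # scheduler sent a nack (assumed to abort Ti)
        # timeout: fall through and resubmit the read to the next copy
    return ("abort", None)  # no more copies to try
```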
WRITE
-------
Ti issues Wi(X); the TM sends Write(Xa) to every operational site A where a
copy of X is supposed to be stored.
- If a site A is down, the TM will time out waiting for its response
- If A is up:
if Xa is initialized, then Write(Xa) is processed by the scheduler and DM
at site A, and a response (rejected/processed) is sent back.
if Xa is not initialized, A
1. could use Write(Xa) to initialize it
2. could ignore Write(Xa) (acts as if it were down)
The TM waits for responses (no response = missing write)
if any Write is rejected or all are missing writes,
then Ti is aborted
else Write(X) is successful
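The TM's decision once the responses are in can be written as a small
function. The response encoding (None for a timeout) is an assumption made
for this sketch.

```python
def write_outcome(responses):
    """Decide the fate of Write(X) from the per-site responses.

    responses maps site -> 'processed' | 'reject' | None, where None means
    no response arrived (a missing write).
    """
    missing = [site for site, r in responses.items() if r is None]
    rejected = any(r == "reject" for r in responses.values())
    if rejected or len(missing) == len(responses):
        return "abort", missing
    return "success", missing
```

The list of missing writes is returned as well, since the transaction needs
it later for missing writes validation.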
VALIDATION
------------
Need validation protocol to ensure correctness
Ti's validation protocol (VP) starts after all its R/Ws have been acked or timed out.
The validation protocol consists of 2 steps :
1. missing writes validation
-------------------------
Ti makes sure all copies it tried to write but couldn't are still unavailable
- sends an "unavailable(Xa)" message to site A for each such copy Xa
- waits for a response; if there is no response, proceeds with validation,
else aborts.
2. Access validation
------------------
Ti makes sure all copies it read or wrote are still available
- if step 1 succeeds, Ti sends "available(Xa)" to site A for each copy
Xa that Ti read or wrote. A acks this message if Xa is still available at
the time A receives the message. If all "available" messages are acked, then
access validation succeeds and Ti is allowed to commit; otherwise Ti must abort.
- the Vote-Req message can be used as the "available(Xa)" message
- if there are no missing writes, step 1 is unnecessary
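The two validation steps can be sketched together. The message exchange is
abstracted away: the two set arguments stand for the outcome of sending the
"unavailable" and "available" messages, and their names are hypothetical.

```python
def validate(missing_writes, accessed, replied_to_unavailable, acked_available):
    """Run missing writes validation, then access validation.

    missing_writes:         copies Ti tried to write but could not
    accessed:               copies Ti actually read or wrote
    replied_to_unavailable: copies whose site answered "unavailable(Xa)"
    acked_available:        copies whose site acked "available(Xa)"
    """
    # Step 1: any reply means a copy Ti missed has become available again.
    if any(c in replied_to_unavailable for c in missing_writes):
        return "abort"
    # Step 2: every copy Ti read or wrote must still be available.
    if all(c in acked_available for c in accessed):
        return "commit"
    return "abort"
```

With no missing writes, step 1 checks nothing and only access validation
decides the outcome.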
COMMUNICATION FAILURES
----------------------
- not tolerated by available copies algorithms
- may produce non-1SR executions (example: pg 294)
- must prevent transactions that access the same data item from executing
in different components
- insist that only one component process transactions
- each component must be able to decide independently whether it is the
component that can process them
Site Quorums
------------
- assign a non-negative weight to each site
- each site knows the total network weight
- Quorum = any set of sites with > 1/2 of the total weight
- possible that no component has a quorum
- transactions in a non-quorum component can still R/W
1. non-replicated data
2. data for which all replicas are in the same component
- Read-only transactions can be allowed to proceed
- Major problem: if transactions have inconsistent views of the components,
they can produce incorrect results.
e.g.: a DBS containing (Xa, Xb, Xc). Suppose site A's TM executes T1
believing that only A and B are accessible, while site C's TM executes
T2 believing that only B and C are accessible, thereby producing H8
(see page 296).
Both A and C were able to communicate with B without being able to
communicate with each other.
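A site quorum test is easy to state in code. The sketch assumes a quorum
needs strictly more than half of the total weight, so that two disjoint
components can never both hold one. The weights are made up.

```python
# Hypothetical site weights; every site is assumed to know this table.
SITE_WEIGHT = {"A": 1, "B": 1, "C": 1}

def has_quorum(component):
    """True if the sites in the component hold more than half the total weight."""
    total = sum(SITE_WEIGHT.values())
    return 2 * sum(SITE_WEIGHT[s] for s in component) > total

view_of_a = {"A", "B"}  # what A's TM believes is accessible
view_of_c = {"B", "C"}  # what C's TM believes is accessible
both_proceed = has_quorum(view_of_a) and has_quorum(view_of_c)
```

In the scenario above both views pass the test because they overlap in B.
The quorum rule rules out two disjoint quorums, not two inconsistent views.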
QUORUM CONSENSUS ALGORITHM
---------------------------
- assign a non-negative weight to each copy of X
- Define a read threshold (RT) and a write threshold (WT) for X such that
both 2*WT and (RT + WT) are greater than the total weight of all copies of X.
A read quorum (write quorum) for X is any set of copies of X with total weight >= RT (WT)
(each write quorum of X has >= 1 copy in common with any read quorum and any other write quorum of X)
- TM translates r/w's on X to r/w's on copies
- each Write(X) -> set of Write(Xa)'s on a write quorum
- each Read(X) -> set of Read(Xa)'s on a read quorum
- returns the contents of the most up-to-date copy read
- So must tag each copy with a version # (initially 0)
- When the TM processes Write(X), it determines the max version # of the copies
it's about to write, adds 1, and tags all the copies it writes with that version #
- Each Read(Xa) returns value + version #
(TM always selects the copy with the largest version #)
- recovery aided by version #
- recovery aided by version #
Main drawback: need to access multiple remote copies of X in order
to process Read(X), even if there is a local copy available, thereby
defeating the purpose of replication.
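A minimal in-memory sketch of quorum consensus for one data item follows.
The class and method names are hypothetical, and locking, failures, and
messaging are all left out. It shows only the threshold check and the
version numbering.

```python
class QuorumItem:
    """One replicated data item X managed by quorum consensus."""

    def __init__(self, weights, rt, wt):
        total = sum(weights.values())
        if not (2 * wt > total and rt + wt > total):
            raise ValueError("thresholds do not force quorums to overlap")
        self.weights, self.rt, self.wt = weights, rt, wt
        # site -> (version #, value); every copy starts at version 0
        self.copies = {site: (0, None) for site in weights}

    def _weight(self, sites):
        return sum(self.weights[s] for s in sites)

    def write(self, sites, value):
        if self._weight(sites) < self.wt:
            raise ValueError("not a write quorum")
        # New version = 1 + largest version among the copies being written.
        version = max(self.copies[s][0] for s in sites) + 1
        for s in sites:
            self.copies[s] = (version, value)

    def read(self, sites):
        if self._weight(sites) < self.rt:
            raise ValueError("not a read quorum")
        # Return (version #, value) of the most up-to-date copy read.
        return max((self.copies[s] for s in sites), key=lambda c: c[0])
```

For example, with three copies of weight 1 and RT = WT = 2, writing through
{A, B} and then through {B, C} leaves A stale, but any read quorum still
contains at least one copy carrying the latest version.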
VIRTUAL PARTITION ALGORITHM
----------------------------
(Read(X) need only access one copy)
- each copy has a non-negative weight
- each X has a read threshold RT and a write threshold WT, with
2*WT > TOT and RT + WT > TOT (TOT = total weight of all copies of X)
- each site A maintains a "view" V(A) (= the sites it believes it can comm with)
V(T) = V( home(T) )
- T executes as if V(T) is it (i.e., the network consists only of the sites in its view)
- for the DBS to process Read(X) (Write(X)) of T, V(T) must
contain a read quorum (write quorum) of X (can be determined without comm); if not, abort T
- the DBS uses write-all, read-one within V(T) (e.g., pg 305)
- at ACP time T must check whether its view is still correct and, if not, abort
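The view check and the write-all, read-one rule within a view can be
sketched as below. The function names, the argument layout, and the rule of
preferring the home site's copy for reads are illustrative assumptions. View
maintenance and the ACP-time view check are not modelled.

```python
def can_process(op, view, copy_weight, rt, wt):
    """True if the copies of X at sites in the view reach the threshold for op.

    op is 'R' or 'W'; copy_weight maps site -> weight of X's copy there.
    The test uses only local knowledge, so no communication is needed.
    """
    weight_in_view = sum(w for site, w in copy_weight.items() if site in view)
    return weight_in_view >= (rt if op == "R" else wt)

def targets(op, view, copy_weight, home):
    """Copies to access: all copies in the view for a write, one for a read."""
    in_view = [site for site in copy_weight if site in view]
    if op == "W":
        return in_view
    return [home] if home in in_view else in_view[:1]
```

The thresholds are used only to decide whether T may run at all. Once it
runs, a read touches a single copy, unlike plain quorum consensus.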