Yang,
How would you deal with the problem where the 1st node reports success
but then crashes before forwarding the write to any of the other
replicas? Then, after switching to the next primary, a read would
return stale data.
Here's a quick-n-dirty way: send the value to the primary replica and
send placeholder values to the other replicas. The placeholder value is
something like "PENDING_UPDATE". The placeholder values are sent with
timestamps 1 less than the timestamp of the actual value that went to
the primary. Later, when the changes propagate, the actual value will
overwrite the placeholders. In the event of a crash before a
placeholder gets overwritten, the next read will return the placeholder,
so the client knows and can report to the user that the key/column is
unavailable. The downside is that you've overwritten your data and
maybe would like to know what the old data was! But maybe there's
another way using other columns or with MVCC.

The client would want a success from the primary and the secondary
replicas to be certain of future read consistency in case the primary
goes down immediately, as described above. The ability to set an
"update_pending" flag on any column value would probably make this
work. But I'll think more on this later.
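In the meantime, here's a rough sketch of the placeholder scheme in
Python. The client object and its write/read calls are made-up
stand-ins for whatever driver you use, not a real API; the part that
matters is that the placeholders carry ts - 1, so the real value always
wins once it propagates:

    import time

    PENDING = "PENDING_UPDATE"

    class KeyUnavailable(Exception):
        """A read hit a placeholder left by a crashed primary."""

    def now_micros():
        # Cassandra timestamps are conventionally microseconds.
        return int(time.time() * 1_000_000)

    def write_with_placeholders(client, key, column, value, replicas):
        ts = now_micros()
        primary, secondaries = replicas[0], replicas[1:]
        # The real value goes to the primary with the full timestamp.
        client.write(primary, key, column, value, timestamp=ts)
        # Placeholders go to the secondaries with ts - 1, so normal
        # replication of the real value overwrites them later.
        for node in secondaries:
            client.write(node, key, column, PENDING, timestamp=ts - 1)

    def read_checked(client, node, key, column):
        val = client.read(node, key, column)
        if val == PENDING:
            # The primary crashed before propagating; tell the caller.
            raise KeyUnavailable(f"{key}/{column} has a pending update")
        return val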
aj
On 7/2/2011 10:55 AM, Yang wrote:
There is a JIRA completed in 0.7.x that "prefers" a certain node in
the snitch, so that does roughly what you want MOST of the time.
But the problem is that it does not GUARANTEE that the same node will
always be read. I recently read through the HBase vs. Cassandra
comparison thread that started after Facebook dropped Cassandra for
their messaging system, and understood some of the differences. What
you want is essentially what HBase does. The fundamental difference
there really comes down to the gossip protocol: it is a probabilistic,
eventually-consistent failure detector, while HBase/Google Bigtable
use ZooKeeper/Chubby to provide a strong failure detector (a
distributed lock). So in HBase, if a tablet server goes down, it
really goes down; it cannot re-grab the tablet from the new tablet
server without going through a startup protocol (notifying the
master, which would notify the clients, etc.). In other words, it is
guaranteed that one tablet is served by only one tablet server at any
given time. In comparison, the above JIRA only TRIES to serve that
key from one particular replica. HBase can have that guarantee because
the group membership is maintained by the strong failure detector.
Just for hacking curiosity, a strong failure detector + Cassandra
replicas is not impossible (it actually seems not difficult), although
the performance is unclear. What would such a strong failure detector
bring to Cassandra besides this ONE-ONE strong consistency? That is an
interesting question, I think.
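To make the idea concrete, here's a minimal sketch of such a detector
using ZooKeeper ephemeral nodes via the Python kazoo client (the znode
paths and wiring are my own invention; this is a toy, not a design for
how it would plug into Cassandra):

    from kazoo.client import KazooClient

    # Each replica registers an ephemeral znode. When its ZooKeeper
    # session expires, the znode vanishes and every watcher is
    # notified, giving a strong, lease-based failure signal instead of
    # gossip's probabilistic phi-accrual guess.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    def register(node_id):
        # Ephemeral: removed automatically when this session dies.
        zk.create("/replicas/" + node_id, ephemeral=True, makepath=True)

    def watch(node_id, on_death):
        def callback(event):
            if event.type == "DELETED":
                on_death(node_id)  # definitively down: lease expired
        # Note: the watch fires once; a real detector would re-register.
        zk.exists("/replicas/" + node_id, watch=callback)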
Considering that HBase has been deployed on big clusters, it is
probably OK with the performance of the strong ZooKeeper failure
detector. Then a further question is: why did Dynamo originally
choose the probabilistic failure detector? Yes, Dynamo's main
theme is "eventually consistent", so the Phi-detector is **enough**,
but if a strong detector buys us more at little cost, wouldn't that
be great?
On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
Is this possible?
All reads and writes for a given key will always go to the same
node from a client. It seems the only thing needed is to allow
the clients to compute which node is the closest replica for a
given key, using the same algorithm C* uses (a sketch follows
below). When the first replica receives the write request, it will
write to itself, which should complete before any of the other
replicas, and then return. The loads should still stay balanced if
using the random partitioner. If the first replica becomes
unavailable (however that is defined), then the clients can send to
the next replica in the ring and switch from ONE reads/writes to
QUORUM reads/writes temporarily, until the first replica becomes
available again. QUORUM is required since some replicas may not
have received the latest updates when the first replica went down.
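Here is a toy sketch of the client-side placement computation for
the random partitioner (MD5 tokens on a ring). The ring here is
faked by hashing node addresses; a real client would fetch the
actual tokens from the cluster, e.g. via describe_ring:

    import bisect
    import hashlib

    def key_token(key: bytes) -> int:
        # RandomPartitioner derives tokens from MD5 of the key.
        return int.from_bytes(hashlib.md5(key).digest(), "big")

    def replicas_for(key: bytes, ring: dict, rf: int) -> list:
        # ring maps token -> node address; the primary replica owns
        # the first token >= the key's token (wrapping around the
        # ring), and the next rf - 1 ring neighbors hold the rest.
        tokens = sorted(ring)
        i = bisect.bisect_left(tokens, key_token(key)) % len(tokens)
        return [ring[tokens[(i + k) % len(tokens)]] for k in range(rf)]

    # Fake ring for illustration: tokens derived from node addresses.
    ring = {key_token(n.encode()): n for n in
            ("10.0.0.1", "10.0.0.2", "10.0.0.3")}
    primary, *rest = replicas_for(b"user:42", ring, rf=2)
    # Read/write at ONE against `primary`; if it is down, fall back
    # to QUORUM against the remaining replicas until it recovers.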
Will this work? The goal is to have strong consistency with a
read/write consistency level as low as possible, with a network
performance boost as a secondary goal.