Why not send the value itself instead of a placeholder? Now a single
update takes 2 writes on each non-primary replica (write the
placeholder, then write the real value) and N messages from the client
(the value to the primary plus placeholders to the other N-1 replicas),
where N is the replication factor. Seems like extra network and IO
instead of less...
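To make the counting concrete, here is a back-of-the-envelope sketch for RF = 3 (my reading of the proposal above, not anything measured or taken from Cassandra internals):

```python
# Rough message/write counts for the "placeholder" scheme vs. a plain
# write, for replication factor n. Illustration only.
def plain_update(n):
    return {"client_messages": 1,            # one write request
            "writes_per_replica": 1}         # each replica applies it once

def placeholder_update(n):
    return {"client_messages": 1 + (n - 1),  # value to primary + N-1 placeholders
            "writes_per_secondary": 2}       # placeholder now, real value later

print(plain_update(3))        # {'client_messages': 1, 'writes_per_replica': 1}
print(placeholder_update(3))  # {'client_messages': 3, 'writes_per_secondary': 2}
```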
Of course, I still think this sounds like reimplementing Cassandra
internals in a Cassandra client (just guessing, I'm not a Cassandra dev).
On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:
Yang,
How would you deal with the problem where the 1st node reports success
but then crashes before forwarding the write to any of the other
replicas? Then, after switching to the next primary, a read would
return stale data.
Here's a quick-n-dirty way: Send the value to the primary replica
and send placeholder values to the other replicas. The placeholder
value is something like, "PENDING_UPDATE". The placeholder values
are sent with timestamps 1 less than the timestamp for the actual
value that went to the primary. Later, when the changes propagate,
the actual values will overwrite the placeholders. In the event of a
crash before a placeholder gets overwritten, the next read will return
the placeholder, so the client knows the update never fully propagated.
The client will report to the user that the key/column is unavailable.
The downside is you've overwritten your data and maybe would like to
know what the old data was! But, maybe there's another way using other
columns or with MVCC. To be certain of future read consistency, the
client would want a success from the primary and the secondary
replicas, in case the primary goes down immediately as I said above.
The ability to set an "update_pending" flag on any column value would
probably make this work. But, I'll think more on this later.
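Roughly, the write and read paths I have in mind look like the sketch below. The per-node insert/get calls with explicit timestamps are hypothetical stand-ins, not a real Cassandra client API:

```python
PENDING = "PENDING_UPDATE"  # sentinel placeholder value

def placeholder_write(client, key, column, value, replicas, now_usec):
    # client.insert(node, key, column, value, timestamp) is an assumed
    # per-node write call, used here only to illustrate the ordering.
    primary, others = replicas[0], replicas[1:]
    # Placeholders get timestamp now-1, so when the real value later
    # propagates through normal replication it wins the timestamp race.
    for node in others:
        client.insert(node, key, column, PENDING, now_usec - 1)
    client.insert(primary, key, column, value, now_usec)

def placeholder_read(client, node, key, column):
    val = client.get(node, key, column)
    if val == PENDING:
        # Primary died before replicating the real value: report the
        # key/column as unavailable instead of serving stale data.
        raise KeyError("%s/%s unavailable (update pending)" % (key, column))
    return val
```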
aj
On 7/2/2011 10:55 AM, Yang wrote:
There is a JIRA completed in 0.7.x that "prefers" a certain node in
the snitch, so this does roughly what you want MOST of the time,
but the problem is that it does not GUARANTEE that the same node
will always be read. I recently read through the HBase vs. Cassandra
comparison thread that started after Facebook dropped Cassandra for
their messaging system, and understood some of the differences.
What you want is essentially what HBase does. The fundamental
difference there is really due to the gossip protocol: it's a
probabilistic, or eventually consistent, failure detector, while
HBase/Google Bigtable use ZooKeeper/Chubby to provide a strong
failure detector (a distributed lock). So in HBase, if a tablet
server goes down, it really goes down; it cannot re-grab the
tablet from the new tablet server without going through a start-up
protocol (notifying the master, which would notify the clients,
etc.). In other words, it is guaranteed that one tablet is served by
only one tablet server at any given time. In comparison, the above
JIRA only TRIES to serve that key from one particular replica.
HBase can have that guarantee because the group membership is
maintained by the strong failure detector.
Just for hacking curiosity, a strong failure detector + Cassandra
replicas is not impossible (actually it seems not difficult), although
the performance is not clear. What would such a strong failure
detector bring to Cassandra besides this ONE-ONE strong
consistency? That is an interesting question, I think.
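To make the "strong failure detector" idea concrete, here is a rough sketch of per-range ownership built on a ZooKeeper ephemeral lock (using the kazoo Python client; the paths, the serve loop, and the wiring are invented for illustration and are not how HBase or Cassandra actually do it):

```python
from kazoo.client import KazooClient

# Sketch: a replica may serve a token range ONLY while it holds the
# ephemeral lock for that range. If its ZooKeeper session dies, the lock
# is released and another replica can take over -- a "strong" failure
# detector, unlike gossip's probabilistic one.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
zk.start()

def serve_range(range_id, node_id, serve_one_request):
    lock = zk.Lock("/ownership/%s" % range_id, node_id)
    with lock:                    # blocks until this node owns the range
        while zk.connected:       # session lost -> stop serving immediately
            serve_one_request()   # handle reads/writes for this range
```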
Considering that HBase has been deployed on big clusters, it is
probably OK with the performance of the strong ZooKeeper failure
detector. A further question, then: why did Dynamo originally
choose the probabilistic failure detector? Yes, Dynamo's main
theme is "eventually consistent", so the Phi detector is
**enough**, but if a strong detector buys us more at little cost,
wouldn't that be great?
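For reference, the Phi detector mentioned above is the phi accrual failure detector used by Cassandra's gossip: instead of a binary up/down verdict it emits a suspicion level, phi, computed from the observed heartbeat inter-arrival times. A minimal sketch of the idea (exponential-distribution approximation, not any real implementation):

```python
import math

def phi(time_since_last_heartbeat, past_intervals):
    # phi = -log10(P), where P is the probability that a heartbeat
    # arrives even later than it already has, given the observed mean
    # inter-arrival time. Higher phi = stronger suspicion the node is down.
    mean = sum(past_intervals) / len(past_intervals)
    p_later = math.exp(-time_since_last_heartbeat / mean)
    return -math.log10(p_later)

# Heartbeats normally ~1s apart, then 8s of silence:
print(phi(8.0, [1.0, 0.9, 1.1, 1.0]))  # ~3.5, and it keeps rising with time
```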
On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
Is this possible?
All reads and writes for a given key would always go to the same
node from a client. It seems the only thing needed is to allow the
clients to compute which node is the closest replica for the given
key, using the same algorithm C* uses. When the first replica
receives the write request, it will write to itself, which should
complete before any of the other replicas, and then return. The
load should still stay balanced if using the random partitioner. If
the first replica becomes unavailable (however that is defined),
then the clients can send to the next replica in the
ring and switch from ONE reads/writes to QUORUM reads/writes
temporarily until the first replica becomes available again.
QUORUM is required since some replicas may not have received all
updates before the first replica went down.
Will this work? The goal is strong consistency with a read/write
consistency level as low as possible, with a network performance
boost as a secondary benefit.
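A sketch of what the client side might look like under this proposal. The md5 token mimics the RandomPartitioner, but the ring layout, the health check, and the per-node write call are hypothetical, only to illustrate the ONE-to-QUORUM fallback:

```python
import hashlib

def token_for(key):
    # RandomPartitioner-style token: md5 of the key as a big integer.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def replicas_for(key, ring):
    # ring: sorted list of (token, node). The primary is the first node
    # whose token >= the key's token (wrapping around); the remaining
    # replicas follow it around the ring (SimpleStrategy-style).
    t = token_for(key)
    start = next((i for i, (tok, _) in enumerate(ring) if tok >= t), 0)
    return [ring[(start + i) % len(ring)][1] for i in range(len(ring))]

def write(client, key, value, ring, rf=3):
    nodes = replicas_for(key, ring)[:rf]
    primary = nodes[0]
    if client.is_up(primary):                    # hypothetical health check
        client.write(primary, key, value, cl="ONE")
    else:
        # Primary unreachable: fall back to the next replica at QUORUM,
        # since other replicas may have missed writes done at ONE.
        client.write(nodes[1], key, value, cl="QUORUM")
```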