there is a JIRA completed in 0.7.x that makes the snitch "prefer" a certain
node, so this does roughly what you want MOST of the time.
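
for context, the client-side computation AJ describes in the quoted mail
below (hash the key, then walk the ring clockwise to the first replica) would
look roughly like the sketch here. this is only a minimal sketch, assuming
RandomPartitioner-style MD5 tokens and SimpleStrategy placement; RingView and
its methods are made-up names for illustration, not a real client API:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.TreeMap;

    // illustrative only: a client-side copy of the token ring, used to pick the
    // "first" replica for a key the way RandomPartitioner + SimpleStrategy would
    public class RingView {
        // token -> address of the node owning that token (sorted ring)
        private final TreeMap<BigInteger, String> ring = new TreeMap<>();

        public void addNode(BigInteger token, String address) {
            ring.put(token, address);
        }

        // RandomPartitioner-style token: MD5 of the key, taken as an unsigned integer
        public static BigInteger token(byte[] key) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key);
            return new BigInteger(1, digest);
        }

        // first replica = first node clockwise from the key's token, wrapping around
        public String firstReplica(byte[] key) throws Exception {
            BigInteger t = token(key);
            Map.Entry<BigInteger, String> owner = ring.ceilingEntry(t);
            if (owner == null) {
                owner = ring.firstEntry();   // wrapped past the highest token
            }
            return owner.getValue();
        }
    }

with RF > 1 the remaining replicas are just the following nodes clockwise; the
point of the proposal is that if every client runs this same computation, all
reads and writes for a key land on the same replica as long as that node is up.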


but the problem is that it does not GUARANTEE that the same node will always
be read.  I recently read through the HBase vs Cassandra comparison thread
that started after Facebook dropped Cassandra for their messaging system, and
came to understand some of the differences. what you want is essentially what
HBase does. the fundamental difference there really comes down to the failure
detector: gossip gives you a probabilistic, eventually-consistent failure
detector, while HBase/Google Bigtable use Zookeeper/Chubby to provide a strong
failure detector (effectively a distributed lock).  so in HBase, if a tablet
server goes down, it really is down: it cannot grab the tablet back from the
new tablet server without going through a start-up protocol (notifying the
master, which notifies the clients, etc).  in other words it is guaranteed
that one tablet is served by only one tablet server at any given time.  in
comparison the above JIRA only TRIES to serve that key from one particular
replica. HBase can give that guarantee because group membership is maintained
by the strong failure detector.

just out of hacking curiosity, a strong failure detector + Cassandra replicas
is not impossible (actually it seems not difficult), although the performance
implications are unclear. what would such a strong failure detector bring to
Cassandra besides this ONE-ONE strong consistency? that is an interesting
question, I think.
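
to make that concrete, the core piece of such a hack could be an ephemeral
ZooKeeper znode per token range, so that at most one live node can claim a
range at a time and the claim disappears automatically when the owner's
session dies. a rough sketch only, with made-up paths and timeouts (and
assuming the parent znode already exists):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // sketch: claim exclusive ownership of a token range via an ephemeral znode.
    // if this process dies or its session expires, ZooKeeper deletes the znode,
    // which is the "strong" failure signal HBase/Bigtable rely on.
    public class RangeOwner {
        private final ZooKeeper zk;

        public RangeOwner(String zkConnect) throws Exception {
            // 10s session timeout; watcher omitted for brevity
            this.zk = new ZooKeeper(zkConnect, 10_000, event -> { });
        }

        // returns true if we now own the range, false if another live node does.
        // the parent znode /cassandra-owners is assumed to already exist.
        public boolean tryClaim(String rangeId, String myAddress) throws Exception {
            String path = "/cassandra-owners/" + rangeId;   // made-up path
            try {
                zk.create(path, myAddress.getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;                                 // we hold the "lock"
            } catch (KeeperException.NodeExistsException e) {
                return false;                                // someone else owns it
            }
        }
    }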

considering that HBase has been deployed on big clusters, the performance of
the strong Zookeeper failure detector is probably acceptable. a further
question, then: why did Dynamo originally choose the probabilistic failure
detector? yes, Dynamo's main theme is "eventually consistent", so the Phi
detector is **enough**, but if a strong detector buys us more at little cost,
wouldn't that be great?
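
for reference, the Phi detector is an accrual failure detector: instead of a
binary up/down answer it outputs a suspicion level phi that grows the longer
the next heartbeat is overdue relative to the history of inter-arrival times.
a rough sketch of the idea, using a simple exponential model of inter-arrival
times (window size and thresholds are illustrative, not Cassandra's actual
implementation):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // sketch of an accrual ("phi") failure detector: phi grows continuously the
    // longer a heartbeat is overdue, and the caller picks a threshold
    // (e.g. phi > 8) to decide when to treat the peer as down
    public class PhiDetector {
        private static final int WINDOW = 1000;   // keep the last 1000 intervals
        private final Deque<Long> intervals = new ArrayDeque<>();
        private long lastHeartbeatMillis = -1;

        public void heartbeat(long nowMillis) {
            if (lastHeartbeatMillis >= 0) {
                intervals.addLast(nowMillis - lastHeartbeatMillis);
                if (intervals.size() > WINDOW) intervals.removeFirst();
            }
            lastHeartbeatMillis = nowMillis;
        }

        // phi = -log10(P(a heartbeat still arrives later than now)); assuming
        // exponential inter-arrival times, P(later) = exp(-overdue/mean), so
        // phi = overdue / (mean * ln 10)
        public double phi(long nowMillis) {
            if (intervals.isEmpty()) return 0.0;
            double mean = intervals.stream()
                                   .mapToLong(Long::longValue)
                                   .average().orElse(1.0);
            double overdue = nowMillis - lastHeartbeatMillis;
            return overdue / (mean * Math.log(10));
        }
    }

the point being that phi is only ever a tunable suspicion level, whereas a
ZooKeeper session expiry is definitive, which is what buys HBase the
single-owner guarantee.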



On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:

> Is this possible?
>
> All reads and writes for a given key will always go to the same node from a
> client.  It seems the only thing needed is to allow the clients to compute
> which node is the closest replica for the given key using the same algorithm
> C* uses.  When the first replica receives the write request, it will write
> to itself which should complete before any of the other replicas and then
> return.  The loads should still stay balanced if using random partitioner.
>  If the first replica becomes unavailable (however that is defined), then
> the clients can send to the next replica in the ring and switch from ONE
> write/reads to QUORUM write/reads temporarily until the first replica
> becomes available again.  QUORUM is required since there could be some
> replicas that were not updated after the first replica went down.
>
> Will this work?  The goal is to have strong consistency with a read/write
> consistency level as low as possible, with a network performance boost as a
> secondary goal.
>
