Yang, you seem to understand all of the details, at least the ones that have occurred to me, such as having a failure-handling protocol rather than a perfect failure detector, and new-leader coordination.

I finally did some more reading outside of the Cassandra space and realized HBase has what I was asking about. If Cass could be flexible enough to allow such a setup without violating its goals, that would be great, imho.

This is just a brainstorming, exploratory thread (by a non-expert) based on the simple observation that, if all clients went directly to the responsible replica every time, then consistency and performance could be improved by:

- guaranteed monotonic read/write consistency
- read-your-writes consistency
- higher performance (lower latency)

all with a read/write consistency level of only ONE.

Basically, it's like a master/slave setup, except that the slaves can take over as master, so you still have high availability.
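
To make it more concrete, here is a rough sketch (plain Java, with made-up names; not any existing client API) of how a client could compute the "first" replica for a key roughly the way the random partitioner does, talk to it at ONE, and fall back to the next replica at QUORUM when it looks down:

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

public class PrimaryReplicaRouter {

    // token -> node address; assumed to be refreshed from the cluster's
    // ring view (how is left out of this sketch)
    private final TreeMap<BigInteger, String> ring =
            new TreeMap<BigInteger, String>();

    public void addNode(BigInteger token, String address) {
        ring.put(token, address);
    }

    // roughly how RandomPartitioner derives a token: MD5 of the key,
    // taken as a non-negative BigInteger
    static BigInteger token(byte[] key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key);
        return new BigInteger(digest).abs();
    }

    // walk the ring clockwise from the key's token; the first node is the
    // "first" (primary) replica, the next ones (up to the replication
    // factor) are the other replicas
    public List<String> replicaOrder(byte[] key) throws Exception {
        BigInteger t = token(key);
        List<String> order = new ArrayList<String>();
        order.addAll(ring.tailMap(t).values());
        order.addAll(ring.headMap(t).values());
        return order;
    }

    // the proposed policy: use the primary at ONE; if it is marked down,
    // use the next live replica and temporarily raise the CL to QUORUM
    // until the primary is back
    public String chooseCoordinator(byte[] key, Set<String> downNodes)
            throws Exception {
        for (String node : replicaOrder(key)) {
            if (!downNodes.contains(node)) {
                return node; // primary => ONE, any later node => QUORUM
            }
        }
        throw new IllegalStateException("no replica reachable");
    }
}

The actual reads/writes to the chosen node are left out; the point is only that every client can pick the same node deterministically, and that the fallback path is where QUORUM comes in.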

I'm not saying it's easy, and I'm only coming at this from a customer-request point of view. The question is: would this be useful if it could be added to Cass's bag of tricks? Cass is already a hybrid.

aj

On 7/2/2011 1:57 PM, Yang wrote:

Jonathan:

Could you please elaborate on specifically why they are "not even close"? I think I see what you mean (please correct me if I misunderstood): Cassandra's failure detector is consulted on every write, while HBase's failure detector is only used when a tablet server joins or leaves.

In order to have the single-write-entry-point approach originally brought up in this thread, I think you need a strong membership protocol to lock the key-range leadership; once leadership is acquired,
the failure detector does not need to be consulted on every write.

Yes, by definition of the original requirement brought up in this thread,
Cassandra's write behavior would change to be more like HBase, or MongoDB in "replica set" mode. But it seems this leader mode could even co-exist with the multi-entry write mode that Cassandra uses now, just as you can use a different CL for each write request. In that case you would keep the current lightweight Phi accrual failure detector
and add ZooKeeper-based leader election for single-entry-mode writes.
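
Just to sketch what I mean by locking the key-range leadership (raw ZooKeeper API, the standard ephemeral-sequential election recipe; the znode layout and names below are made up for illustration, not Cassandra code):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RangeLeaderElection implements Watcher {

    private final ZooKeeper zk;
    private final String electionPath; // e.g. /cassandra/leaders/range-42
                                       // (parent path assumed pre-created)
    private String myNode;             // our ephemeral sequential znode

    public RangeLeaderElection(String zkHosts, String electionPath)
            throws Exception {
        this.zk = new ZooKeeper(zkHosts, 30000, this);
        this.electionPath = electionPath;
    }

    // join the election: create an ephemeral sequential znode under the
    // range's election path; the replica with the lowest sequence number
    // holds the "lock" on the key range
    public void join(String myAddress) throws Exception {
        myNode = zk.create(electionPath + "/n_",
                myAddress.getBytes("UTF-8"),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    public boolean isLeader() throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren(electionPath, false);
        Collections.sort(children);
        String smallest = electionPath + "/" + children.get(0);
        return smallest.equals(myNode);
    }

    // if our ZK session expires, the ephemeral znode vanishes and the next
    // replica in line takes over the range
    @Override
    public void process(WatchedEvent event) {
        // react to session/child events, e.g. re-check isLeader()
    }
}

The point being that ZK is only touched when leadership changes (session expiry or an explicit handoff), never on the per-write path.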

Thanks
Yang


(I should correct my terminology: it's not a "strong failure detector" that's needed, it's a "strong membership protocol". Strongly complete and accurate failure detectors do not exist in asynchronous distributed systems; see Chandra and Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems", Journal of the ACM, 43(2):225-267, 1996 <http://doi.acm.org/10.1145/226643.226647>, and the FLP result, "Impossibility of Distributed Consensus with One Faulty Process" <http://www.podc.org/influential/2001.html>.)


On Sat, Jul 2, 2011 at 10:11 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

    The way HBase uses ZK (for master election) is not even close to how
    Cassandra uses the failure detector.

    Using ZK for each operation would (a) not scale and (b) not work
    cross-DC for any reasonable latency requirements.

    On Sat, Jul 2, 2011 at 11:55 AM, Yang <teddyyyy...@gmail.com> wrote:
    > There is a JIRA completed in 0.7.x that "prefers" a certain node in
    > the snitch, so this does roughly what you want MOST of the time.
    >
    > But the problem is that it does not GUARANTEE that the same node will
    > always be read. I recently read the HBase vs Cassandra comparison
    > thread that started after Facebook dropped Cassandra for their
    > messaging system, and understood some of the differences. What you
    > want is essentially what HBase does. The fundamental difference there
    > is really due to the gossip protocol: it's a probabilistic, or
    > eventually consistent, failure detector, while HBase/Google Bigtable
    > use ZooKeeper/Chubby to provide a strong failure detector (a
    > distributed lock). So in HBase, if a tablet server goes down, it
    > really goes down; it cannot re-grab the tablet from the new tablet
    > server without going through a start-up protocol (notifying the
    > master, which would notify the clients, etc.). In other words, it is
    > guaranteed that one tablet is served by only one tablet server at any
    > given time. In comparison, the above JIRA only TRIES to serve that
    > key from one particular replica. HBase can have that guarantee
    > because the group membership is maintained by the strong failure
    > detector.
    > Just for hacking curiosity, a strong failure detector + Cassandra
    > replicas is not impossible (actually it seems not difficult),
    > although the performance is not clear. What would such a strong
    > failure detector bring to Cassandra besides this ONE-ONE strong
    > consistency? That is an interesting question, I think.
    > Considering that HBase has been deployed on big clusters, it is
    > probably OK with the performance of the strong ZooKeeper failure
    > detector. Then a further question is: why did Dynamo originally
    > choose the probabilistic failure detector? Yes, Dynamo's main theme
    > is "eventually consistent", so the Phi-detector is *enough*, but if a
    > strong detector buys us more at little cost, wouldn't that be great?
    >
    >
    > On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
    >>
    >> Is this possible?
    >>
    >> All reads and writes for a given key will always go to the same node
    >> from a client.  It seems the only thing needed is to allow the
    >> clients to compute which node is the closest replica for the given
    >> key using the same algorithm C* uses.  When the first replica
    >> receives the write request, it will write to itself, which should
    >> complete before any of the other replicas, and then return.  The
    >> load should still stay balanced if using the random partitioner.
    >> If the first replica becomes unavailable (however that is defined),
    >> then the clients can send to the next replica in the ring and switch
    >> from ONE writes/reads to QUORUM writes/reads temporarily until the
    >> first replica becomes available again.  QUORUM is required since
    >> there could be some replicas that were not updated after the first
    >> replica went down.
    >>
    >> Will this work?  The goal is to have strong consistency with a
    >> read/write consistency level as low as possible, with a network
    >> performance boost as a secondary benefit.
    >
    >



    --
    Jonathan Ellis
    Project Chair, Apache Cassandra
    co-founder of DataStax, the source for professional Cassandra support
    http://www.datastax.com


