We seem to be having a fundamental misunderstanding. Thanks for your
comments. aj
On 7/3/2011 8:28 PM, William Oberman wrote:
I'm using cassandra as a tool, like a black box with a certain
contract to the world. Without modifying the "core", C* will send the
updates to all replicas, so your plan would cause the extra write (for
the placeholder). I wasn't assuming a modification to how C*
fundamentally works.
Sounds like you are hacking on (or at least looking at) the source, so
all the power to you if/when you try these kinds of changes.
will
On Sun, Jul 3, 2011 at 8:45 PM, AJ <a...@dude.podzone.net> wrote:
On 7/3/2011 6:32 PM, William Oberman wrote:
Was just going off of: "Send the value to the primary replica
and send placeholder values to the other replicas." Sounded to me like
you wanted to write the value to one and write the placeholder to N-1.
Yes, that is what I was suggesting. The point of the placeholders
is to handle the crash case that I talked about... "like" a WAL does.
But, C* will propagate the value to N-1 eventually anyway,
'cause that's just what it does :-)
will
On Sun, Jul 3, 2011 at 7:47 PM, AJ <a...@dude.podzone.net> wrote:
On 7/3/2011 3:49 PM, Will Oberman wrote:
Why not send the value itself instead of a placeholder? Now
it takes 2x writes on a random node to do a single update
(write placeholder, write update) and N writes from the
client (write the value, write the placeholder to N-1), where N is
the replication factor. Seems like extra network and IO instead
of less...
Sending the value to each node is 1) unnecessary and 2) would
only cause a large burst of network traffic. Think about if
it's a large data value, such as a document. Just let C* do
its thing. The extra messages are tiny and don't
significantly increase latency since they are all sent
asynchronously.
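(For what it's worth, a rough back-of-the-envelope comparison of client-side bytes sent per update under the two options; the replication factor and value size below are made-up numbers, purely for illustration.)

# Python sketch: bytes a client pushes onto the wire for one update.
RF = 3                                  # assumed replication factor
VALUE_BYTES = 1000000                   # assume a large 1 MB document value
PLACEHOLDER_BYTES = len(b"PENDING_UPDATE")

# Option A: the client writes the full value to every replica itself.
full_fanout = RF * VALUE_BYTES

# Option B: the client writes the value once (to the primary) plus tiny
# placeholders to the other RF - 1 replicas; C* replicates the value itself.
placeholder_scheme = VALUE_BYTES + (RF - 1) * PLACEHOLDER_BYTES

print("full fan-out from client:", full_fanout, "bytes")
print("placeholder scheme:      ", placeholder_scheme, "bytes")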
Of course, I still think this sounds like reimplementing
Cassandra internals in a Cassandra client (just guessing,
I'm not a cassandra dev)
I don't see how. Maybe you should take a peek at the source.
On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:
Yang,
How would you deal with the problem where the 1st node
responds with success but then crashes before completely
forwarding the update to any replicas? Then, after switching to the next
primary, a read would return stale data.
Here's a quick-and-dirty way: send the value to the primary
replica and send placeholder values to the other replicas.
The placeholder value is something like "PENDING_UPDATE".
The placeholder values are sent with timestamps 1 less than
the timestamp of the actual value that went to the
primary. Later, when the change propagates, the actual
value will overwrite the placeholders. In the event of a
crash before a placeholder gets overwritten, the next
read will tell the client so, and the client will report
to the user that the key/column is unavailable. The
downside is that you've overwritten your data and might
like to know what the old data was! But maybe there's
another way using other columns or with MVCC. The client
would want a success from the primary and the secondary
replicas to be certain of future read consistency in case
the primary goes down immediately, as I said above. The
ability to set an "update_pending" flag on any column value
would probably make this work. But I'll think more on
this later.
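To make the ordering concrete, here is a minimal Python sketch of the timestamp trick, modeling each replica as an in-memory last-write-wins map instead of a real Cassandra client (every name below is hypothetical, purely illustrative):

PENDING = "PENDING_UPDATE"

class Replica:
    """Toy stand-in for one replica: column -> (value, timestamp)."""
    def __init__(self):
        self.store = {}

    def write(self, column, value, ts):
        # Last write wins by timestamp, like Cassandra column reconciliation.
        current = self.store.get(column)
        if current is None or ts > current[1]:
            self.store[column] = (value, ts)

    def read(self, column):
        return self.store.get(column, (None, None))[0]

def client_update(primary, others, column, value, ts):
    # Placeholders carry ts - 1, so the real value always wins once it arrives.
    primary.write(column, value, ts)
    for replica in others:
        replica.write(column, PENDING, ts - 1)

def client_read(replica, column):
    value = replica.read(column)
    if value == PENDING:
        raise RuntimeError("column unavailable: update never finished propagating")
    return value

# Normal case: replication delivers the value and it overwrites the placeholder.
p1, r2 = Replica(), Replica()
client_update(p1, [r2], "doc", "v1", ts=100)
r2.write("doc", "v1", 100)        # normal C* propagation arriving later
assert client_read(r2, "doc") == "v1"

# Crash case: the primary dies before forwarding, so a reader sees the
# placeholder instead of silently getting stale data.
p2, r3 = Replica(), Replica()
client_update(p2, [r3], "doc", "v2", ts=200)
try:
    client_read(r3, "doc")
except RuntimeError as err:
    print(err)

The point is only that a placeholder stamped ts-1 can never shadow the real value once normal replication delivers it, yet it does surface an interrupted update to the reader.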
aj
On 7/2/2011 10:55 AM, Yang wrote:
there is a JIRA completed in 0.7.x that "Prefers" a
certain node in snitch, so this does roughly what you want
MOST of the time
but the problem is that it does not GUARANTEE that the
same node will always be read. I recently read into the
HBase vs Cassandra comparison thread that started after
Facebook dropped Cassandra for their messaging system, and
understood some of the differences. what you want is
essentially what HBase does. the fundamental difference
there is really due to the gossip protocol: it's a
probablistic, or eventually consistent failure detector
while HBase/Google Bigtable use Zookeeper/Chubby to
provide a strong failure detector (a distributed lock).
so in HBase, if a tablet server goes down, it really goes
down, it can not re-grab the tablet from the new tablet
server without going through a start up protocol
(notifying the master, which would notify the clients
etc), in other words it is guaranteed that one tablet is
served by only one tablet server at any given time. in
comparison the above JIRA only TRYIES to serve that key
from one particular replica. HBase can have that guarantee
because the group membership is maintained by the strong
failure detector.
Just out of hacking curiosity: a strong failure detector +
Cassandra replicas is not impossible (actually it seems not
difficult), although the performance is not clear. What
would such a strong failure detector bring to Cassandra
besides this ONE-ONE strong consistency? That is an
interesting question, I think.
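Just to illustrate what the "strong" part could look like, here is a sketch using ZooKeeper ephemeral nodes via the kazoo Python client; the znode paths and addresses are invented, and it assumes a ZooKeeper server on 127.0.0.1:2181:

import time
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each replica registers an ephemeral znode tied to its ZooKeeper session.
# If the process dies or its session expires, the znode vanishes.
zk.create("/cassandra/alive/node-1", b"10.0.0.1", ephemeral=True, makepath=True)

# Anyone watching the membership gets a definitive "it is really down" signal
# the moment an ephemeral node disappears, unlike the probabilistic gossip view.
def on_membership_change(children):
    print("live replicas:", children)

zk.ChildrenWatch("/cassandra/alive", on_membership_change)

time.sleep(5)   # pretend to serve requests; killing the process here would
                # remove node-1 and fire the watch on every observer
zk.stop()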
Considering that HBase has been deployed on big clusters,
the performance of the strong ZooKeeper failure detector
is probably OK. Then a further question: why did Dynamo
originally choose the probabilistic failure detector? Yes,
Dynamo's main theme is "eventually consistent", so the
Phi failure detector is **enough**, but if a strong
detector buys us more at little cost, wouldn't that be
great?
On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
Is this possible?
All reads and writes for a given key will always go to
the same node from a client. It seems the only thing
needed is to allow the clients to compute which node
is the closest replica for the given key, using the same
algorithm C* uses. When the first replica receives
the write request, it will write to itself, which
should complete before any of the other replicas, and
then return. The loads should still stay balanced if
using the random partitioner. If the first replica
becomes unavailable (however that is defined), then
the clients can send to the next replica in the ring
and switch from ONE writes/reads to QUORUM writes/reads
temporarily until the first replica becomes available
again. QUORUM is required since there could be some
replicas that were not updated before the first replica
went down.

Will this work? The goal is to have strong
consistency with a read/write consistency level as low
as possible, while secondarily getting a network performance boost.
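For what it's worth, a rough Python sketch of the client-side routing idea, assuming a RandomPartitioner-style MD5 token and SimpleStrategy-like placement; the ring, node names, and is_up check are invented for illustration and are not a real client API:

import hashlib
from bisect import bisect_left

# Toy ring: sorted (token, node) pairs. A real client would learn this from
# the cluster rather than hard-coding it.
RING = sorted([(0, "nodeA"), (2**125, "nodeB"), (2**126, "nodeC")])
RF = 3

def token(key):
    # RandomPartitioner-style token derived from the MD5 of the key.
    return int(hashlib.md5(key).hexdigest(), 16) % (2**127)

def replicas(key, rf=RF):
    # Walk clockwise from the first node whose token >= the key's token.
    toks = [t for t, _ in RING]
    start = bisect_left(toks, token(key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

def route(key, is_up):
    # Prefer the first replica at ONE; if it looks down, fall back to QUORUM
    # against the remaining replicas until it comes back.
    primary, *rest = replicas(key)
    if is_up(primary):
        return primary, "ONE"
    return rest[0], "QUORUM"

print(replicas(b"some-row-key"))
print(route(b"some-row-key", is_up=lambda node: True))    # -> (primary, 'ONE')
print(route(b"some-row-key", is_up=lambda node: False))   # -> (next replica, 'QUORUM')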
--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com