On Sun, Apr 17, 2011 at 6:42 AM, William Oberman <ober...@civicscience.com> wrote:
> At first I was concerned and was going to +1 on a fix, but I think I was
> confused on one detail and I'd like to confirm it.
> - An unsuccessful write implies readers can see either the old or the new value.

Yes. Fixing CASSANDRA-2494 will simply mean that once the new value is
seen in a quorum read, all future quorum reads will see it.

> The tricky part is using a client library: it sounds like there is a period
> of time when a write is unsuccessful but you don't know about it (as the
> retry is internal). But, assuming writes are idempotent, QUORUM is actually
> consistent from successful writes to successful reads... right?

Yes, a successful quorum write implies that future quorum reads will see
the write.

Sean
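To make the idempotence point concrete, here is a minimal sketch of a
client-side retry loop at QUORUM, in the same pycassa style as the test
scripts quoted below (the host, retry count, and backoff are illustrative
assumptions, not anything specified in this thread). Because a retry
rewrites the same columns, a replayed insert cannot corrupt the row:

    import time
    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel, TimedOutException

    pool = pycassa.connect('Keyspace1', ['host1:9160'])  # hypothetical host
    cf = pycassa.ColumnFamily(pool, 'Standard1')

    def quorum_insert_with_retry(key, columns, retries=5, backoff=0.5):
        for _ in range(retries):
            try:
                cf.insert(key, columns,
                          write_consistency_level=ConsistencyLevel.QUORUM)
                return True   # success: future QUORUM reads will see the write
            except TimedOutException:
                # The write may have reached some replicas anyway; until a
                # retry succeeds, readers can see either the old or new value.
                time.sleep(backoff)
        return False          # unsuccessful: old-or-new remains possible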
> On Sun, Apr 17, 2011 at 1:53 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> Tyler is correct, because Cassandra doesn't wait until repair writes
>> are acked before the answer is returned. This is something we can fix.
>>
>> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> > Tyler, your answer seems to contradict this email by Jonathan Ellis
>> > [1]. In it Jonathan says,
>> >
>> > "The important guarantee this gives you is that once one quorum read
>> > sees the new value, all others will too. You can't see the newest
>> > version, then see an older version on a subsequent write [sic, I
>> > assume he meant read], which is the characteristic of non-strong
>> > consistency"
>> >
>> > Jonathan also says,
>> >
>> > "{X, Y} and {X, Z} are equivalent: one node with the write, and one
>> > without. The read will recognize that X's version needs to be sent to
>> > Z, and the write will be complete. This read and all subsequent ones
>> > will see the write. (Z [sic, I assume he meant Y] will be replicated
>> > to asynchronously via read repair.)"
>> >
>> > To me, the statement "this read and all subsequent ones will see the
>> > write" implies that the new value must be committed to Y or Z before
>> > the read can return. If not, the statement must be false.
>> >
>> > Sean
>> >
>> > [1]:
>> > http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E
>> >
>> > On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>> >> Here's what's probably happening:
>> >>
>> >> I'm assuming RF=3 and QUORUM writes/reads here. I'll call the replicas
>> >> A, B, and C.
>> >>
>> >> 1. Writer process writes sequence number 1 and everything works fine.
>> >> A, B, and C all have sequence number 1.
>> >> 2. Writer process writes sequence number 2. Replica A writes
>> >> successfully, B and C fail to respond in time, and a TimedOutException
>> >> is returned. pycassa waits to retry the operation.
>> >> 3. Reader process reads and gets a response from A and B. When the
>> >> rows from A and B are merged, sequence number 2 is the newest and is
>> >> returned. A read repair is pushed to B and C, but they don't yet
>> >> update their data.
>> >> 4. Reader process reads again and gets a response from B and C (before
>> >> they've repaired). These both report sequence number 1, so that's
>> >> returned to the client. This is where you get a decreasing sequence
>> >> number.
>> >> 5. pycassa eventually retries the write; B and C eventually repair
>> >> their data. Either way, both B and C shortly have sequence number 2.
>> >>
>> >> I've left out some of the details of read repair, and this scenario
>> >> could happen in several slightly different ways, but it should give
>> >> you an idea of what's happening.
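To see Tyler's five steps end to end, here is a small self-contained
simulation (my own sketch; the replica state, timestamps, and choice of
responders are simplified assumptions, with no Cassandra or network
involved). Each replica holds a (seqnum, timestamp) version, and a quorum
read returns the newest version among the two replicas that answer:

    # Toy model of Tyler's steps: RF=3, quorum size 2, merge by timestamp.
    replicas = {'A': (1, 100.0), 'B': (1, 100.0), 'C': (1, 100.0)}

    def quorum_read(responders):
        # Merge the versions from the replicas that answered; newest wins.
        return max((replicas[r] for r in responders), key=lambda v: v[1])

    # Step 2: the write of seqnum 2 reaches only A before timing out.
    replicas['A'] = (2, 101.0)

    # Step 3: A and B answer; the merged result is seqnum 2.
    print quorum_read(['A', 'B'])    # (2, 101.0)

    # Step 4: B and C answer before any repair lands; seqnum goes backwards.
    print quorum_read(['B', 'C'])    # (1, 100.0)

    # Step 5: the retried write (or read repair) reaches B and C.
    replicas['B'] = replicas['C'] = (2, 101.0)
    print quorum_read(['B', 'C'])    # (2, 101.0), and it can't regress again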
>> >> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jci...@cmu.edu> wrote:
>> >>>
>> >>> Here it is. There is some setup code and global variable definitions
>> >>> that I left out of the previous code, but they are pretty similar to
>> >>> the setup code here.
>> >>>
>> >>> import pycassa
>> >>> import random
>> >>> import time
>> >>>
>> >>> consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
>> >>> duration = 600
>> >>> sleeptime = 0.0
>> >>> hostlist = 'worker-hostlist'
>> >>>
>> >>> def read_servers(fn):
>> >>>     # Read one server name per line from the host list file.
>> >>>     f = open(fn)
>> >>>     servers = []
>> >>>     for line in f:
>> >>>         servers.append(line.strip())
>> >>>     f.close()
>> >>>     return servers
>> >>>
>> >>> servers = read_servers(hostlist)
>> >>> start_time = time.time()
>> >>> seqnum = -1
>> >>> timestamp = 0
>> >>>
>> >>> while time.time() < start_time + duration:
>> >>>     # Pick a random coordinator for each read.
>> >>>     target_server = random.sample(servers, 1)[0]
>> >>>     target_server = '%s:9160' % target_server
>> >>>     try:
>> >>>         pool = pycassa.connect('Keyspace1', [target_server])
>> >>>         cf = pycassa.ColumnFamily(pool, 'Standard1')
>> >>>         row = cf.get('foo', read_consistency_level=consistency_level)
>> >>>         pool.dispose()
>> >>>     except Exception:
>> >>>         time.sleep(sleeptime)
>> >>>         continue
>> >>>     sq = int(row['seqnum'])
>> >>>     ts = float(row['timestamp'])
>> >>>     if sq < seqnum:
>> >>>         # A decreasing sequence number means this read went backwards.
>> >>>         print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
>> >>>     seqnum = sq
>> >>>     timestamp = ts
>> >>>     if sleeptime > 0.0:
>> >>>         time.sleep(sleeptime)
>> >>>
>> >>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
>> >>>
>> >>> James,
>> >>>
>> >>> Would you mind sharing your reader process code as well?
>> >>>
>> >>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jci...@cmu.edu> wrote:
>> >>>>
>> >>>> I've been experimenting with the consistency model of Cassandra, and
>> >>>> I found something that seems a bit unexpected. In my experiment, I
>> >>>> have 2 processes, a reader and a writer, each accessing a Cassandra
>> >>>> cluster with a replication factor greater than 1. In addition, I
>> >>>> sometimes generate background traffic to simulate a busy cluster by
>> >>>> uploading a large data file to another table.
>> >>>>
>> >>>> The writer executes a loop where it writes a single row that contains
>> >>>> just a sequentially increasing sequence number and a timestamp. In
>> >>>> Python this looks something like:
>> >>>>
>> >>>> while time.time() < start_time + duration:
>> >>>>     target_server = random.sample(servers, 1)[0]
>> >>>>     target_server = '%s:9160' % target_server
>> >>>>
>> >>>>     row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
>> >>>>     seqnum += 1
>> >>>>     # print 'uploading to server %s, %s' % (target_server, row)
>> >>>>
>> >>>>     pool = pycassa.connect('Keyspace1', [target_server])
>> >>>>     cf = pycassa.ColumnFamily(pool, 'Standard1')
>> >>>>     cf.insert('foo', row, write_consistency_level=consistency_level)
>> >>>>     pool.dispose()
>> >>>>
>> >>>>     if sleeptime > 0.0:
>> >>>>         time.sleep(sleeptime)
>> >>>>
>> >>>> The reader simply executes a loop reading this row and reporting
>> >>>> whenever a sequence number is *less* than the previous sequence
>> >>>> number. As expected, with consistency_level=ConsistencyLevel.ONE
>> >>>> there are many inconsistencies, especially with a high replication
>> >>>> factor.
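For reference, the arithmetic behind the expectation that QUORUM reads
always see QUORUM writes: a read quorum and a write quorum must share at
least one replica. A quick sketch (the RF values are just examples):

    # Quorum size is floor(RF/2) + 1, so any two quorums overlap in at
    # least one replica that holds the newest successfully written version.
    for rf in (1, 2, 3, 5, 7):
        q = rf // 2 + 1
        overlap = 2 * q - rf     # minimum replicas common to two quorums
        print 'RF=%i quorum=%i min overlap=%i' % (rf, q, overlap)
    # RF=3 gives quorum=2 and overlap=1, which is why the decreasing
    # sequence numbers reported below are surprising at first glance.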
>> >>>> What is unexpected is that I still detect inconsistencies when the
>> >>>> consistency level is set to ConsistencyLevel.QUORUM. This is
>> >>>> unexpected because the documentation seems to imply that QUORUM will
>> >>>> give consistent results. With background traffic, the average
>> >>>> difference in timestamps was 0.6s, and the maximum was >3.5s. This
>> >>>> means that a client sees a version of the row, and can subsequently
>> >>>> see another version of the row that is 3.5s older than the previous
>> >>>> one.
>> >>>>
>> >>>> What I imagine is happening is this, but I'd like someone who knows
>> >>>> what they're talking about to tell me if it's actually the case:
>> >>>>
>> >>>> I think Cassandra is not using an atomic commit protocol to commit to
>> >>>> the quorum of servers chosen when the write is made. This means that
>> >>>> at some point in the middle of the write, some subset of the quorum
>> >>>> has seen the write while the others have not. At that time, there is
>> >>>> still a quorum of servers that has not seen the update, so depending
>> >>>> on which quorum the client reads from, it may or may not see the
>> >>>> update.
>> >>>>
>> >>>> Of course, I understand that the client is not *choosing* a bad
>> >>>> quorum to read from; it is just the first `q` servers to respond. But
>> >>>> in this case that is effectively random, and sometimes a bad quorum
>> >>>> is "chosen".
>> >>>>
>> >>>> Does anyone have any other insight into what is going on here?
>> >>>
>> >>> --
>> >>> Tyler Hobbs
>> >>> Software Engineer, DataStax
>> >>> Maintainer of the pycassa Cassandra Python client library
>> >>
>> >> --
>> >> Tyler Hobbs
>> >> Software Engineer, DataStax
>> >> Maintainer of the pycassa Cassandra Python client library
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue, First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) ober...@civicscience.com
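A closing note on the fix Jonathan mentions at the top of the thread
(CASSANDRA-2494): conceptually, the change is to make read repair
blocking. In the toy model above, step 3 hands the merged answer back
before any repair write is acknowledged; if the coordinator instead waits
for the stale responders to ack the newest version, then after any quorum
read returns, at least a quorum holds that version, and later quorum reads
cannot regress. A sketch of that idea (my simplification, not Cassandra's
actual implementation):

    # Toy coordinator: a quorum read that blocks on read repair.
    replicas = {'A': (2, 101.0), 'B': (1, 100.0), 'C': (1, 100.0)}

    def quorum_read_blocking(responders):
        # Merge as before: the version with the newest timestamp wins.
        newest = max((replicas[r] for r in responders), key=lambda v: v[1])
        # Push the newest version to the stale responders and wait for
        # their acks *before* returning the answer to the client.
        for r in responders:
            replicas[r] = newest
        return newest

    print quorum_read_blocking(['A', 'B'])   # (2, 101.0); B repaired first
    print quorum_read_blocking(['B', 'C'])   # (2, 101.0); no regression now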