Thanks, Jonathan. I've filed a bug for this:

https://issues.apache.org/jira/browse/CASSANDRA-2494

Sean

On Sat, Apr 16, 2011 at 10:53 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Tyler is correct, because Cassandra doesn't wait until repair writes
> are acked before the answer is returned. This is something we can fix.
>
> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> Tyler, your answer seems to contradict this email by Jonathan Ellis
>> [1].  In it Jonathan says,
>>
>> "The important guarantee this gives you is that once one quorum read
>> sees the new value, all others will too.   You can't see the newest
>> version, then see an older version on a subsequent write [sic, I
>> assume he meant read], which is the characteristic of non-strong
>> consistency"
>>
>> Jonathan also says,
>>
>> "{X, Y} and {X, Z} are equivalent: one node with the write, and one
>> without. The read will recognize that X's version needs to be sent to
>> Z, and the write will be complete.  This read and all subsequent ones
>> will see the write.  (Z [sic, I assume he meant Y] will be replicated
>> to asynchronously via read repair.)"
>>
>> To me, the statement "this read and all subsequent ones will see the
>> write" implies that the new value must be committed to Y or Z before
>> the read can return.  If not, the statement must be false.
>>
>> Sean
>>
>>
>> [1] : 
>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E
>>
>> Sean
>>
>> On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>>> Here's what's probably happening:
>>>
>>> I'm assuming RF=3 and QUORUM writes/reads here.  I'll call the replicas A,
>>> B, and C.
>>>
>>> 1.  Writer process writes sequence number 1 and everything works fine.  A,
>>> B, and C all have sequence number 1.
>>> 2.  Writer process writes sequence number 2.  Replica A writes successfully,
>>> B and C fail to respond in time, and a TimedOutException is returned.
>>> pycassa waits to retry the operation.
>>> 3.  Reader process reads, gets a response from A and B.  When the row from A
>>> and B is merged, sequence number 2 is the newest and is returned.  A read
>>> repair is pushed to B and C, but they don't yet update their data.
>>> 4.  Reader process reads again, gets a response from B and C (before they've
>>> repaired).  These both report sequence number 1, so that's returned to the
>>> client.  This is where you get a decreasing sequence number.
>>> 5.  pycassa eventually retries the write; B and C eventually repair their
>>> data.  Either way, both B and C shortly have sequence number 2.
>>>
>>> I've left out some of the details of read repair, and this scenario could
>>> happen in several slightly different ways, but it should give you an idea of
>>> what's happening.
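
The five steps above can be replayed with a toy model. This is only a
sketch: replicas are plain dicts mapping a name to the newest sequence
number held, and the names quorum_read, apply_repairs, run_scenario and
the wait_for_repair flag are hypothetical, not Cassandra or pycassa APIs.
The flag illustrates what changes if a read waits for its repair writes
before answering, which is the behavior Jonathan says could be fixed.

```python
# Toy model only: replicas are plain dicts, not real Cassandra nodes.

def apply_repairs(replicas, pending):
    """Apply queued (replica_name, value) repair writes."""
    for name, value in pending:
        replicas[name] = max(replicas[name], value)


def quorum_read(replicas, contacted, wait_for_repair):
    """Return (newest value among contacted replicas, unapplied repairs).

    Simplification: a repair is queued for every replica holding an older
    value, mirroring step 3 where the repair is pushed to both B and C.
    """
    newest = max(replicas[name] for name in contacted)
    pending = [(name, newest) for name in sorted(replicas)
               if replicas[name] < newest]
    if wait_for_repair:
        # Hypothetical fix: do not answer until repairs are applied.
        apply_repairs(replicas, pending)
        pending = []
    return newest, pending


def run_scenario(wait_for_repair):
    replicas = {"A": 1, "B": 1, "C": 1}   # step 1: all replicas have 1
    replicas["A"] = 2                     # step 2: write of 2 reaches only A
    first, pending = quorum_read(replicas, ["A", "B"], wait_for_repair)   # step 3
    second, more = quorum_read(replicas, ["B", "C"], wait_for_repair)     # step 4
    apply_repairs(replicas, pending + more)                               # step 5
    return first, second, replicas
```

With run_scenario(False) the two reads return 2 and then 1, the decreasing
sequence number from step 4. With run_scenario(True) both reads return 2.
In both cases every replica ends up holding 2.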
>>>
>>> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jci...@cmu.edu> wrote:
>>>>
>>>> Here it is.  There is some setup code and global variable definitions that
>>>> I left out of the previous code, but they are pretty similar to the setup
>>>> code here.
>>>>     import pycassa
>>>>     import random
>>>>     import time
>>>>     consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
>>>>     duration = 600
>>>>     sleeptime = 0.0
>>>>     hostlist = 'worker-hostlist'
>>>>     def read_servers(fn):
>>>>         f = open(fn)
>>>>         servers = []
>>>>         for line in f:
>>>>             servers.append(line.strip())
>>>>         f.close()
>>>>         return servers
>>>>     servers = read_servers(hostlist)
>>>>     start_time = time.time()
>>>>     seqnum = -1
>>>>     timestamp = 0
>>>>     while time.time() < start_time + duration:
>>>>         target_server = random.sample(servers, 1)[0]
>>>>         target_server = '%s:9160'%target_server
>>>>         try:
>>>>             pool = pycassa.connect('Keyspace1', [target_server])
>>>>             cf = pycassa.ColumnFamily(pool, 'Standard1')
>>>>             row = cf.get('foo', read_consistency_level=consistency_level)
>>>>             pool.dispose()
>>>>         except:
>>>>             time.sleep(sleeptime)
>>>>             continue
>>>>         sq = int(row['seqnum'])
>>>>         ts = float(row['timestamp'])
>>>>         if sq < seqnum:
>>>>             print 'Row changed: %i %f -> %i %f'%(seqnum, timestamp, sq, ts)
>>>>         seqnum = sq
>>>>         timestamp = ts
>>>>         if sleeptime > 0.0:
>>>>             time.sleep(sleeptime)
>>>>
>>>>
>>>>
>>>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
>>>>
>>>> James,
>>>>
>>>> Would you mind sharing your reader process code as well?
>>>>
>>>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jci...@cmu.edu> wrote:
>>>>>
>>>>> I've been experimenting with the consistency model of Cassandra, and I
>>>>> found something that seems a bit unexpected.  In my experiment, I have 2
>>>>> processes, a reader and a writer, each accessing a Cassandra cluster with a
>>>>> replication factor greater than 1.  In addition, sometimes I generate
>>>>> background traffic to simulate a busy cluster by uploading a large data file
>>>>> to another table.
>>>>>
>>>>> The writer executes a loop where it writes a single row that contains
>>>>> just a sequentially increasing sequence number and a timestamp.  In Python
>>>>> this looks something like:
>>>>>
>>>>>    while time.time() < start_time + duration:
>>>>>        target_server = random.sample(servers, 1)[0]
>>>>>        target_server = '%s:9160'%target_server
>>>>>
>>>>>        row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
>>>>>        seqnum += 1
>>>>>        # print 'uploading to server %s, %s'%(target_server, row)
>>>>>
>>>>>        pool = pycassa.connect('Keyspace1', [target_server])
>>>>>        cf = pycassa.ColumnFamily(pool, 'Standard1')
>>>>>        cf.insert('foo', row, write_consistency_level=consistency_level)
>>>>>        pool.dispose()
>>>>>
>>>>>        if sleeptime > 0.0:
>>>>>            time.sleep(sleeptime)
>>>>>
>>>>>
>>>>> The reader simply executes a loop reading this row and reporting whenever
>>>>> a sequence number is *less* than the previous sequence number.  As expected,
>>>>> with consistency_level=ConsistencyLevel.ONE there are many inconsistencies,
>>>>> especially with a high replication factor.
>>>>>
>>>>> What is unexpected is that I still detect inconsistencies when it is set
>>>>> at ConsistencyLevel.QUORUM.  This is unexpected because the documentation
>>>>> seems to imply that QUORUM will give consistent results.  With background
>>>>> traffic the average difference in timestamps was 0.6s, and the maximum
>>>>> was >3.5s.  This means that a client sees a version of the row, and can
>>>>> subsequently see another version of the row that is 3.5s older than the
>>>>> previous.
>>>>>
>>>>> What I imagine is happening is this, but I'd like someone who knows what
>>>>> they're talking about to tell me if it's actually the case:
>>>>>
>>>>> I think Cassandra is not using an atomic commit protocol to commit to the
>>>>> quorum of servers chosen when the write is made.  This means that at some
>>>>> point in the middle of the write, some subset of the quorum have seen the
>>>>> write, while others have not.  At this time, there is a quorum of servers
>>>>> that have not seen the update, so depending on which quorum the client reads
>>>>> from, it may or may not see the update.
>>>>>
>>>>> Of course, I understand that the client is not *choosing* a bad quorum to
>>>>> read from, it is just the first `q` servers to respond, but in this case it
>>>>> is effectively random and sometimes a bad quorum is "chosen".
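
The hypothesis above can be checked by enumerating read quorums. This is
a minimal sketch, assuming a replication factor of 3 with replicas named
A, B and C and a quorum of floor(RF/2) + 1; quorum_size and
classify_quorums are hypothetical helper names, not part of pycassa.

```python
from itertools import combinations


def quorum_size(replication_factor):
    """Majority quorum: more than half of the replicas."""
    return replication_factor // 2 + 1


def classify_quorums(replicas, have_write):
    """Split every possible read quorum by whether it can see the write.

    replicas   -- set of replica names
    have_write -- set of replica names that already hold the new value
    """
    seeing, missing = [], []
    for group in combinations(sorted(replicas), quorum_size(len(replicas))):
        if have_write.intersection(group):
            seeing.append(group)    # at least one member has the write
        else:
            missing.append(group)   # no member has the write yet
    return seeing, missing


# Mid-write: only A has the new value, so one read quorum misses it.
seeing, missing = classify_quorums({"A", "B", "C"}, {"A"})
```

Here seeing is [('A', 'B'), ('A', 'C')] and missing is [('B', 'C')].
Once the write has reached a full quorum, for example {"A", "B"}, the
missing list is empty, because any two majorities of three replicas
overlap.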
>>>>>
>>>>> Does anyone have any other insight into what is going on here?
>>>>
>>>>
>>>> --
>>>> Tyler Hobbs
>>>> Software Engineer, DataStax
>>>> Maintainer of the pycassa Cassandra Python client library
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Tyler Hobbs
>>> Software Engineer, DataStax
>>> Maintainer of the pycassa Cassandra Python client library
>>>
>>>
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
