Thanks Jonathan, I've filed a bug for this: https://issues.apache.org/jira/browse/CASSANDRA-2494
Sean

On Sat, Apr 16, 2011 at 10:53 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Tyler is correct, because Cassandra doesn't wait until repair writes
> are acked before the answer is returned. This is something we can fix.
>
> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> Tyler, your answer seems to contradict this email by Jonathan Ellis
>> [1]. In it Jonathan says,
>>
>> "The important guarantee this gives you is that once one quorum read
>> sees the new value, all others will too. You can't see the newest
>> version, then see an older version on a subsequent write [sic, I
>> assume he meant read], which is the characteristic of non-strong
>> consistency"
>>
>> Jonathan also says,
>>
>> "{X, Y} and {X, Z} are equivalent: one node with the write, and one
>> without. The read will recognize that X's version needs to be sent to
>> Z, and the write will be complete. This read and all subsequent ones
>> will see the write. (Z [sic, I assume he meant Y] will be replicated
>> to asynchronously via read repair.)"
>>
>> To me, the statement "this read and all subsequent ones will see the
>> write" implies that the new value must be committed to Y or Z before
>> the read can return. If not, the statement must be false.
>>
>> Sean
>>
>>
>> [1] :
>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E
>>
>> Sean
>>
>> On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>>> Here's what's probably happening:
>>>
>>> I'm assuming RF=3 and QUORUM writes/reads here. I'll call the replicas A,
>>> B, and C.
>>>
>>> 1. Writer process writes sequence number 1 and everything works fine. A,
>>> B, and C all have sequence number 1.
>>> 2. Writer process writes sequence number 2. Replica A writes successfully,
>>> B and C fail to respond in time, and a TimedOutException is returned.
>>> pycassa waits to retry the operation.
>>> 3. Reader process reads, gets a response from A and B. When the row from A
>>> and B is merged, sequence number 2 is the newest and is returned. A read
>>> repair is pushed to B and C, but they don't yet update their data.
>>> 4. Reader process reads again, gets a response from B and C (before they've
>>> repaired). These both report sequence number 1, so that's returned to the
>>> client. This is where you get a decreasing sequence number.
>>> 5. pycassa eventually retries the write; B and C eventually repair their
>>> data. Either way, both B and C shortly have sequence number 2.
>>>
>>> I've left out some of the details of read repair, and this scenario could
>>> happen in several slightly different ways, but it should give you an idea of
>>> what's happening.
>>>
>>> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jci...@cmu.edu> wrote:
>>>>
>>>> Here it is. There is some setup code and global variable definitions that
>>>> I left out of the previous code, but they are pretty similar to the setup
>>>> code here.
>>>>
>>>> import pycassa
>>>> import random
>>>> import time
>>>>
>>>> consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
>>>> duration = 600
>>>> sleeptime = 0.0
>>>> hostlist = 'worker-hostlist'
>>>>
>>>> def read_servers(fn):
>>>>     f = open(fn)
>>>>     servers = []
>>>>     for line in f:
>>>>         servers.append(line.strip())
>>>>     f.close()
>>>>     return servers
>>>>
>>>> servers = read_servers(hostlist)
>>>> start_time = time.time()
>>>> seqnum = -1
>>>> timestamp = 0
>>>>
>>>> while time.time() < start_time + duration:
>>>>     target_server = random.sample(servers, 1)[0]
>>>>     target_server = '%s:9160'%target_server
>>>>
>>>>     try:
>>>>         pool = pycassa.connect('Keyspace1', [target_server])
>>>>         cf = pycassa.ColumnFamily(pool, 'Standard1')
>>>>         row = cf.get('foo', read_consistency_level=consistency_level)
>>>>         pool.dispose()
>>>>     except:
>>>>         time.sleep(sleeptime)
>>>>         continue
>>>>
>>>>     sq = int(row['seqnum'])
>>>>     ts = float(row['timestamp'])
>>>>
>>>>     if sq < seqnum:
>>>>         print 'Row changed: %i %f -> %i %f'%(seqnum, timestamp, sq, ts)
>>>>
>>>>     seqnum = sq
>>>>     timestamp = ts
>>>>
>>>>     if sleeptime > 0.0:
>>>>         time.sleep(sleeptime)
>>>>
>>>>
>>>>
>>>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
>>>>
>>>> James,
>>>>
>>>> Would you mind sharing your reader process code as well?
>>>>
>>>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jci...@cmu.edu> wrote:
>>>>>
>>>>> I've been experimenting with the consistency model of Cassandra, and I
>>>>> found something that seems a bit unexpected. In my experiment, I have 2
>>>>> processes, a reader and a writer, each accessing a Cassandra cluster with
>>>>> a replication factor greater than 1. In addition, sometimes I generate
>>>>> background traffic to simulate a busy cluster by uploading a large data
>>>>> file to another table.
>>>>>
>>>>> The writer executes a loop where it writes a single row that contains
>>>>> just a sequentially increasing sequence number and a timestamp.
>>>>> In Python this looks something like:
>>>>>
>>>>> while time.time() < start_time + duration:
>>>>>     target_server = random.sample(servers, 1)[0]
>>>>>     target_server = '%s:9160'%target_server
>>>>>
>>>>>     row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
>>>>>     seqnum += 1
>>>>>     # print 'uploading to server %s, %s'%(target_server, row)
>>>>>
>>>>>     pool = pycassa.connect('Keyspace1', [target_server])
>>>>>     cf = pycassa.ColumnFamily(pool, 'Standard1')
>>>>>     cf.insert('foo', row, write_consistency_level=consistency_level)
>>>>>     pool.dispose()
>>>>>
>>>>>     if sleeptime > 0.0:
>>>>>         time.sleep(sleeptime)
>>>>>
>>>>>
>>>>> The reader simply executes a loop reading this row and reporting whenever
>>>>> a sequence number is *less* than the previous sequence number. As
>>>>> expected, with consistency_level=ConsistencyLevel.ONE there are many
>>>>> inconsistencies, especially with a high replication factor.
>>>>>
>>>>> What is unexpected is that I still detect inconsistencies when it is set
>>>>> at ConsistencyLevel.QUORUM. This is unexpected because the documentation
>>>>> seems to imply that QUORUM will give consistent results. With background
>>>>> traffic the average difference in timestamps was 0.6s, and the maximum was >3.5s.
>>>>> This means that a client sees a version of the row, and can
>>>>> subsequently see another version of the row that is 3.5s older than the
>>>>> previous.
>>>>>
>>>>> What I imagine is happening is this, but I'd like someone who knows what
>>>>> they're talking about to tell me if it's actually the case:
>>>>>
>>>>> I think Cassandra is not using an atomic commit protocol to commit to the
>>>>> quorum of servers chosen when the write is made. This means that at some
>>>>> point in the middle of the write, some subset of the quorum have seen the
>>>>> write, while others have not.
>>>>> At this time, there is a quorum of servers
>>>>> that have not seen the update, so depending on which quorum the client
>>>>> reads from, it may or may not see the update.
>>>>>
>>>>> Of course, I understand that the client is not *choosing* a bad quorum to
>>>>> read from, it is just the first `q` servers to respond, but in this case
>>>>> it is effectively random and sometimes a bad quorum is "chosen".
>>>>>
>>>>> Does anyone have any other insight into what is going on here?
>>>>
>>>>
>>>> --
>>>> Tyler Hobbs
>>>> Software Engineer, DataStax
>>>> Maintainer of the pycassa Cassandra Python client library
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Tyler Hobbs
>>> Software Engineer, DataStax
>>> Maintainer of the pycassa Cassandra Python client library
>>>
>>>
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
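
For reference, the five steps Tyler lists in the quoted thread can be replayed with a toy model. This is only a sketch under the thread's assumptions (RF=3, QUORUM reads and writes, replicas A, B, and C). The `Replica` class, the `quorum_read` helper, and the timestamps 100 and 200 are made up for illustration; none of this is Cassandra or pycassa code, and real read repair has more detail than is modeled here.

```python
# Toy model of the timed-out write plus unacknowledged read repair scenario.
# Everything here is hypothetical illustration, not Cassandra internals.

class Replica:
    def __init__(self):
        self.seqnum = None
        self.timestamp = -1

    def apply(self, seqnum, timestamp):
        # Last-write-wins: keep whichever value has the higher timestamp.
        if timestamp > self.timestamp:
            self.seqnum, self.timestamp = seqnum, timestamp


def quorum_read(replicas, names, pending_repairs):
    """Return the newest seqnum among the contacted replicas.

    A repair is queued, but the read returns without waiting for the
    repair to be applied, which is the behavior discussed in the thread.
    """
    newest = max((replicas[n] for n in names), key=lambda r: r.timestamp)
    pending_repairs.append((newest.seqnum, newest.timestamp))
    return newest.seqnum


def simulate():
    replicas = {name: Replica() for name in "ABC"}
    pending_repairs = []
    observed = []

    # Step 1: sequence number 1 reaches all three replicas.
    for replica in replicas.values():
        replica.apply(1, 100)

    # Step 2: sequence number 2 reaches only A; the write times out.
    replicas["A"].apply(2, 200)

    # Step 3: a quorum read served by A and B sees sequence number 2.
    observed.append(quorum_read(replicas, ["A", "B"], pending_repairs))

    # Step 4: a quorum read served by B and C, before any repair has landed.
    observed.append(quorum_read(replicas, ["B", "C"], pending_repairs))

    # Step 5: the queued repairs finally land on every replica.
    for seqnum, timestamp in list(pending_repairs):
        for replica in replicas.values():
            replica.apply(seqnum, timestamp)
    observed.append(quorum_read(replicas, ["B", "C"], pending_repairs))

    return observed


if __name__ == "__main__":
    print(simulate())
```

The second read returns a lower sequence number than the first because the only replica holding the new value is outside the quorum that answered, and the repair queued by the first read has not been applied yet.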