Tyler is correct, because Cassandra doesn't wait until repair writes are acked before the answer is returned. This is something we can fix.
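To make the failure mode concrete, here is a minimal sketch of the race, and of what waiting for the repair acks would change, assuming RF=3 and QUORUM reads/writes as in Tyler's scenario below. It is plain Python modelling replica state in a dict, not Cassandra internals:

# Toy model (not Cassandra internals): three replicas each hold a sequence
# number; the QUORUM write of seqnum 2 has timed out after reaching only A.
replicas = {'A': 2, 'B': 1, 'C': 1}

def quorum_read(nodes, wait_for_repair_acks=False):
    # Merge the responses from a quorum of replicas.
    merged = max(replicas[n] for n in nodes)
    if wait_for_repair_acks:
        # Hypothetical fixed behavior: apply the repair writes and wait for
        # their acks before answering, so the quorum is consistent on return.
        for n in nodes:
            replicas[n] = max(replicas[n], merged)
    # Current behavior: read repair is fired asynchronously, so stale
    # replicas may still be stale when the next read arrives.
    return merged

print(quorum_read(['A', 'B']))    # Tyler's step 3 below: returns 2
print(quorum_read(['B', 'C']))    # step 4: returns 1, the sequence number regresses

replicas = {'A': 2, 'B': 1, 'C': 1}                        # reset the toy cluster
print(quorum_read(['A', 'B'], wait_for_repair_acks=True))  # 2, and B is repaired before returning
print(quorum_read(['B', 'C'], wait_for_repair_acks=True))  # 2, reads stay monotonic

With asynchronous repair the second quorum can miss the new value entirely; once the first read blocks until its repair writes are acked, any later quorum overlaps a repaired replica and cannot go backwards.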
On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
> Tyler, your answer seems to contradict this email by Jonathan Ellis [1]. In it Jonathan says,
>
> "The important guarantee this gives you is that once one quorum read sees the new value, all others will too. You can't see the newest version, then see an older version on a subsequent write [sic, I assume he meant read], which is the characteristic of non-strong consistency"
>
> Jonathan also says,
>
> "{X, Y} and {X, Z} are equivalent: one node with the write, and one without. The read will recognize that X's version needs to be sent to Z, and the write will be complete. This read and all subsequent ones will see the write. (Z [sic, I assume he meant Y] will be replicated to asynchronously via read repair.)"
>
> To me, the statement "this read and all subsequent ones will see the write" implies that the new value must be committed to Y or Z before the read can return. If not, the statement must be false.
>
> Sean
>
> [1]: http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E
>
> On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>> Here's what's probably happening:
>>
>> I'm assuming RF=3 and QUORUM writes/reads here. I'll call the replicas A, B, and C.
>>
>> 1. The writer process writes sequence number 1 and everything works fine. A, B, and C all have sequence number 1.
>> 2. The writer process writes sequence number 2. Replica A writes successfully, but B and C fail to respond in time, and a TimedOutException is returned. pycassa waits to retry the operation.
>> 3. The reader process reads and gets responses from A and B. When the rows from A and B are merged, sequence number 2 is the newest and is returned. A read repair is pushed to B and C, but they don't update their data yet.
>> 4. The reader process reads again and gets responses from B and C (before they've repaired). Both report sequence number 1, so that's what is returned to the client. This is where you see the decreasing sequence number.
>> 5. pycassa eventually retries the write; B and C eventually repair their data. Either way, both B and C soon have sequence number 2.
>>
>> I've left out some of the details of read repair, and this scenario could happen in several slightly different ways, but it should give you an idea of what's happening.
>>
>> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jci...@cmu.edu> wrote:
>>>
>>> Here it is. There is some setup code and global variable definitions that I left out of the previous code, but they are pretty similar to the setup code here.
>>>
>>> import pycassa
>>> import random
>>> import time
>>>
>>> consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
>>> duration = 600
>>> sleeptime = 0.0
>>> hostlist = 'worker-hostlist'
>>>
>>> def read_servers(fn):
>>>     f = open(fn)
>>>     servers = []
>>>     for line in f:
>>>         servers.append(line.strip())
>>>     f.close()
>>>     return servers
>>>
>>> servers = read_servers(hostlist)
>>> start_time = time.time()
>>> seqnum = -1
>>> timestamp = 0
>>>
>>> while time.time() < start_time + duration:
>>>     target_server = random.sample(servers, 1)[0]
>>>     target_server = '%s:9160' % target_server
>>>     try:
>>>         pool = pycassa.connect('Keyspace1', [target_server])
>>>         cf = pycassa.ColumnFamily(pool, 'Standard1')
>>>         row = cf.get('foo', read_consistency_level=consistency_level)
>>>         pool.dispose()
>>>     except:
>>>         time.sleep(sleeptime)
>>>         continue
>>>     sq = int(row['seqnum'])
>>>     ts = float(row['timestamp'])
>>>     if sq < seqnum:
>>>         print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
>>>     seqnum = sq
>>>     timestamp = ts
>>>     if sleeptime > 0.0:
>>>         time.sleep(sleeptime)
>>>
>>>
>>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
>>>
>>> James,
>>>
>>> Would you mind sharing your reader process code as well?
>>>
>>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jci...@cmu.edu> wrote:
>>>>
>>>> I've been experimenting with the consistency model of Cassandra, and I found something that seems a bit unexpected. In my experiment, I have 2 processes, a reader and a writer, each accessing a Cassandra cluster with a replication factor greater than 1. In addition, sometimes I generate background traffic to simulate a busy cluster by uploading a large data file to another table.
>>>>
>>>> The writer executes a loop where it writes a single row that contains just a sequentially increasing sequence number and a timestamp. In Python this looks something like:
>>>>
>>>> while time.time() < start_time + duration:
>>>>     target_server = random.sample(servers, 1)[0]
>>>>     target_server = '%s:9160' % target_server
>>>>
>>>>     row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
>>>>     seqnum += 1
>>>>     # print 'uploading to server %s, %s' % (target_server, row)
>>>>
>>>>     pool = pycassa.connect('Keyspace1', [target_server])
>>>>     cf = pycassa.ColumnFamily(pool, 'Standard1')
>>>>     cf.insert('foo', row, write_consistency_level=consistency_level)
>>>>     pool.dispose()
>>>>
>>>>     if sleeptime > 0.0:
>>>>         time.sleep(sleeptime)
>>>>
>>>> The reader simply executes a loop reading this row and reporting whenever a sequence number is *less* than the previous sequence number. As expected, with consistency_level=ConsistencyLevel.ONE there are many inconsistencies, especially with a high replication factor.
>>>>
>>>> What is unexpected is that I still detect inconsistencies when it is set to ConsistencyLevel.QUORUM. This is unexpected because the documentation seems to imply that QUORUM will give consistent results. With background traffic the average difference in timestamps was 0.6s, and the maximum was >3.5s. This means that a client sees a version of the row, and can subsequently see another version of the row that is 3.5s older than the previous one.
>>>>
>>>> What I imagine is happening is this, but I'd like someone who knows what they're talking about to tell me if it's actually the case:
>>>>
>>>> I think Cassandra is not using an atomic commit protocol to commit to the quorum of servers chosen when the write is made.
>>>> This means that at some point in the middle of the write, some subset of the quorum has seen the write while others have not. At this time there is a quorum of servers that have not seen the update, so depending on which quorum the client reads from, it may or may not see the update.
>>>>
>>>> Of course, I understand that the client is not *choosing* a bad quorum to read from, it is just the first `q` servers to respond, but in this case it is effectively random and sometimes a bad quorum is "chosen".
>>>>
>>>> Does anyone have any other insight into what is going on here?
>>>
>>>
>>> --
>>> Tyler Hobbs
>>> Software Engineer, DataStax
>>> Maintainer of the pycassa Cassandra Python client library
>>
>>
>> --
>> Tyler Hobbs
>> Software Engineer, DataStax
>> Maintainer of the pycassa Cassandra Python client library
>
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
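The overlap argument Sean quotes from [1] ("once one quorum read sees the new value, all others will too") can also be checked with a small toy model, again in plain Python under the same assumptions as the sketch above (RF=3, QUORUM=2, the new value having reached replica A only); it is not Cassandra code:

from itertools import combinations

replicas = {'A': 2, 'B': 1, 'C': 1}                 # seqnum 2 reached A only
quorums = list(combinations(sorted(replicas), 2))   # ('A','B'), ('A','C'), ('B','C')

for q in quorums:
    newest = max(replicas[n] for n in q)
    print('%s -> %s' % (q, newest))                 # ('B', 'C') -> 1 is the stale quorum

# Any two quorums of 2 out of 3 replicas share at least one node, so once a
# read has repaired its quorum and waited for the acks, every later quorum
# read includes a repaired replica and sees the new value.
assert all(set(a) & set(b) for a, b in combinations(quorums, 2))

Until that repair (or the retried write) lands on B and C, the stale quorum {B, C} is exactly the read that makes James's reader report a decreasing sequence number.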