On Sun, Apr 17, 2011 at 6:42 AM, William Oberman <ober...@civicscience.com> wrote:
> At first I was concerned and was going to +1 on a fix, but I think I was
> confused on one detail and I'd like to confirm it.
> - An unsuccessful write implies readers can see either the old or the new value.

Yes. Fixing CASSANDRA-2494 will simply mean that once the new value is
seen in a quorum read, all future quorum reads will see it.

> The tricky part is using a client library: it sounds like there is a period
> of time when a write is unsuccessful but you don't know about it (as the
> retry is internal). But, assuming writes are idempotent, QUORUM is actually
> consistent from successful writes to successful reads... right?

Yes, a successful quorum write implies that future quorum reads will see
the write.

Sean
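To make the idempotence point concrete, here is a minimal sketch of a
client-side retry loop at QUORUM, in the same pycassa style as the test
scripts quoted below (the host, retry count, and backoff are illustrative
assumptions, not anything specified in this thread). Because a retry
rewrites the same columns, a replayed insert cannot corrupt the row:

    import time
    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel, TimedOutException

    pool = pycassa.connect('Keyspace1', ['host1:9160'])  # hypothetical host
    cf = pycassa.ColumnFamily(pool, 'Standard1')

    def quorum_insert_with_retry(key, columns, retries=5, backoff=0.5):
        for _ in range(retries):
            try:
                cf.insert(key, columns,
                          write_consistency_level=ConsistencyLevel.QUORUM)
                return True   # success: future QUORUM reads will see the write
            except TimedOutException:
                # The write may have reached some replicas anyway; until a
                # retry succeeds, readers can see either the old or new value.
                time.sleep(backoff)
        return False          # unsuccessful: old-or-new remains possible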
> On Sun, Apr 17, 2011 at 1:53 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> Tyler is correct, because Cassandra doesn't wait until repair writes
>> are acked before the answer is returned. This is something we can fix.
>>
>> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> > Tyler, your answer seems to contradict this email by Jonathan Ellis
>> > [1]. In it Jonathan says,
>> >
>> > "The important guarantee this gives you is that once one quorum read
>> > sees the new value, all others will too. You can't see the newest
>> > version, then see an older version on a subsequent write [sic, I
>> > assume he meant read], which is the characteristic of non-strong
>> > consistency"
>> >
>> > Jonathan also says,
>> >
>> > "{X, Y} and {X, Z} are equivalent: one node with the write, and one
>> > without. The read will recognize that X's version needs to be sent to
>> > Z, and the write will be complete. This read and all subsequent ones
>> > will see the write. (Z [sic, I assume he meant Y] will be replicated
>> > to asynchronously via read repair.)"
>> >
>> > To me, the statement "this read and all subsequent ones will see the
>> > write" implies that the new value must be committed to Y or Z before
>> > the read can return. If not, the statement must be false.
>> >
>> > Sean
>> >
>> > [1]:
>> > http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E
>> >
>> > On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>> >> Here's what's probably happening:
>> >>
>> >> I'm assuming RF=3 and QUORUM writes/reads here. I'll call the replicas
>> >> A, B, and C.
>> >>
>> >> 1. Writer process writes sequence number 1 and everything works fine.
>> >> A, B, and C all have sequence number 1.
>> >> 2. Writer process writes sequence number 2. Replica A writes
>> >> successfully, B and C fail to respond in time, and a TimedOutException
>> >> is returned. pycassa waits to retry the operation.
>> >> 3. Reader process reads and gets a response from A and B. When the
>> >> rows from A and B are merged, sequence number 2 is the newest and is
>> >> returned. A read repair is pushed to B and C, but they don't yet
>> >> update their data.
>> >> 4. Reader process reads again and gets a response from B and C (before
>> >> they've repaired). These both report sequence number 1, so that's
>> >> returned to the client. This is where you get a decreasing sequence
>> >> number.
>> >> 5. pycassa eventually retries the write; B and C eventually repair
>> >> their data. Either way, both B and C shortly have sequence number 2.
>> >>
>> >> I've left out some of the details of read repair, and this scenario
>> >> could happen in several slightly different ways, but it should give
>> >> you an idea of what's happening.
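To see Tyler's five steps end to end, here is a small self-contained
simulation (my own sketch; the replica state, timestamps, and choice of
responders are simplified assumptions, with no Cassandra or network
involved). Each replica holds a (seqnum, timestamp) version, and a quorum
read returns the newest version among the two replicas that answer:

    # Toy model of Tyler's steps: RF=3, quorum size 2, merge by timestamp.
    replicas = {'A': (1, 100.0), 'B': (1, 100.0), 'C': (1, 100.0)}

    def quorum_read(responders):
        # Merge the versions from the replicas that answered; newest wins.
        return max((replicas[r] for r in responders), key=lambda v: v[1])

    # Step 2: the write of seqnum 2 reaches only A before timing out.
    replicas['A'] = (2, 101.0)

    # Step 3: A and B answer; the merged result is seqnum 2.
    print quorum_read(['A', 'B'])    # (2, 101.0)

    # Step 4: B and C answer before any repair lands; seqnum goes backwards.
    print quorum_read(['B', 'C'])    # (1, 100.0)

    # Step 5: the retried write (or read repair) reaches B and C.
    replicas['B'] = replicas['C'] = (2, 101.0)
    print quorum_read(['B', 'C'])    # (2, 101.0), and it can't regress again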
>> >> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jci...@cmu.edu> wrote:
>> >>>
>> >>> Here it is. There is some setup code and global variable definitions
>> >>> that I left out of the previous code, but they are pretty similar to
>> >>> the setup code here.
>> >>>
>> >>> import pycassa
>> >>> import random
>> >>> import time
>> >>>
>> >>> consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
>> >>> duration = 600
>> >>> sleeptime = 0.0
>> >>> hostlist = 'worker-hostlist'
>> >>>
>> >>> def read_servers(fn):
>> >>>     # Read one server name per line from the host list file.
>> >>>     f = open(fn)
>> >>>     servers = []
>> >>>     for line in f:
>> >>>         servers.append(line.strip())
>> >>>     f.close()
>> >>>     return servers
>> >>>
>> >>> servers = read_servers(hostlist)
>> >>> start_time = time.time()
>> >>> seqnum = -1
>> >>> timestamp = 0
>> >>>
>> >>> while time.time() < start_time + duration:
>> >>>     # Pick a random coordinator for each read.
>> >>>     target_server = random.sample(servers, 1)[0]
>> >>>     target_server = '%s:9160' % target_server
>> >>>     try:
>> >>>         pool = pycassa.connect('Keyspace1', [target_server])
>> >>>         cf = pycassa.ColumnFamily(pool, 'Standard1')
>> >>>         row = cf.get('foo', read_consistency_level=consistency_level)
>> >>>         pool.dispose()
>> >>>     except Exception:
>> >>>         time.sleep(sleeptime)
>> >>>         continue
>> >>>     sq = int(row['seqnum'])
>> >>>     ts = float(row['timestamp'])
>> >>>     if sq < seqnum:
>> >>>         # A decreasing sequence number means this read went backwards.
>> >>>         print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
>> >>>     seqnum = sq
>> >>>     timestamp = ts
>> >>>     if sleeptime > 0.0:
>> >>>         time.sleep(sleeptime)
>> >>>
>> >>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
>> >>>
>> >>> James,
>> >>>
>> >>> Would you mind sharing your reader process code as well?
>> >>>
>> >>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jci...@cmu.edu> wrote:
>> >>>>
>> >>>> I've been experimenting with the consistency model of Cassandra, and
>> >>>> I found something that seems a bit unexpected. In my experiment, I
>> >>>> have 2 processes, a reader and a writer, each accessing a Cassandra
>> >>>> cluster with a replication factor greater than 1. In addition, I
>> >>>> sometimes generate background traffic to simulate a busy cluster by
>> >>>> uploading a large data file to another table.
>> >>>>
>> >>>> The writer executes a loop where it writes a single row that contains
>> >>>> just a sequentially increasing sequence number and a timestamp. In
>> >>>> Python this looks something like:
>> >>>>
>> >>>> while time.time() < start_time + duration:
>> >>>>     target_server = random.sample(servers, 1)[0]
>> >>>>     target_server = '%s:9160' % target_server
>> >>>>
>> >>>>     row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
>> >>>>     seqnum += 1
>> >>>>     # print 'uploading to server %s, %s' % (target_server, row)
>> >>>>
>> >>>>     pool = pycassa.connect('Keyspace1', [target_server])
>> >>>>     cf = pycassa.ColumnFamily(pool, 'Standard1')
>> >>>>     cf.insert('foo', row, write_consistency_level=consistency_level)
>> >>>>     pool.dispose()
>> >>>>
>> >>>>     if sleeptime > 0.0:
>> >>>>         time.sleep(sleeptime)
>> >>>>
>> >>>> The reader simply executes a loop reading this row and reporting
>> >>>> whenever a sequence number is *less* than the previous sequence
>> >>>> number. As expected, with consistency_level=ConsistencyLevel.ONE
>> >>>> there are many inconsistencies, especially with a high replication
>> >>>> factor.
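For reference, the arithmetic behind the expectation that QUORUM reads
always see QUORUM writes: a read quorum and a write quorum must share at
least one replica. A quick sketch (the RF values are just examples):

    # Quorum size is floor(RF/2) + 1, so any two quorums overlap in at
    # least one replica that holds the newest successfully written version.
    for rf in (1, 2, 3, 5, 7):
        q = rf // 2 + 1
        overlap = 2 * q - rf     # minimum replicas common to two quorums
        print 'RF=%i quorum=%i min overlap=%i' % (rf, q, overlap)
    # RF=3 gives quorum=2 and overlap=1, which is why the decreasing
    # sequence numbers reported below are surprising at first glance.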
>> >>>> What is unexpected is that I still detect inconsistencies when the
>> >>>> consistency level is set to ConsistencyLevel.QUORUM. This is
>> >>>> unexpected because the documentation seems to imply that QUORUM will
>> >>>> give consistent results. With background traffic, the average
>> >>>> difference in timestamps was 0.6s, and the maximum was >3.5s. This
>> >>>> means that a client sees a version of the row, and can subsequently
>> >>>> see another version of the row that is 3.5s older than the previous
>> >>>> one.
>> >>>>
>> >>>> What I imagine is happening is this, but I'd like someone who knows
>> >>>> what they're talking about to tell me if it's actually the case:
>> >>>>
>> >>>> I think Cassandra is not using an atomic commit protocol to commit to
>> >>>> the quorum of servers chosen when the write is made. This means that
>> >>>> at some point in the middle of the write, some subset of the quorum
>> >>>> has seen the write while the others have not. At that time, there is
>> >>>> still a quorum of servers that has not seen the update, so depending
>> >>>> on which quorum the client reads from, it may or may not see the
>> >>>> update.
>> >>>>
>> >>>> Of course, I understand that the client is not *choosing* a bad
>> >>>> quorum to read from; it is just the first `q` servers to respond. But
>> >>>> in this case that is effectively random, and sometimes a bad quorum
>> >>>> is "chosen".
>> >>>>
>> >>>> Does anyone have any other insight into what is going on here?
>> >>>
>> >>> --
>> >>> Tyler Hobbs
>> >>> Software Engineer, DataStax
>> >>> Maintainer of the pycassa Cassandra Python client library
>> >>
>> >> --
>> >> Tyler Hobbs
>> >> Software Engineer, DataStax
>> >> Maintainer of the pycassa Cassandra Python client library
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue, First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) ober...@civicscience.com
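A closing note on the fix Jonathan mentions at the top of the thread
(CASSANDRA-2494): conceptually, the change is to make read repair
blocking. In the toy model above, step 3 hands the merged answer back
before any repair write is acknowledged; if the coordinator instead waits
for the stale responders to ack the newest version, then after any quorum
read returns, at least a quorum holds that version, and later quorum reads
cannot regress. A sketch of that idea (my simplification, not Cassandra's
actual implementation):

    # Toy coordinator: a quorum read that blocks on read repair.
    replicas = {'A': (2, 101.0), 'B': (1, 100.0), 'C': (1, 100.0)}

    def quorum_read_blocking(responders):
        # Merge as before: the version with the newest timestamp wins.
        newest = max((replicas[r] for r in responders), key=lambda v: v[1])
        # Push the newest version to the stale responders and wait for
        # their acks *before* returning the answer to the client.
        for r in responders:
            replicas[r] = newest
        return newest

    print quorum_read_blocking(['A', 'B'])   # (2, 101.0); B repaired first
    print quorum_read_blocking(['B', 'C'])   # (2, 101.0); no regression now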