Hi all,
Test scenario:
4 nodes (.1, .2, .3, .4)
RF=3
CL=QUORUM
Cassandra 1.1.2
I noticed that ReadCallback's constructor computes a 'blockfor' value
of 2 for RF=3, CL=QUORUM.
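For what it's worth, blockfor = 2 is consistent with simple majority
math. A minimal sketch of that calculation (my own illustration, not
the actual 1.1.2 source):

    // QUORUM blocks for a simple majority of the replication factor.
    static int quorumBlockFor(int replicationFactor)
    {
        return (replicationFactor / 2) + 1;  // RF=3 -> 3/2 + 1 = 2
    }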
According to the API page on the wiki[1], for reads at CL=QUORUM:

    Will query *all* replicas and return the record with the most
    recent timestamp once it has at least a majority of replicas
    (N / 2 + 1) reported.
However, after computing blockfor = 2, the constructor calls
filterEndpoints. filterEndpoints is given the list of all three
replicas, but at the very end of the method it trims the endpoint list
down to only two of them. Those two replicas are then the ones
StorageProxy uses to execute the read/digest calls. So the read goes to
just 2 nodes, not all three as stated on the wiki.
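To make the trimming concrete, here is a paraphrase of what the end of
filterEndpoints effectively does as I read it (a sketch under my
assumptions -- the method name is mine, not the actual 1.1.2 code):

    import java.net.InetAddress;
    import java.util.List;

    // Sketch only: the snitch-sorted candidate list is cut down to
    // 'blockfor' entries before StorageProxy sends the read/digest
    // messages, so any replica past the cutoff is never contacted.
    static List<InetAddress> trimToBlockFor(List<InetAddress> endpoints,
                                            int blockfor)
    {
        return endpoints.subList(0, Math.min(blockfor, endpoints.size()));
    }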
In my test case, I kill a node and then immediately issue a query for
a key that has a replica on the downed node. The live nodes don't yet
know that the node is down. So rather than contacting *all* replicas as
the wiki states, the coordinator contacts only two -- one of which is
the downed node. Since it blocks for two responses and one of the two
contacted nodes is down, the query times out. Retrying the read
produces the same result, even with different nodes as coordinators. I
end up retrying until the failure detectors on the live nodes realize
that the node is down.
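Put differently, until the failure detector convicts the downed node,
the coordinator can never collect enough responses. Trivial arithmetic,
but it shows why the timeout is deterministic rather than a race (again
my illustration, not code from the tree):

    int blockfor = 2;                    // responses QUORUM waits for
    int contacted = 2;                   // endpoints left after trimming
    int dead = 1;                        // downed replica, still marked up
    int maxResponses = contacted - dead; // = 1 < blockfor -> rpc timeout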
So the end result is that if a client reads a row that has a replica
on a newly downed node, the read will time out repeatedly until the
~30-second failure detector window has passed -- even though there are
enough live replicas to satisfy the request. In other words, a value
can be unreadable for upwards of 30 seconds. The fraction of keys
exposed to this shrinks as the ring grows (with RF=3, roughly RF/N of
the keys have a replica on any given node, i.e. 3/4 of them on this
4-node ring), but it's always non-zero.
This doesn't seem right and I'm sure I'm missing something.
Thanks,
Kirk
[1] http://wiki.apache.org/cassandra/API