Hi Peter

I should have started with the why instead of the what ...

Background info (I'll try to be brief ...)

We have a very small production cluster (started with 3 nodes, now we have 5). 
Most of our data currently lives in MySQL, but we want to slowly move the
larger tables, which are killing our MySQL cache, to Cassandra. After migrating
another table we hit under-capacity last week. Since we couldn't cope with it
and the old system was still running in the background, we switched back and
added another 2 nodes while still performing writes to the cluster (but no
reads).

We are entirely IO-bound. What killed us last week were too many reads combined
with flushes and compactions. Reducing compaction priority helped, but it was
not enough. The main reason we could not add nodes, though, had to do with the
quorum reads we are doing:

First we stopped compaction on all nodes. Everything was golden. The cluster
handled the load easily. Then we bootstrapped a new node. That increased the
IO pressure on the node which was selected as the streaming source, because it
started anti-compaction. The increased pressure slowed the node down. Just a
little, but enough for it to get flooded by digest requests from the other
nodes. We have seen this before:
http://comments.gmane.org/gmane.comp.db.cassandra.user/10687.

So the status at that point was: two nodes serving requests at 10-50 ms and one
that, errr... wouldn't serve requests (average response time at the storage
proxy was 10 seconds). The problem was that users suffered from the slow server
because it was not down and was still being queried by clients. Also, because
the streaming node was so overwhelmed, anti-compaction became *really* slow. It
would have taken days to finish.

Then we took one node down to prevent the digest flooding, and things got much
better. It almost worked out, but at peak hours the cluster collapsed. At that
point we rolled back.

From this, my learning / guessing is:

We could have survived this if the streaming node had not had to serve

- read requests (because then users would not have been affected) and / or
- digest requests (because that would have reduced IO pressure)

To summarize: I want to prevent the slowest node from

A) affecting latency at the level of the StorageProxy
B) getting digest requests, because the only thing they are good for is
   killing it


Ok sorry - that was not brief ...

Back to my original mail. Points 1-3) were supposed to describe the current
situation, in order to validate that my understanding is actually correct.

Point A) would be achieved with quorum if the slowest node were never selected
to perform the actual data read. That way the ReadResponseResolver would be
satisfied without waiting for the slowest node. I think that's what is supposed
to happen when using the dynamic snitch, without any code change, and that was
the point of question 1) in my last mail. Question 2) is important to me
because in the future we might want to read at CL 1 for other use cases, and
Cassandra seems to shortcut the messaging service in the case where the proxy
node contains the row. In that case the node would respond even if it is the
slowest node, so it's not load balancing.
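
To make the CL 1 shortcut vs. the snitch-sorted selection concrete, here is a
small stand-alone Java sketch of how I picture the selection of the data-read
endpoint. The class and method names are my own stand-ins, not the actual
StorageProxy code:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class DataEndpointSelectionSketch {

    // CL > 1 (e.g. quorum): the dynamic snitch's scores decide who gets the
    // actual data read, so the slowest replica is naturally avoided.
    static String pickForQuorum(List<String> liveReplicas,
                                final Map<String, Double> snitchScores) {
        List<String> sorted = new ArrayList<String>(liveReplicas);
        Collections.sort(sorted, new Comparator<String>() {
            public int compare(String a, String b) {
                return snitchScores.get(a).compareTo(snitchScores.get(b));
            }
        });
        return sorted.get(0); // lowest score = "closest" / fastest
    }

    // CL 1: if the proxy node is itself a replica, it seems to answer locally
    // and skip the messaging service, even if it is currently the slowest
    // replica. Hence "not load balancing".
    static String pickForOne(String self, List<String> liveReplicas,
                             Map<String, Double> snitchScores) {
        if (liveReplicas.contains(self)) {
            return self;
        }
        return pickForQuorum(liveReplicas, snitchScores);
    }
}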

With question 3) I wanted to verify that my understanding of digest requests is
correct. Before digging deeper I thought that Cassandra would have some magical
way of calculating MD5 digests for conflict resolution without reading the
column values. But (I think) it obviously cannot do that, because conflict
resolution is not based on timestamps alone but falls back to selecting the
bigger value if timestamps are equal. This is important to me because it means
that digest requests are just as expensive as regular read requests, and that I
might be able to reduce IO pressure significantly if I change the way quorum
reads are performed.
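
Put as code, here is a minimal, self-contained sketch (my own stand-in Column
type, not the real Cassandra classes) of why I believe a digest read costs the
same IO as a data read: the digest is computed over the column names, values
and timestamps, so all of them have to be read first, and only the bytes sent
back over the network are saved.

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class DigestCostSketch {

    // Simplified stand-in for a column, for illustration only.
    static class Column {
        final byte[] name;
        final byte[] value;
        final long timestamp;

        Column(byte[] name, byte[] value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Computing the MD5 digest touches every column's name, value and
    // timestamp, i.e. the same data a full read would have fetched from disk.
    // Only the response payload is smaller.
    static byte[] digest(List<Column> row) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (Column c : row) {
            md5.update(c.name);
            md5.update(c.value);
            md5.update(ByteBuffer.allocate(8).putLong(c.timestamp).array());
        }
        return md5.digest();
    }
}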

And that's what question 4) was about:

Your question was:

> | Am I interpreting you correctly that you want to switch the default
> | read mode in the case of QUORUM to optimistically assume that data is
> | consistent and read from one node and only perform digest on the rest?


Well, almost :-) In fact I believe you are describing what happens now with
vanilla Cassandra. Quorum reads are performed by reading the data from the node
selected by the endpoint snitch (that is: self, same rack, same DC with the
standard one, or score-based with the dynamic snitch). All other live nodes
receive digest requests.

Only when a digest mismatch occurs are full read requests sent to all nodes.
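
As a sketch of that current dispatch, with hypothetical hooks (sendDataRead /
sendDigestRead are my own names, not the real messaging API):

import java.util.List;

// Hypothetical message-sending hooks, for illustration only.
interface Messaging {
    void sendDataRead(String replica);
    void sendDigestRead(String replica);
}

public class CurrentQuorumDispatchSketch {

    // Replicas are assumed to be pre-sorted by snitch proximity / score,
    // closest first.
    static void dispatch(List<String> sortedLiveReplicas, Messaging messaging) {
        // The data read goes to the snitch-preferred replica ...
        messaging.sendDataRead(sortedLiveReplicas.get(0));
        // ... and every other live replica gets a digest request.
        for (String replica :
                sortedLiveReplicas.subList(1, sortedLiveReplicas.size())) {
            messaging.sendDigestRead(replica);
        }
    }

    // Only on a digest mismatch does everyone get a full data read.
    static void onDigestMismatch(List<String> sortedLiveReplicas,
                                 Messaging messaging) {
        for (String replica : sortedLiveReplicas) {
            messaging.sendDataRead(replica);
        }
    }
}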

BTW: It seems (but that's probably only a misinterpretation of the code on my
part) that disabling read repair is bad when doing quorum reads. For instance,
if the two digest requests return before the data request and one of them
returns a mismatch, nothing would be returned even though the overall response
would have been valid and the correct data was present. The same seems to be
true for every mismatch. I guess that using the read repair config here might
be wrong, or at least not intuitive.

What I want to do is change the behavior so that on the first request run only
the two 'best' nodes are used: one for data, one for a digest. To do that you'd
only have to sort the live nodes by 'proximity' and use the first two. Only if
that fails would I switch back to the default behavior and do a full data read
on all nodes. So in a perfect world of 3 consistent nodes I would reduce IO
reads by 1/3.
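
In sketch form, again with hypothetical hook names rather than the real
messaging API, the proposed first attempt and its fallback would look roughly
like this:

import java.util.List;

// Same hypothetical hooks as in the sketch above, repeated so this snippet
// stands on its own.
interface ReadMessaging {
    void sendDataRead(String replica);
    void sendDigestRead(String replica);
}

public class ProposedQuorumDispatchSketch {

    // First attempt: only the two 'best' replicas are touched, one data read
    // plus one digest request. Assumes at least two live replicas (RF = 3 in
    // our case).
    static void dispatchFirstAttempt(List<String> sortedLiveReplicas,
                                     ReadMessaging messaging) {
        messaging.sendDataRead(sortedLiveReplicas.get(0));   // closest serves the data
        messaging.sendDigestRead(sortedLiveReplicas.get(1)); // next one only checks the digest
        // The remaining replicas are not touched at all.
    }

    // Fallback on timeout or digest mismatch: today's behavior, i.e. full
    // data reads on all live replicas.
    static void dispatchFallback(List<String> sortedLiveReplicas,
                                 ReadMessaging messaging) {
        for (String replica : sortedLiveReplicas) {
            messaging.sendDataRead(replica);
        }
    }
}

With RF = 3 and all replicas consistent that is 2 disk reads per request
instead of 3, which is where the 1/3 comes from; the price is an extra round
trip whenever the first attempt has to fall back.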

Best, and thanks for your time (if you bothered to read all of this),
Daniel


On Dec 13, 2010, at 9:19 AM, Peter Schuller wrote:

>> 1) If using CL > 1, then using the dynamic snitch will result in a data read
>> from the node with the lowest latency (slightly simplified), even if the
>> proxy node contains the data but has a higher latency than other possible
>> nodes, which means that it is not necessary to do load-based balancing on
>> the client side.
>> 
>> 2) If using CL = 1, then the proxy node will always return the data itself,
>> even when there is another node with less load.
>> 
>> 3) Digest requests will be sent to all other living peer nodes for that key
>> and will result in a data read on all nodes to calculate the digest. The
>> only difference is that the data is not sent back but IO-wise it is just as
>> expensive.
> 
> I think I may just be completely misunderstanding something, but I'm
> not really sure to what extent you're trying to describe the current
> situation and to what extent you're suggesting changes? I'm not sure
> about (1) and (2) though my knee-jerk reaction is that I would expect
> it to be mostly agnostic w.r.t. which node happens to be taking the
> RPC call (e.g. the "latency" may be due to disk I/O and preferring the
> local node has lots of potential to be detrimental, while forwarding
> is only slightly more expensive).
> 
> (3) sounds like what's happening with read repair.
> 
>> The next one goes a little further:
>> 
>> We read / write with quorum / rf = 3.
>> 
>> It seems to me that it wouldn't be hard to patch the StorageProxy to send
>> only one read request and one digest request. Only if one of the requests
>> fails would we have to query the remaining node. We don't need read repair
>> because we have to repair once a week anyway, and quorum guarantees
>> consistency. This way we could reduce read load significantly, which should
>> compensate for the latency increase caused by failing reads. Am I missing
>> something?
> 
> Am I interpreting you correctly that you want to switch the default
> read mode in the case of QUORUM to optimistically assume that data is
> consistent and read from one node and only perform digest on the rest?
> 
> What's the goal here? The only thing saved by digest reads at QUORUM
> seems to me to be the throughput saved by not sending the data. You're
> still taking the reads in terms of potential disk I/O, and you still
> have to wait for the response, and you're still taking almost all of
> the CPU hit (still reading and checksumming, just not sending back).
> For highly contended data the need to fall back to real reads would
> significantly increase average latency.
> 
> -- 
> / Peter Schuller
