Hi Peter,

I should have started with the why instead of the what ...
Background info (I'll try to be brief ...):

We have a very small production cluster (started with 3 nodes, now we have 5). Most of our data is currently in MySQL, but we want to slowly move the larger tables, which are killing our MySQL cache, to Cassandra. After migrating another table we ran into under-capacity last week. Since we couldn't cope with it and the old system was still running in the background, we switched back and added another 2 nodes while still performing writes to the cluster (but no reads).

We are entirely IO-bound. What killed us last week was too many reads combined with flushes and compactions. Reducing compaction priority helped, but it was not enough. The main reason we could not add nodes, though, had to do with the quorum reads we are doing:

First we stopped compaction on all nodes. Everything was golden; the cluster handled the load easily. Then we bootstrapped a new node. That increased the IO pressure on the node which was selected as the streaming source, because it started anti-compaction. The increased pressure slowed the node down - just a little, but enough to get flooded by digest requests from the other nodes. We have seen this before: http://comments.gmane.org/gmane.comp.db.cassandra.user/10687

So the status at that point was: 2 nodes that were serving requests at 10 - 50 ms and one that, errr... wouldn't serve requests (average StorageProxy response time was 10 secs). The problem here was that users would suffer from the slow server, because it was not down and was still being queried by clients. Also, because the streaming node was so overwhelmed, anti-compaction became *real* slow. It would have taken days to finish.

Then we took one node down to prevent digest flooding and it was much better. It almost worked out, but at peak hours it collapsed. At that point we rolled back.

From this my learning / guessing is: we could have survived this if the streaming node had not had to serve
- read requests (because then users would not have been affected), and / or
- digest requests (because that would have reduced IO pressure).

To summarize, I want to prevent that the slowest node
A) affects latency at the level of the StorageProxy
B) gets digest requests, because the only thing they are good for is killing it

Ok, sorry - that was not brief ... Back to my original mail. Questions 1-3) were supposed to describe the current situation in order to validate that my understanding is actually correct.

Point A) would be achieved with quorum if the slowest node were never selected to perform the actual data read. That way the ReadResponseResolver would be satisfied without waiting for the slowest node. I think that's what's supposed to happen when using the dynamic snitch, without any code change. And that was the point of question 1) in my last mail.

Question 2) is important to me because in the future we might want to read at CL 1 for other use cases, and Cassandra seems to shortcut the messaging service in the case where the proxy node contains the row. Thus, in that case the node would respond even if it's the slowest node. So it's not load balancing.

With question 3) I wanted to verify that my understanding of digest requests is correct. Before digging deeper I thought that Cassandra would have some magical way of calculating MD5 digests for conflict resolution without reading the column values. But (I think) obviously it cannot do that, because conflict resolution is not based on timestamps alone but will fall back to selecting the bigger value if timestamps are equal.
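To make sure I am not fooling myself, here is a tiny sketch of my mental model (my own illustration, not Cassandra's actual code; all class and method names are made up): a digest read still has to pull the row off disk and hash the column names, values and timestamps, so it only saves the response payload, not the disk read.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: a "digest read" still has to fetch the row from disk
// and hash every column's name, value and timestamp, so the disk IO is the
// same as for a normal data read; only the bytes sent back over the network
// are smaller (16 bytes of MD5 instead of the full row).
public class DigestReadSketch {

    static final class Column {
        final byte[] value;
        final long timestamp;
        Column(byte[] value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    // Stand-in for the actual row read, i.e. the expensive disk IO.
    static Map<String, Column> readRowFromDisk(String key) {
        Map<String, Column> row = new TreeMap<>();
        row.put("name", new Column("daniel".getBytes(StandardCharsets.UTF_8), 1L));
        return row;
    }

    // Digest over names, values and timestamps - requires the full row.
    static byte[] digest(Map<String, Column> row) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (Map.Entry<String, Column> e : row.entrySet()) {
            md5.update(e.getKey().getBytes(StandardCharsets.UTF_8));
            md5.update(e.getValue().value);
            md5.update(Long.toString(e.getValue().timestamp).getBytes(StandardCharsets.UTF_8));
        }
        return md5.digest();
    }

    public static void main(String[] args) throws Exception {
        // Both request types pay for readRowFromDisk(); the digest variant only
        // saves the response payload, not the read itself.
        Map<String, Column> row = readRowFromDisk("some-key");
        System.out.println("digest bytes: " + digest(row).length);
    }
}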
This statement is important to me because it means that digest requests are just as expensive as regular read requests, and that I might be able to reduce IO pressure significantly if I change the way quorum reads are performed. And that's what question 4) was about.

Your question was:

> | Am I interpreting you correctly that you want to switch the default
> | read mode in the case of QUORUM to optimistically assume that data is
> | consistent and read from one node and only perform digest on the rest?

Well, almost :-) In fact I believe you are describing what is happening now with vanilla Cassandra. Quorum reads are performed by reading the data from the node which is selected by the endpoint snitch (that is: self, same rack, same DC with the standard one, or score-based with the dynamic snitch). All other live nodes receive digest requests. Only when a digest mismatch occurs are full read requests sent to all nodes.

BTW: it seems (but that's probably only a misinterpretation of the code) that disabling read repair is bad when doing quorum reads. For instance, if the two digest requests return before the data request and one of the digest requests returns a mismatch, but the overall response would be valid, nothing would be returned even though the correct data was present. The same seems to be true for every mismatch. I guess that using the read repair config here might be wrong, or at least not intuitive.

What I want to do is change the behavior so that on the first request run only the two 'best' nodes would be used: one for data, one for digest. To do that you'd only have to sort the live nodes by 'proximity' and use the first two. Only if that fails would I switch back to the default behavior and do a full data read on all nodes. So in a perfect world of 3 consistent nodes I would reduce IO reads by 1/3. Roughly like the sketch below.
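Pseudo-Java sketch of the change I have in mind (all names made up, this is not an actual patch against StorageProxy, and in reality the two requests would of course go out in parallel through the messaging service rather than sequentially as written here):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Optimistic quorum read: sort the live replicas by the snitch's proximity
// score, send one data read and one digest request to the two best nodes,
// and only fall back to data reads on all replicas if the optimistic pair
// disagrees or one of the two requests fails.
public class OptimisticQuorumReadSketch {

    interface Replica {
        double score();                 // e.g. latency score from the dynamic snitch
        byte[] readData(String key);    // full data read (may throw on failure/timeout)
        byte[] readDigest(String key);  // digest read (same disk IO, smaller response)
    }

    // Assumes RF = 3 and at least two live replicas.
    static byte[] quorumRead(String key, List<Replica> liveReplicas) {
        // 1) Sort by 'proximity' and pick the two best nodes.
        List<Replica> sorted = new ArrayList<>(liveReplicas);
        sorted.sort(Comparator.comparingDouble(Replica::score));
        Replica dataNode = sorted.get(0);
        Replica digestNode = sorted.get(1);

        try {
            byte[] data = dataNode.readData(key);
            byte[] digest = digestNode.readDigest(key);
            if (Arrays.equals(digestOf(data), digest)) {
                return data;            // happy path: only 2 of the 3 replicas did a read
            }
        } catch (RuntimeException unavailableOrTimeout) {
            // fall through to the pessimistic path
        }

        // 2) Fallback: default behavior, full data reads on all replicas,
        //    resolving to the newest version (resolution omitted here).
        byte[] newest = null;
        for (Replica r : liveReplicas) {
            byte[] candidate = r.readData(key);
            if (newest == null) newest = candidate; // placeholder for real resolution
        }
        return newest;
    }

    static byte[] digestOf(byte[] data) {
        try {
            return java.security.MessageDigest.getInstance("MD5").digest(data);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}

The obvious price is the extra round trip whenever the two 'best' nodes disagree or one of them fails, which I guess is the latency concern you raise below.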
Best, and thanks for your time (if you bothered to read all of this),

Daniel


On Dec 13, 2010, at 9:19 AM, Peter Schuller wrote:

>> 1) If using CL > 1 then using the dynamic snitch will result in a data read
>> from the node with the lowest latency (slightly simplified), even if the proxy node
>> contains the data but has a higher latency than other possible nodes, which
>> means that it is not necessary to do load-based balancing on the client
>> side.
>>
>> 2) If using CL = 1 then the proxy node will always return the data itself,
>> even when there is another node with less load.
>>
>> 3) Digest requests will be sent to all other living peer nodes for that key
>> and will result in a data read on all nodes to calculate the digest. The
>> only difference is that the data is not sent back, but IO-wise it is just as
>> expensive.
>
> I think I may just be completely misunderstanding something, but I'm
> not really sure to what extent you're trying to describe the current
> situation and to what extent you're suggesting changes? I'm not sure
> about (1) and (2), though my knee-jerk reaction is that I would expect
> it to be mostly agnostic w.r.t. which node happens to be taking the
> RPC call (e.g. the "latency" may be due to disk I/O and preferring the
> local node has lots of potential to be detrimental, while forwarding
> is only slightly more expensive).
>
> (3) sounds like what's happening with read repair.
>
>> The next one goes a little further:
>>
>> We read / write with quorum / RF = 3.
>>
>> It seems to me that it wouldn't be hard to patch the StorageProxy to send
>> only one read request and one digest request. Only if one of the requests
>> fails would we have to query the remaining node. We don't need read repair
>> because we have to repair once a week anyway, and quorum guarantees
>> consistency. This way we could reduce read load significantly, which should
>> compensate for the latency increase from failing reads. Am I missing something?
>
> Am I interpreting you correctly that you want to switch the default
> read mode in the case of QUORUM to optimistically assume that data is
> consistent and read from one node and only perform digest on the rest?
>
> What's the goal here? The only thing saved by digest reads at QUORUM
> seems to me to be the throughput saved by not sending the data. You're
> still taking the reads in terms of potential disk I/O, and you still
> have to wait for the response, and you're still taking almost all of
> the CPU hit (still reading and checksumming, just not sending back).
> For highly contended data the need to fall back to real reads would
> significantly increase average latency.
>
> --
> / Peter Schuller