> How can I handle this kind of situation?

In terms of surviving the problem, a retry on the client side might help, assuming the problem is temporary.
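As a rough illustration of what such a client-side retry could look like, here is a minimal sketch with exponential backoff and jitter. It is written in Python only for brevity. The `UnavailableException` class below is a stand-in for whatever your client library actually raises, and `retry_on_unavailable`, `max_attempts` and `base_delay` are names made up for this example, not part of any Cassandra client API.

```python
import random
import time


class UnavailableException(Exception):
    """Hypothetical stand-in for the exception raised by the client library."""


def retry_on_unavailable(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying when it raises UnavailableException.

    Waits base_delay * 2**(attempt - 1) seconds plus random jitter between
    attempts, and re-raises the exception once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except UnavailableException:
            if attempt == max_attempts:
                raise  # give up; let the caller decide what to do
            delay = base_delay * (2 ** (attempt - 1))
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))
```

You would wrap the actual read or write in a function and pass it as `operation`. Note that this only papers over a temporary problem; if the cluster stays unavailable, the final attempt still fails.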
However, the fact that you're seeing an issue to begin with is certainly interesting, and the way to avoid it depends on what the problem is. My understanding is that the UnavailableException indicates that the node you are talking to was unable to read from/write to a sufficient number of nodes to satisfy your consistency level, presumably either because individual requests failed to return in time, or because the node considers other nodes to be flat out down.

Can you correlate these issues with server-side activity on the nodes, such as background compaction, commitlog rotation or memtable flushing?

Do you see your nodes saying that other nodes in the cluster are "DOWN" and "UP" (flapping)?

How large is the data set in total (in terms of sstable size on disk), and how much memory do you have in your machines (going to page cache)?

Have you observed the behavior of your nodes during compaction; in particular, whether compaction is CPU bound or I/O bound? (That would tend to depend on the data; generally, the larger the individual values, the more disk bound you'd tend to be.)

Just trying to zero in on what the likely root cause is in this case.

--
/ Peter Schuller