Am I understanding correctly that you had all connections going to one Cassandra node, which caused one of the *other* nodes to die, and spreading the connections around the cluster fixed it?
On Fri, Dec 3, 2010 at 4:00 AM, Daniel Doubleday <daniel.double...@gmx.net> wrote:
> Hi all
>
> I found an anti-pattern the other day which I wanted to share, although
> it's a pretty special case.
>
> Special case because our production cluster is somewhat strange: 3 servers,
> rf = 3. We do consistent reads/writes with quorum.
>
> I did a long-running read series (loads of reads, as fast as I could) over one
> connection. Since all queries could be handled by that node, the overall
> latency is determined by its own latency plus that of the faster of the other
> two nodes (because the quorum is satisfied with 2 reads). What happens then is
> that after a couple of minutes one of the other two nodes goes into 100% io
> wait and drops most of its read messages, leaving it practically dead while
> the other 2 nodes keep responding at an average of ~10ms. The node that died
> was only a little slower (~13ms average), but it inevitably queues up
> messages. Its average response time increases to the timeout (10 secs) flat.
> It never recovers.
>
> It happened every time, and it wasn't always the same node that would die.
>
> The solution was to return the connection to the pool and get a new one
> for every read, balancing the load on the client side.
>
> Obviously this will not happen in a cluster where the percentage of all rows
> on any one node is small enough. But the same thing will probably happen if
> you scan by continuous tokens (meaning that you will read from the same node
> for a long time).
>
> Cheers,
>
> Daniel Doubleday
> smeet.com, Berlin

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
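[Editor's note] For anyone hitting the same issue, below is a minimal sketch of the client-side balancing Daniel describes: hand the connection back after each read and check out a fresh one, so the coordinator role rotates across nodes instead of pinning every request to one. The pool and connection types are illustrative placeholders, not any particular Cassandra client's API.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // Placeholder connection abstraction (hypothetical, not a real client API).
    interface Connection {
        byte[] read(String key);   // stands in for a quorum read
        void close();
    }

    interface ConnectionFactory {
        Connection connect(String host);
    }

    // Simple round-robin "pool": every checkout targets the next host in the list,
    // so a long read series is spread across the whole cluster.
    class RoundRobinPool {
        private final List<String> hosts;
        private final ConnectionFactory factory;
        private final AtomicInteger next = new AtomicInteger();

        RoundRobinPool(List<String> hosts, ConnectionFactory factory) {
            this.hosts = hosts;
            this.factory = factory;
        }

        Connection checkout() {
            String host = hosts.get(Math.floorMod(next.getAndIncrement(), hosts.size()));
            return factory.connect(host);
        }
    }

    class Scanner {
        // Anti-pattern: open one connection and loop over all keys with it.
        // Fix sketched here: take a fresh connection per read and release it.
        static void scan(RoundRobinPool pool, Iterable<String> keys) {
            for (String key : keys) {
                Connection c = pool.checkout();
                try {
                    c.read(key);
                } finally {
                    c.close();   // return/close instead of reusing the same node
                }
            }
        }
    }

Whether you close or actually return the connection to a pool is an implementation detail; the point is simply that consecutive reads should not all be coordinated by the same node.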