Hi all

I found an anti-pattern the other day which I wanted to share, although it's a
pretty special case.

Special case because our production cluster is somewhat unusual: 3 servers,
RF = 3. We do consistent reads/writes at QUORUM.

I ran a long read series (loads of reads, as fast as I could) over a single
connection. Since every query could be coordinated by that node, the overall
latency is determined by the node itself plus the faster of the other two
replicas (with RF = 3, QUORUM is satisfied by 2 reads). What happens then is
that after a couple of minutes one of the other two nodes goes into 100% I/O
wait and drops most of its read messages, leaving it practically dead, while
the other two nodes keep responding at an average of ~10ms. The node that died
was only a little slower (~13ms average), but it inevitably queues up messages:
its average response time climbs to a flat timeout (10 seconds) and it never
recovers.

It happened every time, and it wasn't always the same node that died.

The solution was to return the connection to the pool and get a new one for
every read, balancing the load on the client side.
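
For illustration, here is a minimal sketch of that change in Java, assuming a
generic, hypothetical connection pool (ConnectionPool and CassandraConnection
are placeholder names, not any specific client library's API):

    // Sketch only: these interfaces stand in for whatever client pool you use.
    interface CassandraConnection {
        byte[] read(String key) throws Exception;   // a single QUORUM read
    }

    interface ConnectionPool {
        CassandraConnection borrow() throws Exception;  // may pick any node as coordinator
        void release(CassandraConnection conn);
    }

    class ReadSeries {
        // Anti-pattern: hold one connection for the whole series, so the same
        // coordinator (and the same fastest replica) serves every read.
        static void readAllPinned(ConnectionPool pool, Iterable<String> keys) throws Exception {
            CassandraConnection conn = pool.borrow();
            try {
                for (String key : keys) {
                    conn.read(key);
                }
            } finally {
                pool.release(conn);
            }
        }

        // Fix: return the connection and borrow a fresh one per read, so the
        // pool can rotate coordinators and spread the load across all nodes.
        static void readAllBalanced(ConnectionPool pool, Iterable<String> keys) throws Exception {
            for (String key : keys) {
                CassandraConnection conn = pool.borrow();
                try {
                    conn.read(key);
                } finally {
                    pool.release(conn);
                }
            }
        }
    }

The per-read borrow/release costs a bit of pool overhead, but it keeps any one
node from coordinating (and queueing) the entire read stream.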

Obviously this will not happen in a cluster where any single node holds a small
enough percentage of all rows (so the replica set varies from read to read). But
the same thing will probably happen if you scan contiguous token ranges, meaning
you read from the same nodes for a long time.

Cheers,

Daniel Doubleday
smeet.com, Berlin
