Am I understanding correctly that you had all connections going to one
Cassandra node, which caused one of the *other* nodes to die, and
spreading the connections around the cluster fixed it?

On Fri, Dec 3, 2010 at 4:00 AM, Daniel Doubleday
<daniel.double...@gmx.net> wrote:
> Hi all
>
> I found an anti-pattern the other day which I wanted to share, although it's
> a pretty special case.
>
> Special case because our production cluster is somewhat strange: 3 servers, 
> rf = 3. We do consistent reads/writes with quorum.
>
> I ran a long read series (lots of reads, as fast as I could) over a single
> connection. Since every query could be handled by that node, the overall
> latency is determined by its own read and the faster of the other two nodes
> (because the quorum is satisfied by 2 reads). What happens then is that after
> a couple of minutes one of the other two nodes goes into 100% I/O wait and
> drops most of its read messages, leaving it practically dead while the other
> 2 nodes keep responding at an average of ~10ms. The node that died was only a
> little slower (~13ms average), but it inevitably queues up messages. Its
> average response time climbs to the timeout (10 secs) flat, and it never
> recovers.
>
> This happened every time, and it wasn't always the same node that died.
>
> The solution was to return the connection to the pool and get a new one for
> every read, balancing the load across nodes on the client side (see the
> sketch below).
>
> Obviously this will not happen in a cluster where the fraction of all rows
> held by any one node is small enough. But the same thing will probably happen
> if you scan by contiguous tokens (meaning that you read from the same node
> for a long time).
>
> Cheers,
>
> Daniel Doubleday
> smeet.com, Berlin
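
A minimal sketch of the client-side balancing described above, assuming a
hypothetical RoundRobinPool helper over the cluster's node addresses (the
names are illustrative, not a real client API): every read picks the next node
in the list instead of pinning a single connection.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical helper: cycles through all cluster nodes so no single
    // coordinator absorbs the whole read stream.
    class RoundRobinPool {
        private final List<String> hosts;   // e.g. the three node addresses
        private final AtomicInteger next = new AtomicInteger();

        RoundRobinPool(List<String> hosts) {
            this.hosts = hosts;
        }

        // Host to use for the next request; effectively "return the
        // connection and get a new one" for every read.
        String nextHost() {
            return hosts.get(Math.floorMod(next.getAndIncrement(), hosts.size()));
        }
    }

Drawing a connection to nextHost() per read rotates the coordinator role (and
the extra quorum read it triggers) around the cluster instead of loading the
same replicas for the whole series.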



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
