Hi -

We’re finally putting our new application under load, and under heavy load 
we’re starting to get this error from the Python driver:

('Unable to connect to any servers', {'x.y.z.205': 
OperationTimedOut('errors=None, last_host=None',), 'x.y.z.204': 
OperationTimedOut('errors=None, last_host=None',), 'x.y.z.206': 
OperationTimedOut('errors=None, last_host=None',)}) (22.7s)

Our cluster runs Cassandra 3.0.6 with 3 nodes, RF=3, and QUORUM reads/writes. 
We have a few thousand machines, each making 1-10 connections to C* at once, 
but each connection only reads/writes a few records, waits several minutes, 
and then writes a few more records (the lifecycle is sketched below). So 
while netstat reports ~5K connections per node, they’re generally idle. Peak 
reads/sec today was ~1500 per node, peak writes/sec was ~300 per node. 
Read/write latencies peaked at 2.5ms.
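
For concreteness, each connection’s lifecycle looks roughly like the sketch 
below (not our exact code; the contact points are the nodes from the error 
above, but the keyspace/table/column names, job id, and wait length are 
invented for illustration):

    import time

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster

    WAIT_SECONDS = 300        # illustrative; the real wait is several minutes
    job_id = 'example-job'    # invented placeholder

    cluster = Cluster(contact_points=['x.y.z.204', 'x.y.z.205', 'x.y.z.206'])
    session = cluster.connect('our_keyspace')   # keyspace name is made up
    session.default_consistency_level = ConsistencyLevel.QUORUM

    # Read/write a few records up front...
    row = session.execute("SELECT state FROM jobs WHERE id = %s",
                          [job_id]).one()

    # ...hold the (now idle) connection open for several minutes...
    time.sleep(WAIT_SECONDS)

    # ...then write a few records and exit.
    session.execute("UPDATE jobs SET state = %s WHERE id = %s",
                    ['done', job_id])
    cluster.shutdown()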

Some questions:
1) Is anyone else out there making this many simultaneous connections? Any 
idea what a reasonable number of connections is, what’s too many, etc.?

2) Any thoughts on which JMX metrics I should look at to better understand what 
exactly is exploding?  Is there a “number of active connections” metric?  We 
currently look at:
- client reads/writes per sec
- read/write latency
- compaction tasks
- repair tasks
- disk used by node
- disk used by table
- avg partition size per table

3) Any other advice?  

I think I’ll try doing an explicit disconnect during the waiting period of our 
application’s execution, so as to get the C* connection count down; something 
like the sketch below. Hopefully that will solve the timeout problem.
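
Concretely, I mean something like this (same invented names as the sketch 
above; cluster.shutdown() closes the connection pools and the control 
connection, so nothing stays open during the wait):

    import time

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster

    CONTACT_POINTS = ['x.y.z.204', 'x.y.z.205', 'x.y.z.206']
    WAIT_SECONDS = 300        # illustrative
    job_id = 'example-job'    # invented placeholder

    def connect():
        cluster = Cluster(contact_points=CONTACT_POINTS)
        session = cluster.connect('our_keyspace')   # keyspace name is made up
        session.default_consistency_level = ConsistencyLevel.QUORUM
        return cluster, session

    # Do the initial reads/writes, then drop the connection entirely.
    cluster, session = connect()
    session.execute("UPDATE jobs SET state = %s WHERE id = %s",
                    ['running', job_id])
    cluster.shutdown()        # no sockets held open during the wait

    time.sleep(WAIT_SECONDS)

    # Reconnect only when there's work to do again.
    cluster, session = connect()
    session.execute("UPDATE jobs SET state = %s WHERE id = %s",
                    ['done', job_id])
    cluster.shutdown()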

Thanks for your help.

- Max