Hi - We're finally putting our new application under load, and under heavy load we're starting to get this error from the Python driver:
('Unable to connect to any servers', {'x.y.z.205': OperationTimedOut('errors=None, last_host=None',), 'x.y.z.204': OperationTimedOut('errors=None, last_host=None',), 'x.y.z.206': OperationTimedOut('errors=None, last_host=None',)}) (22.7s)

Our cluster is running 3.0.6, has 3 nodes, and we use RF=3 with CL=QUORUM for reads and writes. We have a few thousand machines, each making 1-10 connections to C* at once, but each of these connections only reads/writes a few records, waits several minutes, and then writes a few records. So while netstat reports ~5K connections per node, they're generally idle. Peak reads/sec today was ~1500 per node, peak writes/sec was ~300 per node. Read/write latencies peaked at 2.5ms.

Some questions:

1) Is anyone else out there making this many simultaneous connections? Any idea what a reasonable number of connections is, what is too many, etc.?

2) Any thoughts on which JMX metrics I should look at to better understand what exactly is exploding? Is there a "number of active connections" metric? We currently look at:
- client reads/writes per sec
- read/write latency
- compaction tasks
- repair tasks
- disk used by node
- disk used by table
- avg partition size per table

3) Any other advice? I think I'll try doing an explicit disconnect during the waiting period of our application's execution, to get the C* connection count down (a rough sketch of what I mean is below). Hopefully that will solve the timeout problem.

Thanks for your help.

- Max
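P.S. Here's a rough sketch of the explicit-disconnect idea, using the DataStax Python driver. The keyspace, table, queries, and sleep duration are placeholders standing in for our real ones:

    import time

    from cassandra.cluster import Cluster

    # Placeholder contact points, matching the three nodes above.
    CONTACT_POINTS = ['x.y.z.204', 'x.y.z.205', 'x.y.z.206']

    def run_one_job(job_id):
        # Connect, do the handful of initial reads, then shut the driver
        # down so no pooled connections sit idle during the long wait.
        cluster = Cluster(CONTACT_POINTS)
        session = cluster.connect('app_keyspace')  # placeholder keyspace
        rows = session.execute(
            'SELECT state FROM jobs WHERE id = %s', (job_id,))
        cluster.shutdown()  # closes every pooled connection to every node

        time.sleep(300)  # the several-minute wait, now with zero open sockets

        # Reconnect just for the final writes.
        cluster = Cluster(CONTACT_POINTS)
        session = cluster.connect('app_keyspace')
        session.execute(
            'UPDATE jobs SET state = %s WHERE id = %s', ('done', job_id))
        cluster.shutdown()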
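P.P.S. If that isn't enough, I may also try raising the driver's connection timeouts, since the OperationTimedOut here is raised while connections are being established. Something like the following; the 10-second values are just a guess on my part, not a recommendation:

    from cassandra.cluster import Cluster

    cluster = Cluster(
        ['x.y.z.204', 'x.y.z.205', 'x.y.z.206'],
        connect_timeout=10,             # per-host connect timeout; driver default is 5s
        control_connection_timeout=10,  # timeout for the driver's control connection
    )
    session = cluster.connect()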