Hi Kurt, thanks for your reply — really appreciate your (and everyone else’s!) ongoing assistance to people in the C* user community.
All of these clients & servers are on the same (internal) network, so there is no firewall between the clients & servers.

Our C* application is a QA test results system. We have thousands of machines in-house which we use to test the software (not C* related) that we sell, and we’re using C* to capture the results of those tests. So the flow is:

On each machine (~2500 of them):
- we run tests (~5-20 per machine)
- each test has ~8 steps
- each step makes a connection to the DB, logs the start time to C*, invokes the test script which runs the step as a subprocess (2 minutes to 20 hours, no C* usage during this part), and then captures the result to C* (end time, exit status, etc.)

Today we’re not disconnecting from C* during the “run the step” part, and we’re getting OperationTimedOut errors as we scale up the number of tests running against our C* application. My theory is that we’re overwhelming C* with the sheer number of (mostly idle) connections to our 3-node cluster. (I’ve put a rough sketch of the explicit-disconnect change I’m planning to try at the very bottom of this mail, below the quoted thread.)

I’m hoping someone has seen this sort of problem and can say “Yeah, that’s too many connections — I’m sure that’s your problem.” or “We regularly make 12M connections per C* node — you’re screwed up in some other way — have you checked file descriptor limits? What’s your Java __whatever__ setting?” etc.

Thanks Kurt. :-)

- Max

> On Dec 14, 2017, at 6:19 am, kurt greaves <k...@instaclustr.com> wrote:
>
> I see time outs and I immediately blame firewalls. Have you triple checked them?
> Is this only occurring to a subset of clients?
>
> Also, 3.0.6 is pretty dated and has many bugs; you should definitely upgrade to the latest 3.0 release (don't forget to read NEWS.txt).
>
> On 14 Dec. 2017 19:18, "Max Campos" <mc_cassan...@core43.com> wrote:
> Hi -
>
> We’re finally putting our new application under load, and we’re starting to get this error message from the Python driver when under heavy load:
>
> ('Unable to connect to any servers', {'x.y.z.205': OperationTimedOut('errors=None, last_host=None',), 'x.y.z.204': OperationTimedOut('errors=None, last_host=None',), 'x.y.z.206': OperationTimedOut('errors=None, last_host=None',)}) (22.7s)
>
> Our cluster is running 3.0.6, has 3 nodes, and we use RF=3 with CL=QUORUM reads/writes. We have a few thousand machines which are each making 1-10 connections to C* at once, but each of these connections only reads/writes a few records, waits several minutes, and then writes a few records — so while netstat reports ~5K connections per node, they’re generally idle. Peak reads/sec today was ~1500 per node, peak writes/sec was ~300 per node. Read/write latencies peaked at 2.5ms.
>
> Some questions:
>
> 1) Is anyone else out there making this many simultaneous connections? Any idea what a reasonable number of connections is, what is too many, etc.?
>
> 2) Any thoughts on which JMX metrics I should look at to better understand what exactly is exploding? Is there a “number of active connections” metric? We currently look at:
> - client reads/writes per sec
> - read/write latency
> - compaction tasks
> - repair tasks
> - disk used by node
> - disk used by table
> - avg partition size per table
>
> 3) Any other advice?
>
> I think I’ll try doing an explicit disconnect during the waiting period of our application’s execution, so as to get the C* connection count down. Hopefully that will solve the timeout problem.
>
> Thanks for your help.
>
> - Max
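
P.S. To make the plan concrete, here is a rough sketch of the explicit-disconnect pattern I'm thinking of using for each step. This is not our real code: it assumes the DataStax Python driver (cassandra-driver), and the qa_results keyspace, step_results table, column names, and script path are all made up for illustration.

from datetime import datetime
import subprocess
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# The three nodes from the error message above.
CONTACT_POINTS = ['x.y.z.204', 'x.y.z.205', 'x.y.z.206']

def write_step_event(cql, params):
    # Connect only for the moment we actually talk to C*, then shut the
    # connection down so nothing sits idle while the step runs.
    cluster = Cluster(CONTACT_POINTS)
    try:
        session = cluster.connect('qa_results')  # keyspace name is made up
        stmt = SimpleStatement(cql, consistency_level=ConsistencyLevel.QUORUM)
        session.execute(stmt, params)
    finally:
        cluster.shutdown()

def run_step(step_id, script):
    write_step_event(
        "INSERT INTO step_results (step_id, start_time) VALUES (%s, %s)",
        (step_id, datetime.utcnow()))
    # Runs for 2 minutes to 20 hours; no C* connection is held during this part.
    exit_status = subprocess.call(script)
    write_step_event(
        "UPDATE step_results SET end_time = %s, exit_status = %s WHERE step_id = %s",
        (datetime.utcnow(), exit_status, step_id))

if __name__ == '__main__':
    run_step(uuid.uuid4(), ['./run_test_step.sh'])

The idea is simply that each step holds a connection for the second or two it actually needs one, instead of for the whole 2-minute-to-20-hour run, which should bring the per-node connection count way down from the ~5K we see today.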