Hi Kurt, thanks for your reply. I really appreciate your (and everyone else's!) 
ongoing help to folks in the C* user community.

All of these clients & servers are on the same (internal) network, so there is 
no firewall between the clients & servers.  

Our C* application is a QA test results system.  We have thousands of machines 
in-house that we use to test the software we sell (not C* related), and we're 
using C* to capture the results of those tests.

So the flow is:
On each machine (~2500):
… we run tests (~5-20 per machine)
… each test has ~8 steps
… each step opens a connection to the DB, logs the start time to C*, invokes 
the test script (as a subprocess) which runs the step (2 mins to 20 hours, no 
C* usage during this part), and then writes the result to C* (end time, exit 
status, etc.).  A rough sketch of this is below.
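To make that concrete, here's roughly what each step does today with the Python driver.  This is only a sketch: the keyspace, table, and column names (qa_results, step_results, etc.) are made up for illustration, not our real schema.

    # Sketch of one test step as it works today (illustrative names, not real schema).
    import subprocess
    import uuid
    from datetime import datetime

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['x.y.z.204', 'x.y.z.205', 'x.y.z.206'])
    session = cluster.connect('qa_results')          # keyspace name is made up

    step_id = uuid.uuid4()
    start_stmt = SimpleStatement(
        "INSERT INTO step_results (step_id, start_time) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(start_stmt, (step_id, datetime.utcnow()))

    # "Run the step": 2 minutes to 20 hours, no C* usage, but the connection
    # opened above stays open (and mostly idle) the whole time.
    result = subprocess.run(['./run_step.sh'])

    end_stmt = SimpleStatement(
        "INSERT INTO step_results (step_id, end_time, exit_status) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(end_stmt, (step_id, datetime.utcnow(), result.returncode))

    cluster.shutdown()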

Today we're not disconnecting from C* during the "run the step" part, and we're 
getting OperationTimedOut errors as we scale up the number of tests running 
against our C* application.  My theory is that we're overwhelming C* with the 
sheer number of (mostly idle) connections to our 3-node cluster.
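As I mention in my original mail below, the change I'm planning to try is an explicit disconnect around the long-running part, so nothing stays connected while the step runs.  Roughly (again just a sketch, with made-up names):

    # Sketch of the proposed change: no open C* connection while the step runs.
    from cassandra.cluster import Cluster

    CONTACT_POINTS = ['x.y.z.204', 'x.y.z.205', 'x.y.z.206']

    def write_row(query, params):
        # Connect, write, and tear the connection down immediately.
        cluster = Cluster(CONTACT_POINTS)
        session = cluster.connect('qa_results')   # keyspace name is made up
        session.execute(query, params)
        cluster.shutdown()

The trade-off, as I understand it, is paying the connection setup cost twice per step in exchange for not holding ~5K mostly-idle connections per node.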

I’m hoping someone has seen this sort of problem and can say “Yeah, that’s too 
many connections — I’m sure that’s your problem.”  or “We regularly make 12M 
connections per C* node — you’re screwed up in some other way — have you 
checked file descriptor limits?  What’s your Java __whatever__ setting?”  etc.
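(For what it's worth, if it does turn out to be file descriptors, this is the kind of check I'd run on each node.  It's Linux-specific and assumes we know the Cassandra PID and have permission to read its /proc entries:)

    # Compare open file descriptors against the soft limit for a given PID.
    import os

    def fd_usage(pid):
        open_fds = len(os.listdir('/proc/%d/fd' % pid))
        soft_limit = None
        with open('/proc/%d/limits' % pid) as f:
            for line in f:
                if line.startswith('Max open files'):
                    soft_limit = int(line.split()[3])
                    break
        return open_fds, soft_limit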

Thanks, Kurt.  :-)

- Max

> On Dec 14, 2017, at 6:19 am, kurt greaves <k...@instaclustr.com> wrote:
> 
> I see timeouts and I immediately blame firewalls. Have you triple checked 
> them?
> Is this only occurring to a subset of clients?
> 
> Also, 3.0.6 is pretty dated and has many bugs; you should definitely upgrade 
> to the latest 3.0 (don't forget to read NEWS.txt).
> On 14 Dec. 2017 19:18, "Max Campos" <mc_cassan...@core43.com> wrote:
> Hi -
> 
> We’re finally putting our new application under load, and we’re starting to 
> get this error message from the Python driver when under heavy load:
> 
> ('Unable to connect to any servers', {'x.y.z.205': 
> OperationTimedOut('errors=None, last_host=None',), 'x.y.z.204': 
> OperationTimedOut('errors=None, last_host=None',), 'x.y.z.206': 
> OperationTimedOut('errors=None, last_host=None',)})' (22.7s)
> 
> Our cluster is running 3.0.6, has 3 nodes and we use RF=3, CL=QUORUM 
> reads/writes.  We have a few thousand machines which are each making 1-10 
> connections to C* at once, but each of these connections only reads/writes a 
> few records, waits several minutes, and then writes a few records — so while 
> netstat reports ~5K connections per node, they’re generally idle.  Peak 
> read/sec today was ~1500 per node, peak writes/sec was ~300 per node.  
> Read/write latencies peaked at 2.5ms.
> 
> Some questions:
> 1) Is anyone else out there making this many simultaneous connections?  Any 
> idea what a reasonable number of connections is, what is too many, etc?
> 
> 2) Any thoughts on which JMX metrics I should look at to better understand 
> what exactly is exploding?  Is there a “number of active connections” metric? 
>  We currently look at:
> - client reads/writes per sec
> - read/write latency
> - compaction tasks
> - repair tasks
> - disk used by node
> - disk used by table
> - avg partition size per table
> 
> 3) Any other advice?
> 
> I think I'll try doing an explicit disconnect during the waiting period of 
> our application's execution, so as to get the C* connection count down.  
> Hopefully that will solve the timeout problem.
> 
> Thanks for your help.
> 
> - Max
