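As a random guess, you might want to check the open file descriptor limit on your C* servers. Use "cat /proc/<pid>/limits", where <pid> is the pid of the Cassandra process; it's the most reliable way to check this, because it shows the limits actually in effect for the running process rather than those of your login shell.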
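If you'd rather script the check, here is a rough Python sketch. It reads the same /proc/<pid>/limits file and also counts the entries under /proc/<pid>/fd, so you can see how close the process is to its limit. The function names are made up for illustration. It assumes Linux, and that you run it as the Cassandra user or root so that /proc/<pid>/fd is readable.

```python
import os
import sys


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of a
    /proc/<pid>/limits dump, or None if the row is missing.
    A value of 'unlimited' is returned as None."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            fields = line[len(label):].split()
            soft, hard = fields[0], fields[1]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    return None


def count_open_fds(pid):
    """Count the descriptors currently open in the process (Linux only)."""
    return len(os.listdir("/proc/{}/fd".format(pid)))


def report(pid):
    """Print the fd limits and the current fd usage for one pid."""
    with open("/proc/{}/limits".format(pid)) as f:
        limits = parse_max_open_files(f.read())
    print("pid {}: open fds = {}, (soft, hard) limit = {}".format(
        pid, count_open_fds(pid), limits))


if __name__ == "__main__" and len(sys.argv) > 1:
    report(sys.argv[1])
```

Run it with the Cassandra pid as the only argument. If the open count sits close to the soft limit around the time of a slow connect, that would at least be consistent with my guess; if it's nowhere near, you can cross this one off.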

On Thu, Jun 14, 2012 at 10:43 AM, Henrik Schröder <skro...@gmail.com> wrote:

> Hi Mina,
>
> The delay is not constant. In the vast majority of cases, connecting
> is almost instant, but occasionally connecting to a server takes a few
> seconds.
>
> We can't even reproduce it reliably. We can see in our server logs that
> sometimes, maybe a few times a day, maybe once every few days, a Cassandra
> server will be slow in accepting connections, and after a little while
> everything will be OK again. It's not a network saturation error, it's not
> a CPU saturation error. Not even GC pauses.
>
> Has anyone else noticed something similar? Or is this simply a result of
> us running a tight connection pool which recycles connections every few
> hours and only waits a few seconds for a connection before timing out?
>
>
> /Henrik
>
>
> On Thu, Jun 14, 2012 at 4:54 PM, Mina Naguib <mina.nag...@bloomdigital.com>
> wrote:
>
>>
>> On 2012-06-14, at 10:38 AM, Henrik Schröder wrote:
>>
>> > Hi everyone,
>> >
>> > We have a problem with our Cassandra cluster: sometimes it takes
>> > several seconds to open a new Thrift connection to the server.
>> > We had this issue when we ran on Windows, and we have it now that
>> > we run on Ubuntu. We had it with our old networking setup, and we
>> > have it with our new networking setup, where we're running over a
>> > dedicated gigabit network. Normally, establishing a new connection is
>> > instant, but once in a while it seems like the server is not accepting
>> > any new connections until three seconds have passed.
>> >
>> > We're of course running a connection-pooling client which mitigates
>> > this, since once a connection is established, it's rock solid.
>> >
>> > We tried switching the rpc_server_type to hsha, but that seems to have
>> > made the problem worse; we're seeing more connection timeouts because
>> > of it.
>> >
>> > For what it's worth, we're running Cassandra version 1.0.10 on Ubuntu.
>> > Our connection pool is configured to abort a connection attempt after
>> > two seconds, and each connection lives for six hours before it's
>> > recycled. Under current load we do about 500 writes/s and 100 reads/s.
>> > We have 20 clients, but each has a very small connection pool of maybe
>> > up to 5 simultaneous connections against each Cassandra server. We see
>> > these connection issues maybe once a day, but always at random
>> > intervals.
>> >
>> > We've tried to get more information through DataStax OpsCenter, the JMX
>> > console, and our own application monitoring and logging, but we can't
>> > see anything out of the ordinary. Sometimes, seemingly at random, it's
>> > just really slow to connect. We're all out of ideas. Does anyone here
>> > have suggestions on where to look and what to do next?
>>
>> Have you ruled out potential non-Cassandra causes?
>>
>> A constant 3 seconds sounds like it could be a timeout/retry somewhere.
>> Do you contact Cassandra via a hostname or an IP address? If via
>> hostname, rule out DNS.
>>
>> Either way, I'd fire up tcpdump on both the client and the server and
>> observe the TCP handshake. Specifically, see if the SYN packet is sent
>> and received, whether the SYN-ACK is sent back right away and received,
>> and likewise the final ACK.
>>
>> If that looks good, then TCP-wise you're in good shape and the problem is
>> in a higher layer (Thrift). If not, see where the delay/drop/retry
>> happens. If it's in the first packet, it may be a networking/routing
>> issue. If in the second, it may be capacity at the server (investigate
>> with lsof/netstat/JMX), etc.
>>
>>
>

-- 
Tyler Hobbs
DataStax <http://datastax.com/>