As a random guess, you might want to check your open file descriptor limit
on the C* servers.  Use "cat /proc/<pid>/limits", where <pid> is the pid of
the Cassandra process; it's the most reliable way to check this.
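
Something along these lines should show it (a rough sketch; the pgrep
pattern assumes the stock CassandraDaemon main class and may need adjusting
for your setup):

    # find the Cassandra pid
    pid=$(pgrep -f CassandraDaemon)

    # the limit actually in effect for the running process
    grep "Max open files" /proc/$pid/limits

    # how many file descriptors it has open right now
    ls /proc/$pid/fd | wc -l

If the open count is anywhere near the limit, accepting new sockets can fail
or back up, which would look a lot like slow connects.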

On Thu, Jun 14, 2012 at 10:43 AM, Henrik Schröder <skro...@gmail.com> wrote:

> Hi Mina,
>
> The delay is not constant: in the vast majority of cases, connecting is
> almost instant, but occasionally connecting to a server takes a few
> seconds.
>
> We can't even reproduce it reliably; we can see in our server logs that
> sometimes, maybe a few times a day, maybe once every few days, a Cassandra
> server will be slow in accepting connections, and after a little while
> everything will be ok again. It's not network saturation, it's not CPU
> saturation, and it's not even GC pauses.
>
> Has anyone else noticed something similar? Or is this simply a result of
> us running a tight connection pool which recycles connections every few
> hours and only waits a few seconds for a connection before timing out?
>
>
> /Henrik
>
>
> On Thu, Jun 14, 2012 at 4:54 PM, Mina Naguib <mina.nag...@bloomdigital.com
> > wrote:
>
>>
>> On 2012-06-14, at 10:38 AM, Henrik Schröder wrote:
>>
>> > Hi everyone,
>> >
>> > We have problem with our Cassandra cluster, and that is that sometimes
>> it takes several seconds to open a new Thrift connection to the server.
>> We've had this issue when we ran on windows, and we have this issue now
>> that we run on Ubuntu. We've had it with our old networking setup, and we
>> have it with our new networking setup where we're running it over a
>> dedicated gigabit network. Normally establishing a new connection is
>> instant, but once in a while it seems like it's not accepting any new
>> connections until three seconds have passed.
>> >
>> > We're of course running a connection-pooling client which mitigates
>> this, since once a connection is established, it's rock solid.
>> >
>> > We tried switching the rpc_server_type to hsha, but that seems to have
>> made the problem worse; we're seeing more connection timeouts because of
>> it.
>> >
>> > For what it's worth, we're running Cassandra version 1.0.10 on Ubuntu,
>> and our connection pool is configured to abort a connection attempt after
>> two seconds; each connection lives for six hours and is then recycled.
>> Under current load we do about 500 writes/s and 100 reads/s; we have 20
>> clients, but each has a very small connection pool of maybe up to 5
>> simultaneous connections against each Cassandra server. We see these
>> connection issues maybe once a day, but always at random intervals.
>> >
>> > We've tried to get more information through DataStax OpsCenter, the JMX
>> console, and our own application monitoring and logging, but we can't see
>> anything out of the ordinary. Sometimes, seemingly at random, it's just
>> really slow to connect. We're all out of ideas. Does anyone here have
>> suggestions on where to look and what to do next?
>>
>> Have you ruled out potential non-Cassandra causes?
>>
>> A constant 3 seconds sounds like it could be a timeout/retry somewhere.
>> Do you contact Cassandra via a hostname or an IP address?  If via
>> hostname, rule out DNS.
>>
>> Either way, I'd fire up tcpdump on both the client and the server and
>> observe the TCP handshake.  Specifically, see whether the SYN packet is
>> sent and received, whether the SYN-ACK is sent back right away and
>> received, and whether the final ACK completes.
>>
>> If that looks good, then TCP-wise you're in good shape and the problem is
>> in a higher layer (Thrift).  If not, see where the delay/drop/retry
>> happens.  If it's in the first packet, it may be a networking/routing
>> issue.  If it's in the second, it may be capacity at the server
>> (investigate with lsof/netstat/JMX), etc.
>>
>>
>>
>
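
To expand on the DNS point above: if your clients connect by hostname, it's
worth timing name resolution on a client box around the time of a slow
connect.  A rough check (the hostname below is just a placeholder):

    # resolve the node name the same way libc would
    time getent hosts cassandra-node-01.example.com

A resolver that has to fall back from an unreachable nameserver to a working
one is a classic source of a fixed few-second delay.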
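
And for the tcpdump suggestion, a minimal capture that watches only the
handshake packets on the Thrift port (this assumes the default rpc_port of
9160) could look something like:

    # on both client and server: show only SYN / SYN-ACK / RST packets,
    # with inter-packet time deltas
    sudo tcpdump -i any -nn -ttt 'tcp port 9160 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)'

    # on the server: rough count of Thrift sockets per TCP state
    netstat -ant | grep ':9160' | awk '{print $6}' | sort | uniq -c

A multi-second gap between the client's SYN and the server's SYN-ACK points
at the network or at a backlog/capacity problem on the server; a prompt
handshake followed by a slow first Thrift response points higher up the
stack.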


-- 
Tyler Hobbs
DataStax <http://datastax.com/>
