Just a quick update: I was able to fix the problem by reverting the patch for CASSANDRA-8336 in our custom Cassandra build. I don't know the root cause yet, though. I will open a JIRA ticket and post it here for reference later.

On Fri, Jun 12, 2015 at 11:31 AM, Paulo Ricardo Motta Gomes <paulo.mo...@chaordicsystems.com> wrote:

> Hello,
>
> We recently upgraded a cluster from 2.0.12 to 2.0.15, and now whenever we
> stop or kill a Cassandra process, some other nodes keep a connection to the
> dead node in the CLOSE_WAIT state on port 7000 for about 5-20 minutes.
>
> So, if I start the killed node again, it cannot handshake with the nodes
> that still hold a connection in the CLOSE_WAIT state until that connection
> is closed. The nodes therefore see each other as down for 5-20 minutes,
> until they can handshake again.
>
> I believe this is somehow related to the fixes for CASSANDRA-8336 and
> CASSANDRA-9238, and it could also be a duplicate of CASSANDRA-8072. I will
> continue to investigate to see if I can find more evidence, but any help at
> this point would be appreciated, or at least a confirmation that it could
> be related to any of these tickets.
>
> Cheers,
>
> --
> *Paulo Motta*
>
> Chaordic | *Platform*
> *www.chaordic.com.br <http://www.chaordic.com.br/>*
> +55 48 3232.3200

--
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200
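
For anyone who wants to check a node for the lingering connections described in the quoted message, here is a minimal sketch in Python. It is not taken from Cassandra or from any ticket. It assumes a Linux host, IPv4 connections only (it reads the `/proc/net/tcp` format, not `/proc/net/tcp6`), and a little-endian machine for decoding the addresses. The function names are made up for illustration, and port 7000 is simply the port mentioned above.

```python
import socket
import struct

CLOSE_WAIT = "08"     # state code for CLOSE_WAIT in /proc/net/tcp
STORAGE_PORT = 7000   # inter-node port mentioned in the thread


def _decode(addr):
    """Turn 'HEXIP:HEXPORT' from /proc/net/tcp into (ip, port).

    Assumes a little-endian host, where the IPv4 bytes appear reversed.
    """
    ip_hex, port_hex = addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def close_wait_connections(proc_net_tcp_text, port=STORAGE_PORT):
    """Return (local, remote) pairs in CLOSE_WAIT that involve `port`."""
    found = []
    # Skip the header line, then inspect one socket per line.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4 or fields[3] != CLOSE_WAIT:
            continue
        local = _decode(fields[1])
        remote = _decode(fields[2])
        # The port may be on either side, depending on who opened the socket.
        if port in (local[1], remote[1]):
            found.append((local, remote))
    return found


def scan_local_node(path="/proc/net/tcp"):
    """Read the live socket table (Linux only) and report matches."""
    with open(path) as f:
        return close_wait_connections(f.read())
```

Calling `scan_local_node()` on a surviving node shortly after killing a peer should list any connections to that peer still stuck in CLOSE_WAIT. Running it periodically would show how long they take to disappear.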