We are running into a reproducible issue in one of our Cassandra clusters.
During an anti-entropy repair, if a particular sstable is streaming to multiple
endpoints and two of the streams happen to hit the same section of the sstable,
all streams on the source node stall indefinitely. The only way we can clear
this is to restart Cassandra on the node, or to force the sockets to time out
by dropping the switch port, dropping networking, etc. The underlying TCP
connection shows ESTABLISHED on both the source and target nodes, so
Cassandra's socket timeouts are not triggering. It seems that some sort of
deadlock is happening inside the source node's streaming manager?
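
A thread dump of the source node (for example from jstack) is probably the
quickest way to confirm or rule out a deadlock. Below is a minimal Python
sketch that filters such a dump for threads that are not runnable. The function
name, the default "Streaming" name filter, and the assumption that the
streaming threads carry that word in their names are all illustrative guesses,
not something taken from the Cassandra source.

```python
import re

# Header line of a thread in a HotSpot-style thread dump, e.g.
#   "some thread name" daemon prio=10 tid=0x... nid=0x... waiting on condition
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# State line that follows the header, e.g.
#   java.lang.Thread.State: BLOCKED (on object monitor)
THREAD_STATE = re.compile(r'java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def stuck_threads(dump_text, name_filter="Streaming",
                  states=("BLOCKED", "WAITING", "TIMED_WAITING")):
    """Return (thread name, state) pairs for matching non-runnable threads.

    name_filter is a hypothetical substring of the thread names of interest;
    adjust it to whatever the dump actually shows.
    """
    found = []
    current = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        state = THREAD_STATE.search(line)
        if state and current is not None:
            if name_filter in current and state.group("state") in states:
                found.append((current, state.group("state")))
            current = None  # only the first state line belongs to this thread
    return found
```

Feeding it the text of two dumps taken a minute or so apart and seeing the same
threads reported both times would at least suggest where the streams are stuck.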

We are running Cassandra 1.2.5. I have checked through the change logs up to
1.2.16 and do not see any indication of this being a known (and fixed) issue.
I think the perfect storm that allows this to happen is that none of the target
nodes have the sstable, and streamthroughput is such that the streams are
running at similar speeds.

Example output from nodetool netstats is below. Progress does not change, and
no additional data can be streamed to these endpoints because the first file
never completes, which effectively stalls repairs:

Mode: NORMAL
Streaming to: /172.24.58.23
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-699-Data.db sections=1445 progress=41943040/66686679 - 62%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-702-Data.db sections=1409 progress=0/675554186 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-781-Data.db sections=1448 progress=0/5578074 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-704-Data.db sections=1457 progress=0/263084543 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-705-Data.db sections=1419 progress=0/267463691 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-771-Data.db sections=1449 progress=0/69152270 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-700-Data.db sections=1394 progress=0/185688159 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-698-Data.db sections=1421 progress=0/748217766 - 0%
Streaming to: /172.24.58.33
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-699-Data.db sections=1445 progress=20971520/66686679 - 31%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-702-Data.db sections=1409 progress=0/675554186 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-781-Data.db sections=1448 progress=0/5578074 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-704-Data.db sections=1457 progress=0/263084543 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-705-Data.db sections=1419 progress=0/267463691 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-771-Data.db sections=1449 progress=0/69152270 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-700-Data.db sections=1394 progress=0/185688159 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-698-Data.db sections=1421 progress=0/748217766 - 0%
Streaming to: /172.24.58.24
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-699-Data.db sections=1445 progress=20971520/66686679 - 31%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-783-Data.db sections=1447 progress=0/2596067 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-702-Data.db sections=1409 progress=0/675554186 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-781-Data.db sections=1448 progress=0/5578074 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-704-Data.db sections=1457 progress=0/263084543 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-705-Data.db sections=1419 progress=0/267463691 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-771-Data.db sections=1449 progress=0/69152270 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-700-Data.db sections=1394 progress=0/185688159 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-784-Data.db sections=1448 progress=0/8519551 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-698-Data.db sections=1421 progress=0/748217766 - 0%
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a         0       39393765
Responses                       n/a         0       21929307
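
To make "progress does not change" checkable rather than eyeballed, two
netstats snapshots taken some time apart can be compared. The following is a
rough Python sketch under the assumption that the output looks like the text
above (a "Streaming to:" line per endpoint, then a path and a progress=sent/total
field per file, on one line or wrapped onto two). The function names are
made up for illustration, and "Streaming from:" sections are not handled.

```python
import re

PEER = re.compile(r'^Streaming to: (?P<peer>\S+)')
FILE = re.compile(r'(?P<path>/\S+-Data\.db)')
PROGRESS = re.compile(r'progress=(?P<done>\d+)/(?P<total>\d+)')


def parse_netstats(text):
    """Map (peer, sstable path) -> (bytes sent, total bytes)."""
    streams = {}
    peer = None
    path = None
    for line in text.splitlines():
        m = PEER.match(line.strip())
        if m:
            peer, path = m.group("peer"), None
            continue
        f = FILE.search(line)
        if f:
            path = f.group("path")
        # The progress field may sit on the same line as the path or on the next.
        p = PROGRESS.search(line)
        if p and peer is not None and path is not None:
            streams[(peer, path)] = (int(p.group("done")), int(p.group("total")))
            path = None
    return streams


def stalled(before, after):
    """Streams that have started, are unfinished, and did not move between snapshots."""
    return sorted(
        key for key, (done, total) in after.items()
        if 0 < done < total and before.get(key) == (done, total)
    )
```

Calling stalled(parse_netstats(first), parse_netstats(second)) on two captures
of the output above would list only the partially sent files; files still at 0%
are treated as queued rather than stalled.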

I would appreciate any feedback or advice on this. Thanks,
-Andrew
andrew.coo...@nisc.coop
