We are running into a reproducible issue in one of our Cassandra clusters. During an anti-entropy repair, if a particular sstable is being streamed to multiple endpoints and two of the streams happen to hit the same section of the sstable, all streams on the source node stall indefinitely. The only ways we have found to clear this are to restart Cassandra on the node or to force the sockets to time out (dropping the switch port, taking down networking, etc.). The underlying TCP connections show as established on both the source and target nodes, so Cassandra's socket timeouts are not triggering. Could some sort of deadlock be happening inside the source node's streaming manager?
We are running Cassandra 1.2.5. I have checked through the change logs up to 1.2.16 and do not see any indication that this is a known (and fixed) issue. I think the perfect storm that allows this to happen is that none of the target nodes have the sstable, and the stream throughput is such that the streams run at similar speeds.

Example output from nodetool netstats is below. The progress does not change, and no additional data can be streamed to these endpoints because the first file never completes, which effectively stalls repairs.

Mode: NORMAL
Streaming to: /172.24.58.23
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-699-Data.db sections=1445 progress=41943040/66686679 - 62%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-702-Data.db sections=1409 progress=0/675554186 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-781-Data.db sections=1448 progress=0/5578074 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-704-Data.db sections=1457 progress=0/263084543 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-705-Data.db sections=1419 progress=0/267463691 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-771-Data.db sections=1449 progress=0/69152270 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-700-Data.db sections=1394 progress=0/185688159 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-698-Data.db sections=1421 progress=0/748217766 - 0%
Streaming to: /172.24.58.33
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-699-Data.db sections=1445 progress=20971520/66686679 - 31%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-702-Data.db sections=1409 progress=0/675554186 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-781-Data.db sections=1448 progress=0/5578074 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-704-Data.db sections=1457 progress=0/263084543 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-705-Data.db sections=1419 progress=0/267463691 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-771-Data.db sections=1449 progress=0/69152270 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-700-Data.db sections=1394 progress=0/185688159 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-698-Data.db sections=1421 progress=0/748217766 - 0%
Streaming to: /172.24.58.24
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-699-Data.db sections=1445 progress=20971520/66686679 - 31%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-783-Data.db sections=1447 progress=0/2596067 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-702-Data.db sections=1409 progress=0/675554186 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-781-Data.db sections=1448 progress=0/5578074 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-704-Data.db sections=1457 progress=0/263084543 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-705-Data.db sections=1419 progress=0/267463691 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-771-Data.db sections=1449 progress=0/69152270 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-700-Data.db sections=1394 progress=0/185688159 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-784-Data.db sections=1448 progress=0/8519551 - 0%
   /usr/lib/cassandra/data/data/mdm/mvec_intervals/mdm-mvec_intervals-ic-698-Data.db sections=1421 progress=0/748217766 - 0%
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a         0       39393765
Responses                       n/a         0       21929307

I would appreciate any feedback or advice on this.

Thanks,
-Andrew
andrew.coo...@nisc.coop
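
P.S. For anyone who wants to check a node for the same symptom, here is a minimal Python sketch that compares two saved copies of the netstats output and lists the files that are partially sent but made no progress in between. It is not part of Cassandra: the function names parse_netstats and stalled are made up for illustration, and the regular expression assumes the 1.2-style per-file line format shown in the output above.

```python
import re

# Assumption: per-file lines look like the 1.2.x output quoted in this message:
#   <path>-Data.db sections=<n> progress=<done>/<total> - <pct>%
# and each group of files is introduced by "Streaming to: /<ip>" or
# "Streaming from: /<ip>". Other Cassandra versions may print a different layout.
TOKEN_RE = re.compile(
    r"Streaming (?P<dir>to|from):\s+(?P<peer>/\S+)"
    r"|(?P<file>/\S+)\s+sections=\d+\s+progress=(?P<done>\d+)/(?P<total>\d+)"
)


def parse_netstats(text):
    """Return {(direction, peer, file): (bytes_done, bytes_total)}.

    The text is scanned as a token stream rather than line by line, so it
    also works on output whose line breaks were lost in an email.
    """
    progress = {}
    current = None  # (direction, peer) of the section being read
    for m in TOKEN_RE.finditer(text):
        if m.group("peer"):
            current = (m.group("dir"), m.group("peer"))
        elif current is not None:
            key = current + (m.group("file"),)
            progress[key] = (int(m.group("done")), int(m.group("total")))
    return progress


def stalled(before, after):
    """Keys of files that are partially transferred in `after` and whose
    progress is identical to what it was in `before`."""
    return sorted(
        key
        for key, (done, total) in after.items()
        if 0 < done < total and before.get(key) == (done, total)
    )
```

To use it, save two snapshots some minutes apart (for example `nodetool netstats > a.txt`, then later `nodetool netstats > b.txt`) and call stalled(parse_netstats(open("a.txt").read()), parse_netstats(open("b.txt").read())). Files still at 0 bytes are deliberately ignored, since they are only queued behind the first file.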