I ran into the same problem before: http://comments.gmane.org/gmane.comp.db.cassandra.user/25334
I have not found any solutions yet.

Bill

On Mon, Jul 16, 2012 at 11:10 AM, Bart Swedrowski <b...@timedout.org> wrote:

> On 16 July 2012 11:25, aaron morton <aa...@thelastpickle.com> wrote:
>
>> In the before time someone had problems with a switch/router that was
>> dropping persistent but idle connections. Doubt this applies, and it would
>> probably result in an error, just throwing it out there.
>
> Yes, been through them few times. There's literally no errors or warning
> at all. And sometimes, as aforementioned, there's actually INFO that
> merkle tree has been sent where the other side is not receiving it.
>
> Just now, I kicked off manual repair on node with IP 192.168.94.178 and
> just got stuck on streaming files again.
>
> Node 192.168.94.179:
>
> Streaming from: /192.168.81.5
>> Medals: /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db
>> sections=46 progress=0/5096 - 0%
>> Medals: /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db
>> sections=244 progress=0/1548510 - 0%
>> Medals: /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db
>> sections=228 progress=0/82859 - 0%
>
> Node 192.168.81.5:
>
> Streaming to: /192.168.94.179
>> /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2
>> progress=168/168 - 100%
>> /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244
>> progress=0/1548510 - 0%
>> /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46
>> progress=0/5096 - 0%
>> /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228
>> progress=0/82859 - 0%
>
> Looks like streaming this specific SSTable hasn't finished (or been ACKed
> on the other side)
>
> /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2
>> progress=168/168 - 100%
>
> This morning I've tightened monitoring so now we've each node monitoring
> each other with ICMP packets (20 every minute) and monitoring is silent; no
> issues reported since the morning, not a single packet lost.
>
> I got some help from Acunu guys, first we believed we fixed the problem by
> disabling bonding on the servers and blamed it for messing up stuff with
> interrupts however this morning problem resurfaced.
>
> I can see (and Acunu says) everything is pointing to network related
> problem (although I'd expect IP stack to correct simple PL) but there's no
> way to back this up (unless only Cassandra related traffic is getting lost
> but *how* to monitor for it???).
>
> Honestly, running out of ideas - further advice highly appreciated.
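
On the "how to monitor for it" question, one rough option is to watch the streaming progress itself rather than ICMP. The sketch below is only an illustration, not a fix: it assumes the progress listings quoted above are the text printed by `nodetool netstats`, that each entry keeps the `path sections=N progress=done/total` shape shown there, and that something else captures two snapshots a few minutes apart. The names `parse_progress` and `stalled` are made up for this example.

```python
import re

# Matches entries shaped like the ones quoted in this thread, e.g.
#   /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
# \s+ also spans line breaks, so entries wrapped over two lines still match.
STREAM_RE = re.compile(
    r"(?P<path>/\S+-Data\.db)\s+sections=\d+\s+progress=(?P<done>\d+)/(?P<total>\d+)"
)


def parse_progress(netstats_text):
    """Return {sstable_path: (bytes_done, bytes_total)} for each stream entry."""
    return {
        m["path"]: (int(m["done"]), int(m["total"]))
        for m in STREAM_RE.finditer(netstats_text)
    }


def stalled(before, after):
    """Paths seen in both snapshots that are unfinished and made no progress."""
    return sorted(
        path
        for path, (done, total) in after.items()
        if path in before and before[path][0] == done and done < total
    )


if __name__ == "__main__":
    # Sample copied from the output quoted above (node 192.168.81.5).
    sample = """Streaming to: /192.168.94.179
    /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
    /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
    """
    snapshot = parse_progress(sample)
    # Comparing a snapshot with itself stands in for "nothing moved between polls".
    for path in stalled(snapshot, snapshot):
        print("no progress:", path)
```

Run from cron against two saved snapshots, anything `stalled` reports would at least timestamp when a stream stopped moving, which could then be lined up against switch or interface counters.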