Even if it is a network error, it would be good to detect it. If you can run a small repair with those log settings, I can take a look at the logs if you want. I cannot promise anything, but another set of eyes may help.
Ping me off list if you want to send me the logs.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/07/2012, at 4:32 AM, Bill Au wrote:

> I had run into the same problem before:
>
> http://comments.gmane.org/gmane.comp.db.cassandra.user/25334
>
> I have not found any solutions yet.
>
> Bill
>
> On Mon, Jul 16, 2012 at 11:10 AM, Bart Swedrowski <b...@timedout.org> wrote:
>
>
> On 16 July 2012 11:25, aaron morton <aa...@thelastpickle.com> wrote:
> In the before time someone had problems with a switch/router that was
> dropping persistent but idle connections. Doubt this applies, and it would
> probably result in an error, just throwing it out there.
>
> Yes, been through them a few times. There are literally no errors or warnings
> at all. And sometimes, as aforementioned, there's actually an INFO message
> that the merkle tree has been sent, while the other side is not receiving it.
>
> Just now, I kicked off a manual repair on the node with IP 192.168.94.178 and
> it just got stuck on streaming files again.
>
> Node 192.168.94.179:
>
> Streaming from: /192.168.81.5
> Medals: /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46
> progress=0/5096 - 0%
> Medals: /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244
> progress=0/1548510 - 0%
> Medals: /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228
> progress=0/82859 - 0%
>
> Node 192.168.81.5:
>
> Streaming to: /192.168.94.179
> /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2
> progress=168/168 - 100%
> /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244
> progress=0/1548510 - 0%
> /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46
> progress=0/5096 - 0%
> /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228
> progress=0/82859 - 0%
>
> Looks like streaming this specific SSTable hasn't finished (or been ACKed on
> the other side):
>
> /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2
> progress=168/168 - 100%
>
> This morning I tightened monitoring, so now each node monitors the others
> with ICMP packets (20 every minute), and monitoring is silent; no issues
> reported since the morning, not a single packet lost.
>
> I got some help from the Acunu guys. At first we believed we had fixed the
> problem by disabling bonding on the servers, and blamed it for messing up
> stuff with interrupts; however, this morning the problem resurfaced.
>
> I can see (and Acunu says) everything is pointing to a network-related problem
> (although I'd expect the IP stack to correct simple packet loss), but there's
> no way to back this up (unless only Cassandra-related traffic is getting lost,
> but *how* to monitor for it???).
>
> Honestly, running out of ideas - further advice highly appreciated.
>
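
One possible way to at least detect the stall quoted above automatically is to compare two snapshots of the streaming status taken a few minutes apart and flag files whose progress counter has not moved. The following is only a rough sketch: the function names `parse_progress` and `stalled_files` are made up for illustration, and the regular expression assumes the raw, unwrapped form of the per-file lines shown in the thread (one file per line, without the `>` quote markers the mail client added).

```python
import re
from typing import Dict, List, Tuple

# Assumed line shape, taken from the output quoted in this thread:
#   /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
FILE_LINE = re.compile(
    r"(?P<path>/\S+-Data\.db)\s+sections=\d+\s+progress=(?P<done>\d+)/(?P<total>\d+)"
)


def parse_progress(status_text: str) -> Dict[str, Tuple[int, int]]:
    """Map each SSTable path to its (done, total) progress counters."""
    progress = {}
    for match in FILE_LINE.finditer(status_text):
        progress[match.group("path")] = (
            int(match.group("done")),
            int(match.group("total")),
        )
    return progress


def stalled_files(before: str, after: str) -> List[str]:
    """Return unfinished files whose 'done' counter is identical in both snapshots."""
    old = parse_progress(before)
    new = parse_progress(after)
    return sorted(
        path
        for path, (done, total) in new.items()
        if path in old and done < total and done == old[path][0]
    )
```

Feeding it two captures of the status output, taken some minutes apart on the same node, would list the SSTables that made no progress in between. It deliberately ignores files already at 100%, so a transfer that completed but was never acknowledged (like the 168/168 file above) would need a separate check.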