Re: Never ending manual repair after adding second DC

Bill Au Mon, 16 Jul 2012 09:32:49 -0700

I ran into the same problem before:

http://comments.gmane.org/gmane.comp.db.cassandra.user/25334


I have not found any solutions yet.

Bill

On Mon, Jul 16, 2012 at 11:10 AM, Bart Swedrowski <b...@timedout.org> wrote:

>
>
> On 16 July 2012 11:25, aaron morton <aa...@thelastpickle.com> wrote:
>
>> In the before time someone had problems with a switch/router that was
>> dropping persistent but idle connections. Doubt this applies, and it would
>> probably result in an error, just throwing it out there.
>>
>
> Yes, I've been through them a few times.  There are literally no errors or
> warnings at all.  And sometimes, as mentioned before, there's actually an
> INFO message that the merkle tree has been sent while the other side never
> receives it.
>
> Just now, I kicked off a manual repair on the node with IP 192.168.94.178
> and it just got stuck on streaming files again.
>
> Node 192.168.94.179:
>
> Streaming from: /192.168.81.5
>    Medals: /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
>    Medals: /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
>    Medals: /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%
>
>
> Node 192.168.81.5:
>
> Streaming to: /192.168.94.179
>    /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
>    /var/lib/cassandra/data/Medals/dataa-hd-1128-Data.db sections=244 progress=0/1548510 - 0%
>    /var/lib/cassandra/data/Medals/dataa-hd-1127-Data.db sections=46 progress=0/5096 - 0%
>    /var/lib/cassandra/data/Medals/dataa-hd-1119-Data.db sections=228 progress=0/82859 - 0%
>
>
> Looks like streaming this specific SSTable hasn't finished (or been ACKed
> on the other side)
>
>    /var/lib/cassandra/data/Medals/dataa-hd-1129-Data.db sections=2 progress=168/168 - 100%
>
>
> This morning I tightened monitoring, so now every node monitors every
> other node with ICMP packets (20 every minute), and monitoring is silent:
> no issues reported since the morning, not a single packet lost.
>
> I got some help from the Acunu guys.  At first we believed we had fixed the
> problem by disabling bonding on the servers, and blamed bonding for messing
> up interrupts; however, this morning the problem resurfaced.
>
> I can see (and Acunu says) that everything points to a network-related
> problem (although I'd expect the IP stack to recover from simple packet
> loss), but there's no way to back this up (unless only Cassandra-related
> traffic is getting lost, but *how* to monitor for it???).
>
> Honestly, running out of ideas - further advice highly appreciated.
>
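
For anyone trying to watch for this, the stuck transfers can be picked out of
the streaming output quoted above mechanically. The following is only a
minimal sketch, assuming the text has the same "progress=<done>/<total>"
layout as the quoted listing (which looks like `nodetool netstats` output);
the function name stalled_files and the two regular expressions are made up
for illustration and are not part of Cassandra. It tolerates the "> " quote
prefixes and lines that were wrapped by the mail client.

```python
import re

# Field layout taken from the output quoted in this thread (an assumption
# about the general format, not a documented contract).
PROGRESS = re.compile(r"progress=(\d+)/(\d+)")
SSTABLE = re.compile(r"(/\S+-Data\.db)")


def stalled_files(netstats_text):
    """Return (path, total_bytes) for each file whose progress is still 0."""
    stalled = []
    current_path = None
    for raw in netstats_text.splitlines():
        # Drop mail quote markers and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        path_match = SSTABLE.search(line)
        if path_match:
            current_path = path_match.group(1)
        progress_match = PROGRESS.search(line)
        if progress_match and current_path is not None:
            done = int(progress_match.group(1))
            total = int(progress_match.group(2))
            if done == 0 and total > 0:
                stalled.append((current_path, total))
            # The progress field closes the entry for this file.
            current_path = None
    return stalled
```

Running it on two captures taken a few minutes apart and keeping only the
files reported both times would separate transfers that are merely queued
from ones that never move. That only shows where the stream is stuck, not
why.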
