Ah, unreadable rows and in the validation compaction no less. Makes a little 
more sense now. 

Can anyone help with the EOF when deserializing columns? Is the fix to run 
scrub, or to drop the sstable?

Here's a theory: AES is trying to...

1) Create TreeRequests that specify a range we want to validate. 
2) Send the TreeRequests to the local node and its neighbours.
3) Process each TreeRequest by running a validation compaction 
(CompactionManager.doValidationCompaction in your previous stacks).
4) When both TreeRequests return, work out the differences between the trees 
and then stream data if needed (see the sketch below). 
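
To make step 4 concrete, here is a minimal sketch (all names are invented for 
illustration; the real AntiEntropyService walks proper Merkle trees, not flat 
maps): each side produces a digest per sub-range, and any sub-range whose 
digests differ is a candidate for streaming.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Toy sketch of step 4: compare per-range digests from the local and
    // remote trees and collect the ranges that disagree. Not Cassandra code.
    class TreeDiffSketch {

        // range label -> digest of the rows in that range
        static Set<String> rangesToStream(Map<String, byte[]> localTree,
                                          Map<String, byte[]> remoteTree) {
            Set<String> differing = new HashSet<String>();
            for (Map.Entry<String, byte[]> e : localTree.entrySet()) {
                byte[] remote = remoteTree.get(e.getKey());
                if (remote == null || !Arrays.equals(e.getValue(), remote)) {
                    differing.add(e.getKey()); // out of sync: stream this range
                }
            }
            return differing;
        }

        public static void main(String[] args) {
            Map<String, byte[]> local = new HashMap<String, byte[]>();
            Map<String, byte[]> remote = new HashMap<String, byte[]>();
            local.put("0..100", new byte[] { 1 });
            local.put("100..200", new byte[] { 2 });
            remote.put("0..100", new byte[] { 1 });
            remote.put("100..200", new byte[] { 9 }); // differs -> mismatch
            System.out.println(rangesToStream(local, remote)); // [100..200]
        }
    }

The real comparison recurses down two Merkle trees to find the differing token 
ranges, but the idea is the same: only mismatched ranges need streaming.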

Perhaps step 3 is not completing because of errors like 
http://www.mail-archive.com/user@cassandra.apache.org/msg12196.html . If the 
row is spread over multiple sstables we can skip the row in one sstable. 
However, if it's in a single sstable, PrecompactedRow will raise an IOError if 
there is a problem. That's not what the linked error stack shows (there the 
row is being skipped); it's just a hunch we could check out.
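
As a rough sketch of that skip-vs-fail behaviour (RowReader and these method 
names are invented; the real logic lives around PrecompactedRow and the 
compaction iterators):

    import java.io.IOError;
    import java.io.IOException;

    // Hypothetical reader interface, for illustration only.
    interface RowReader {
        void deserializeColumns() throws IOException;
        void skipRow();
    }

    class RowValidationSketch {
        static void validateRow(RowReader reader, boolean rowInOtherSSTables) {
            try {
                reader.deserializeColumns();
            } catch (IOException eof) {
                if (rowInOtherSSTables) {
                    reader.skipRow();       // another sstable has a copy: skip and keep going
                } else {
                    throw new IOError(eof); // only copy is unreadable: hard failure
                }
            }
        }
    }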

Do you see any IOErrors (not exceptions) in the logs, or exceptions with 
doValidationCompaction in the stack?
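
Something like the following should turn them up (assuming the default log 
location; adjust the path for your install):

    grep -n -E 'IOError|doValidationCompaction' /var/log/cassandra/system.log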

For a tree request, on the node you start the repair on you should see these 
logs...
1) "Waiting for repair requests..."
2) One of "Stored local tree" or "Stored remote tree" (logged at DEBUG level; 
see the note below on enabling it), depending on which returns first
3) "Queuing comparison"

If you do not see the 3rd log message then we did not get a reply from either 
the local or the remote node. 
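Also note the DEBUG message in step 2 will only show up if the log level 
allows it. Assuming the usual 0.7 layout, a line like this in 
conf/log4j-server.properties should enable it:

    log4j.logger.org.apache.cassandra.service.AntiEntropyService=DEBUG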

Aaron

On 13 Apr 2011, at 00:57, Jonathan Colby wrote:

> There is no "Repair session" message either.   It just starts with a message 
> like:
> 
> INFO [manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723] 2011-04-10 
> 14:00:59,051 AntiEntropyService.java (line 770) Waiting for repair requests: 
> [#<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, 
> /10.46.108.101, (DFS,main)>, #<TreeRequest 
> manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.100, 
> (DFS,main)>, #<TreeRequest 
> manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.102, 
> (DFS,main)>, #<TreeRequest 
> manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.101, 
> (DFS,main)>]
> 
> NETSTATS:
> 
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0         150846
> Responses                       n/a         0         443183
> 
> One node in our cluster still has "unreadable rows", where the reads trip up 
> every time for certain sstables (you've probably seen my earlier threads 
> regarding that).   My suspicion is that the bloom filter read on the node 
> with the corrupt sstables is never reporting back to the repair, thereby 
> causing it to hang.
> 
> 
> What would be great is a scrub tool that ignores unreadable/unserializable 
> rows!  : )
> 
> 
> On Apr 12, 2011, at 2:15 PM, aaron morton wrote:
> 
>> Do you see a message starting "Repair session " and ending with "completed 
>> successfully" ?
>> 
>> Or do you see any streaming activity using "nodetool netstats"
>> 
>> Repair can hang if a neighbour dies and fails to send a requested stream. It 
>> will time out after 24 hours (I think). 
>> 
>> Aaron
>> 
>> On 12 Apr 2011, at 23:39, Karl Hiramoto wrote:
>> 
>>> On 12/04/2011 13:31, Jonathan Colby wrote:
>>>> There are a few other threads related to problems with the nodetool repair 
>>>> in 0.7.4.  However I'm not seeing any errors, just never getting a message 
>>>> that the repair completed successfully.
>>>> 
>>>> In my production and test clusters (with just a few MB of data), the 
>>>> nodetool repair command never returns, and the last entry in 
>>>> cassandra.log is always something like:
>>>> 
>>>> #<TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, 
>>>> /10.46.108.102, (DFS,main)>  completed successfully: 1 outstanding
>>>> 
>>>> But I don't see a message, even hours later, that the 1 outstanding 
>>>> request "finished successfully".
>>>> 
>>>> Anyone else experience this?  These are physical server nodes in local 
>>>> data centers, not EC2.
>>>> 
>>> 
>>> I've seen this.  To fix it, try a "nodetool compact", then repair.
>>> 
>>> 
>>> --
>>> Karl
>> 
> 
