Great tips.  I will investigate further with your suggestions in mind.
Hopefully the problem has gone away since I pulled in fresh data on the
node with problems.

On Apr 13, 2011, at 3:54 AM, aaron morton wrote:

> Ah, unreadable rows, and in the validation compaction no less. Makes a
> little more sense now.
> 
> Can anyone help with the EOF when deserializing columns? Is the fix to run
> scrub or to drop the sstable?
> 
> Here's a theory (sketched in code below the steps): AES is trying to...
> 
> 1) Create TreeRequests that specify the ranges we want to validate.
> 2) Send the TreeRequests to the local node and each neighbour.
> 3) Process each TreeRequest by running a validation compaction
> (CompactionManager.doValidationCompaction in your prev stacks).
> 4) When all the TreeRequests have returned, work out the differences and
> then stream data if needed.
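> 
> Here's a rough sketch of steps 1-2 in Java. All class and method names
> below are invented for illustration; they are not the real Cassandra 0.7
> internals, just a toy model of the shape of the flow:
> 
>     import java.util.ArrayList;
>     import java.util.List;
> 
>     // Toy model: one TreeRequest per endpoint for the
>     // (keyspace, column family) being repaired.
>     public class TreeRequestSketch {
>         static class TreeRequest {
>             final String sessionId, endpoint, cfDescription;
>             TreeRequest(String sessionId, String endpoint, String cf) {
>                 this.sessionId = sessionId;
>                 this.endpoint = endpoint;
>                 this.cfDescription = cf;
>             }
>             public String toString() {
>                 return "#<TreeRequest " + sessionId + ", " + endpoint
>                        + ", " + cfDescription + ">";
>             }
>         }
> 
>         public static void main(String[] args) {
>             List<TreeRequest> requests = new ArrayList<TreeRequest>();
>             // Step 1: one request per replica of the range being validated
>             // ("manual-repair-<uuid>" stands in for a real session id).
>             for (String endpoint : new String[] {"local", "/10.47.108.100"})
>                 requests.add(new TreeRequest("manual-repair-<uuid>",
>                                              endpoint, "(DFS,main)"));
>             // Step 2: send them out; each one triggers a validation
>             // compaction (step 3) that builds and returns a Merkle tree.
>             System.out.println("Waiting for repair requests: " + requests);
>         }
>     }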
> 
> Perhaps step 3 is not completing because of errors like the one in
> http://www.mail-archive.com/user@cassandra.apache.org/msg12196.html
> If the row is spread over multiple sstables we can skip the row in one
> sstable. However, if it's in a single sstable, PrecompactedRow will raise
> an IOError if there is a problem. That's not what the linked error stack
> shows (there the row is skipped); it's just a hunch we could check out.
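> 
> As a toy illustration of that difference (again, invented names, not the
> real compaction code):
> 
>     import java.io.IOError;
>     import java.io.IOException;
> 
>     public class CorruptRowSketch {
>         // Stand-in for deserializing a row; simulates the corruption.
>         static void deserializeRow(String sstable) throws IOException {
>             throw new IOException("EOF deserializing row in " + sstable);
>         }
> 
>         // Row present in several sstables: the bad copy can be skipped
>         // and the validation carries on with the remaining copies.
>         static void mergeAcrossSSTables(String[] sstables) {
>             for (String s : sstables) {
>                 try {
>                     deserializeRow(s);
>                 } catch (IOException e) {
>                     System.out.println("Skipping unreadable row in " + s);
>                 }
>             }
>         }
> 
>         // Row in a single sstable: nothing to fall back on, so the
>         // failure escapes as an IOError and the validation never
>         // finishes for that range.
>         static void precompactSingleSSTable(String sstable) {
>             try {
>                 deserializeRow(sstable);
>             } catch (IOException e) {
>                 throw new IOError(e);
>             }
>         }
> 
>         public static void main(String[] args) {
>             mergeAcrossSSTables(new String[] {"sstable-1", "sstable-2"});
>             precompactSingleSSTable("sstable-3");   // dies with IOError
>         }
>     }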
> 
> Do you see any IOErrors (not exceptions) in the logs, or exceptions with
> doValidationCompaction in the stack?
> 
> For a tree request, on the node you start the repair on you should see
> these logs...
> 1) "Waiting for repair requests..."
> 2) One of "Stored local tree" or "Stored remote tree" (at DEBUG level),
> depending on which returns first
> 3) "Queuing comparison"
> 
> If we do not see the 3rd log then we did not get a reply from either the
> local or the remote node.
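> 
> A toy model of why that third log line matters (invented names, not the
> real AntiEntropyService code): the comparison is only queued once every
> tree has been stored, so a validation that dies on one node silently
> stalls the whole session.
> 
>     import java.util.concurrent.CountDownLatch;
>     import java.util.concurrent.TimeUnit;
> 
>     public class TreeRendezvousSketch {
>         public static void main(String[] args) throws InterruptedException {
>             CountDownLatch trees = new CountDownLatch(2);  // local + remote
> 
>             System.out.println("Stored local tree");
>             trees.countDown();
>             // Suppose the remote validation compaction died with an
>             // IOError: the second countDown() never happens...
> 
>             if (trees.await(5, TimeUnit.SECONDS))
>                 System.out.println("Queuing comparison");
>             else
>                 System.out.println("No reply from remote; repair hangs");
>         }
>     }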
> 
> Aaron
> 
> On 13 Apr 2011, at 00:57, Jonathan Colby wrote:
> 
>> There is no "Repair session" message either.  It just starts with a message
>> like:
>> 
>> INFO [manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723] 2011-04-10 
>> 14:00:59,051 AntiEntropyService.java (line 770) Waiting for repair requests: 
>> [#<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, 
>> /10.46.108.101, (DFS,main)>, #<TreeRequest 
>> manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.100, 
>> (DFS,main)>, #<TreeRequest 
>> manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.102, 
>> (DFS,main)>, #<TreeRequest 
>> manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.101, 
>> (DFS,main)>]
>> 
>> NETSTATS:
>> 
>> Mode: Normal
>> Not sending any streams.
>> Not receiving any streams.
>> Pool Name                    Active   Pending      Completed
>> Commands                        n/a         0         150846
>> Responses                       n/a         0         443183
>> 
>> One node in our cluster still has "unreadable rows", where the reads trip up
>> every time for certain sstables (you've probably seen my earlier threads
>> regarding that).  My suspicion is that the bloom filter read on the node
>> with the corrupt sstables never reports back to the repair, thereby causing
>> it to hang.
>> 
>> 
>> What would be great is a scrub tool that ignores unreadable/unserializable 
>> rows!  : )
>> 
>> 
>> On Apr 12, 2011, at 2:15 PM, aaron morton wrote:
>> 
>>> Do you see a message starting "Repair session " and ending with "completed 
>>> successfully" ?
>>> 
>>> Or do you see any streaming activity using "nodetool netstats"
>>> 
>>> Repair can hang if a neighbour dies and fails to send a requested stream. 
>>> It will timeout after 24 hours (I think). 
>>> 
>>> Aaron
>>> 
>>> On 12 Apr 2011, at 23:39, Karl Hiramoto wrote:
>>> 
>>>> On 12/04/2011 13:31, Jonathan Colby wrote:
>>>>> There are a few other threads related to problems with nodetool repair in
>>>>> 0.7.4. However, I'm not seeing any errors; I'm just never getting a
>>>>> message that the repair completed successfully.
>>>>> 
>>>>> In both my production and test clusters (with just a few MB of data), the
>>>>> nodetool repair prompt never returns and the last entry in cassandra.log
>>>>> is always something like:
>>>>> 
>>>>> #<TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, 
>>>>> /10.46.108.102, (DFS,main)>  completed successfully: 1 outstanding
>>>>> 
>>>>> But I don't see a message, even hours later, that the 1 outstanding 
>>>>> request "finished successfully".
>>>>> 
>>>>> Has anyone else experienced this?  These are physical server nodes in
>>>>> local data centers, not EC2.
>>>>> 
>>>> 
>>>> I've seen this.  To fix it, try a "nodetool compact", then repair.
>>>> 
>>>> 
>>>> --
>>>> Karl
>>> 
>> 
> 
