Great tips. I will investigate further with your suggestions in mind. Hopefully the problem has gone away, since I pulled in fresh data on the node that was having problems.
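Below is a rough sketch of how I might check for the log lines Aaron lists further down (the message strings come straight from his mail; the log path, and DEBUG logging being enabled so the tree messages show up, are assumptions on my part):

#!/usr/bin/env python
# Quick-and-dirty check for the repair stages Aaron describes below.
# Assumptions: the system log is at /var/log/cassandra/system.log and
# DEBUG logging is on (needed to see the "Stored ... tree" messages).
import sys

LOG = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/system.log"

STAGES = ["Waiting for repair requests",   # 1) repair session started
          "Stored local tree",             # 2) local validation returned
          "Stored remote tree",            # 2) remote validation returned
          "Queuing comparison"]            # 3) both trees arrived
ERRORS = ["IOError", "doValidationCompaction", "EOF"]

counts = dict((p, 0) for p in STAGES + ERRORS)
for line in open(LOG):
    for p in counts:
        if p in line:
            counts[p] += 1

for p in STAGES + ERRORS:
    print("%-30s %d" % (p, counts[p]))

if counts["Queuing comparison"] == 0:
    print("No 'Queuing comparison' logged: no reply from local or remote.")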
On Apr 13, 2011, at 3:54 AM, aaron morton wrote:

> Ah, unreadable rows and in the validation compaction no less. Makes a little
> more sense now.
>
> Can anyone help with the EOF when deserializing columns? Is the fix to run
> scrub or to drop the sstable?
>
> Here's a theory, AES is trying to...
>
> 1) Create TreeRequests that specify a range we want to validate.
> 2) Send the TreeRequests to the local node and a neighbour.
> 3) Process each TreeRequest by running a validation compaction
>    (CompactionManager.doValidationCompaction in your prev stacks).
> 4) When both TreeRequests return, work out the differences and then
>    stream data if needed.
>
> Perhaps step 3 is not completing because of errors like
> http://www.mail-archive.com/user@cassandra.apache.org/msg12196.html. If the
> row spans multiple sstables we can skip the row in one sstable. However, if
> it's in a single sstable, PrecompactedRow will raise an IOError if there is a
> problem. That is not what the linked error stack shows (there a row is being
> skipped); it's just a hunch we could check out.
>
> Do you see any IOErrors (not exceptions) in the logs, or exceptions with
> doValidationCompaction in the stack?
>
> For a tree request on the node you start the repair on, you should see these
> logs...
> 1) "Waiting for repair requests..."
> 2) One of "Stored local tree" or "Stored remote tree" (at DEBUG level),
>    depending on which returns first
> 3) "Queuing comparison"
>
> If we do not see the 3rd log then we did not get a reply from either the
> local or the remote node.
>
> Aaron
>
> On 13 Apr 2011, at 00:57, Jonathan Colby wrote:
>
>> There is no "Repair session" message either. It just starts with a message
>> like:
>>
>> INFO [manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723] 2011-04-10
>> 14:00:59,051 AntiEntropyService.java (line 770) Waiting for repair requests:
>> [#<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723,
>> /10.46.108.101, (DFS,main)>, #<TreeRequest
>> manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.100,
>> (DFS,main)>, #<TreeRequest
>> manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.102,
>> (DFS,main)>, #<TreeRequest
>> manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.101,
>> (DFS,main)>]
>>
>> NETSTATS:
>>
>> Mode: Normal
>> Not sending any streams.
>> Not receiving any streams.
>> Pool Name      Active   Pending   Completed
>> Commands          n/a         0      150846
>> Responses         n/a         0      443183
>>
>> One node in our cluster still has "unreadable rows", where the reads trip up
>> every time for certain sstables (you've probably seen my earlier threads
>> about that). My suspicion is that the bloom filter read on the node with the
>> corrupt sstables never reports back to the repair, which causes it to hang.
>>
>> What would be great is a scrub tool that ignores unreadable/unserializable
>> rows! :)
>>
>> On Apr 12, 2011, at 2:15 PM, aaron morton wrote:
>>
>>> Do you see a message starting with "Repair session" and ending with
>>> "completed successfully"?
>>>
>>> Or do you see any streaming activity using "nodetool netstats"?
>>>
>>> Repair can hang if a neighbour dies and fails to send a requested stream.
>>> It will time out after 24 hours (I think).
>>>
>>> Aaron
>>>
>>> On 12 Apr 2011, at 23:39, Karl Hiramoto wrote:
>>>
>>>> On 12/04/2011 13:31, Jonathan Colby wrote:
>>>>> There are a few other threads related to problems with nodetool repair
>>>>> in 0.7.4. However, I'm not seeing any errors, just never getting a
>>>>> message that the repair completed successfully.
>>>>>
>>>>> In my production and test clusters (with just a few MB of data) the
>>>>> nodetool repair prompt never returns, and the last entry in cassandra.log
>>>>> is always something like:
>>>>>
>>>>> #<TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9,
>>>>> /10.46.108.102, (DFS,main)> completed successfully: 1 outstanding
>>>>>
>>>>> But I don't see a message, even hours later, that the 1 outstanding
>>>>> request "finished successfully".
>>>>>
>>>>> Has anyone else experienced this? These are physical server nodes in
>>>>> local data centers, not EC2.
>>>>>
>>>>
>>>> I've seen this. To fix it, try a "nodetool compact" and then repair.
>>>>
>>>> --
>>>> Karl
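For completeness, a minimal sketch of the compact-then-repair sequence Karl suggests above (nodetool on the PATH and a localhost target are assumptions):

#!/usr/bin/env python
# Sketch of the suggested workaround: major compaction first, then repair.
# Assumptions: nodetool is on the PATH and the node answers on localhost
# with its default JMX settings.
import subprocess

HOST = "localhost"

def nodetool(*args):
    cmd = ["nodetool", "-h", HOST] + list(args)
    print("$ " + " ".join(cmd))
    subprocess.check_call(cmd)

nodetool("compact")   # major compaction, as suggested above
nodetool("repair")    # blocks until the repair session finishes (or hangs)

# While the repair runs, watch for streaming activity from another shell:
#   nodetool -h localhost netstats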