Possibly losing data with corrupted SSTables

Francisco Nogueira Calmon Sobral Wed, 29 Jan 2014 05:12:15 -0800

Dear experts,

We are facing an annoying problem in our cluster.


We have 9 amazon extra large linux nodes, running Cassandra 1.2.11.

The short story is that after moving the data from one cluster to another, 
we've been unable to run 'nodetool repair'. It gets stuck due to a 
CorruptSSTableException on some nodes and CFs. After looking at some 
problematic CFs, we observed that some of their SSTable files have root 
permissions instead of cassandra permissions. Their names also differ from 
the 'good' ones, as we can see below:

BAD
------
-rw-r--r-- 8 cassandra cassandra 991M Nov  8 15:11 Sessions-Users-ib-2516-Data.db
-rw-r--r-- 8 cassandra cassandra 703M Nov  8 15:11 Sessions-Users-ib-2516-Index.db
-rw-r--r-- 8 cassandra cassandra 5.3M Nov 13 11:42 Sessions-Users-ib-2516-Summary.db

GOOD
---------
-rw-r--r-- 1 cassandra cassandra  22K Jan 15 10:50 Sessions-Users-ic-2933-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 106M Jan 15 10:50 Sessions-Users-ic-2933-Data.db
-rw-r--r-- 1 cassandra cassandra 2.2M Jan 15 10:50 Sessions-Users-ic-2933-Filter.db
-rw-r--r-- 1 cassandra cassandra  76M Jan 15 10:50 Sessions-Users-ic-2933-Index.db
-rw-r--r-- 1 cassandra cassandra 4.3K Jan 15 10:50 Sessions-Users-ic-2933-Statistics.db
-rw-r--r-- 1 cassandra cassandra 574K Jan 15 10:50 Sessions-Users-ic-2933-Summary.db
-rw-r--r-- 1 cassandra cassandra   79 Jan 15 10:50 Sessions-Users-ic-2933-TOC.txt
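
To find every SSTable that looks like the BAD one rather than spotting them by eye, a small script can group file names by version and generation and compare the component sets. The following is only a sketch: the name layout (keyspace-CF-version-generation-component) is inferred from the listing above, and the function names `group_sstables` and `find_odd_sstables` are hypothetical, not part of any Cassandra tool.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the listing above:
#   <keyspace>-<column family>-<version>-<generation>-<component>.<db|txt>
SSTABLE_NAME = re.compile(
    r"^(?P<keyspace>[^-]+)-(?P<cf>[^-]+)-(?P<version>[a-z]+)-"
    r"(?P<generation>\d+)-(?P<component>[A-Za-z]+)\.(?:db|txt)$"
)


def group_sstables(filenames):
    """Map (version, generation) -> set of component names found on disk."""
    groups = defaultdict(set)
    for name in filenames:
        m = SSTABLE_NAME.match(name)
        if m is None:
            continue  # does not follow the assumed layout, skip it
        groups[(m["version"], int(m["generation"]))].add(m["component"])
    return dict(groups)


def find_odd_sstables(groups, expected_version, reference):
    """List SSTables whose version differs from expected_version or whose
    component set lacks something present in the reference set."""
    odd = []
    for (version, generation), components in sorted(groups.items()):
        missing = sorted(set(reference) - components)
        if version != expected_version or missing:
            odd.append((version, generation, missing))
    return odd
```

With the file names from the listing, using the GOOD SSTable's components as the reference, the ib-2516 files are reported as lacking CompressionInfo, Filter, Statistics and TOC. Whether a missing component is actually a problem depends on the table's settings, so the output is a list of candidates to inspect, not a verdict.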


We changed the permissions back to 'cassandra' and ran 'nodetool scrub' on this 
problematic CF, but it has been running for at least two weeks (it is not 
frozen) and keeps logging many WARNs while working on the above-mentioned 
SSTable:

WARN [CompactionExecutor:15] 2014-01-28 17:01:22,571 OutputHandler.java (line 57) Non-fatal error reading row (stacktrace follows)
java.io.IOError: java.io.IOException: Impossible row size 3618452438597849419
        at org.apache.cassandra.db.compaction.Scrubber.scrub(Scrubber.java:171)
        at org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:526)
        at org.apache.cassandra.db.compaction.CompactionManager.doScrub(CompactionManager.java:515)
        at org.apache.cassandra.db.compaction.CompactionManager.access$400(CompactionManager.java:70)
        at org.apache.cassandra.db.compaction.CompactionManager$3.perform(CompactionManager.java:280)
        at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:250)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Impossible row size 3618452438597849419
        ... 10 more


1) I do not think that deleting all the data on one node and running 'nodetool 
rebuild' will work, since we observed that this problem occurs on all nodes. So 
we may not be able to restore all the data. What can be done in this case?

2) Why do some sstables have 'root' permissions? Is this problem caused by 
our manual migration of data? (see long story below)


How did we run into this?

The long story is that we tried to move our cluster with sstableloader, but 
it was unable to load all the data correctly. Our solution was to put ALL 
cluster data into EACH new node and run 'nodetool refresh'. I performed this 
task for each node and each column family sequentially. Sometimes I had to 
rename some sstables, because they came from different nodes with the same 
name. I don't remember if I ran 'nodetool repair' or even 'nodetool cleanup' 
on each node. Apparently, the process was successful, and (almost) all the data 
was moved.
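
For reference, the renaming step can be sketched as follows. This is an illustration only, not the procedure that was actually run here: it assumes that all components of one sstable must end up with the same new generation number, and that changing the generation in the file name is enough. The function `plan_renames` is hypothetical and only computes a mapping; it does not touch any file.

```python
import re

# Assumed layout: <keyspace>-<cf>-<version>-<generation>-<component>.<ext>
NAME = re.compile(
    r"^(?P<prefix>.+-[a-z]+)-(?P<generation>\d+)-"
    r"(?P<rest>[A-Za-z]+\.(?:db|txt))$"
)


def plan_renames(incoming, existing):
    """Return {old_name: new_name} for incoming files whose
    (prefix, generation) is already taken by an existing file."""
    used = set()
    for name in existing:
        m = NAME.match(name)
        if m:
            used.add((m["prefix"], int(m["generation"])))

    # Group incoming components by the sstable they belong to.
    by_sstable = {}
    for name in incoming:
        m = NAME.match(name)
        if m:
            key = (m["prefix"], int(m["generation"]))
            by_sstable.setdefault(key, []).append((name, m["rest"]))

    colliding = [key for key in sorted(by_sstable) if key in used]
    used.update(by_sstable)  # non-colliding incoming sstables keep their names

    renames = {}
    for prefix, generation in colliding:
        new_generation = generation
        while (prefix, new_generation) in used:
            new_generation += 1  # first free generation for this prefix
        used.add((prefix, new_generation))
        # Every component of the sstable moves to the same new generation.
        for name, rest in by_sstable[(prefix, generation)]:
            renames[name] = f"{prefix}-{new_generation}-{rest}"
    return renames
```

Computing the plan first and reviewing it before moving anything makes it easier to notice a Data.db file that would be renamed without its matching Index.db.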

Unfortunately, 3 months after the move, I am unable to perform read 
operations on some keys of some CFs. I think that some of these keys belong to 
the above-mentioned sstables.

Any insights are welcome.

Best regards,
Francisco Sobral
