RE: RE: SEVERE Data Corruption Problems

Dan Hendry Fri, 11 Feb 2011 10:18:07 -0800

Here is my chronology:

-          I turned on my current cluster in early December, starting with
about 0.7 RC2 or thereabouts. I ran into a number of problems but was able
to get things more or less ironed out (upgrading to new versions pretty
quickly). Once 0.7.0 was released, I had no problems for about 2 weeks. In
that time (using 0.7.0), I ran frequent compactions but never a repair (not
really needed with my workload: read repair works fine and I use expiring
columns)


-          Sometime mid-January (~21) I hit my first instance of sstable
corruption. I was able to fix things by deleting EVERYTHING related to the
problematic CF on the problematic node (sstables, hints column family, saved
caches, the works). I then ran a repair which failed
(http://www.mail-archive.com/user@cassandra.apache.org/msg09154.html)
possibly due to CASSANDRA-1992 or some such. I was able to delete sstables,
run cleanup and compaction, truncate unimportant CFs and get my service
operating again

-          A week or two ago I created a new CF which turned out to be the
most problematic in terms of SSTable corruption. Since creating this CF via
CLI (again, all nodes running 0.7.0), there have been no repairs run and no
ring changes (moving, bootstrapping or decommissioning). Unless I
misunderstand CASSANDRA-1992, I don't think it is related. I have been
seeing steadily increasing rates of Cassandra exceptions on multiple CFs,
reads failing, compaction failing, hinted handoff failing, inability to
export sstables2json, etc.

 

Now I cant even understand how any of the RC bugs I hit back in the day are
still around given I was running 0.7.0 for weeks and ran multiple full
compactions without issue. Regardless, they could not have affected then new
CF. I have seen no improvement for the one node I put on the 0.7 branch (r
1069600). I am seeing the same behaviour for multiple failed compactions (as
per the bug I filed: https://issues.apache.org/jira/browse/CASSANDRA-2084)
and am still investigating the read exceptions. Jonathan suggested a
completely clean 0.7.1 install which would be a lot of work given I would
have to migrate data at an application level but I would be willing to do so
if I had any confidence it would help.

 

Here is an example of one SSTable on the node running 0.7. Notice that a
bunch of seemingly fine sstables (no problems reading from it that I could
determine but my tests were not rigorous and this happened overnight), which
had to either be created or compacted at least once (given -f-) on 0.7
(with, for example, the CASSANDRA-1992 and CASSANDRA-1999 fixes) gets
compacted to a new SSTable that is completely f'ed. 

 

INFO [CompactionExecutor:1] 2011-02-10 20:25:42,200 CompactionManager.java
(line 373) Compacting
[org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data
/kikmetrics/ProcessingBuckets-f-4311-Data.db'),org.apache.cassandra.io.sstab
le.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-
f-4312-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/li
b/cassandra/data/kikmetrics/ProcessingBuckets-f-4313-Data.db'),org.apache.ca
ssandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/Pr
ocessingBuckets-f-4314-Data.db')]

...

INFO [CompactionExecutor:1] 2011-02-10 20:25:49,908 CompactionIterator.java
(line 135) Compacting large row 555345525f4556454e547c5345434f4e445f3130
(393599737 bytes) incrementally

...

INFO [CompactionExecutor:1] 2011-02-10 20:26:42,989 CompactionManager.java
(line 458) Compacted to
/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-tmp-f-4315-Data.db.
487,085,029 to 477,567,892 (~98% of original) bytes for 12 keys.  Time:
60,788ms.

... 

... this sstable does not appear in the logs unitll:

... 

INFO [CompactionExecutor:1] 2011-02-10 22:51:30,613 CompactionManager.java
(line 373) Compacting
[org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data
/kikmetrics/ProcessingBuckets-e-4295-Data.db'),org.apache.cassandra.io.sstab
le.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-
f-4315-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/li
b/cassandra/data/kikmetrics/ProcessingBuckets-f-4320-Data.db'),org.apache.ca
ssandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/Pr
ocessingBuckets-f-4327-Data.db')]

...

... Note the 'File :' output is a minor change I made (obviously)

...

ERROR [CompactionExecutor:1] 2011-02-10 22:51:49,029
AbstractCassandraDaemon.java (line 114) Fatal exception in thread
Thread[CompactionExecutor:1,1,main]

java.io.IOError: java.io.IOException: File
:/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4315-Data.db

        at
org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdenti
tyIterator.java:85)

        at
org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTa
bleScanner.java:179)

        at
org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTa
bleScanner.java:144)

        at
org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:136)

        at
org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:39)

        at
org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIter
ator.java:284)

        at
org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIt
erator.java:326)

        at
org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIte
rator.java:230)

        at
org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.jav
a:68)

        at
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator
.java:136)

        at
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131
)

        at
org.apache.commons.collections.iterators.FilterIterator.setNextObject(Filter
Iterator.java:183)

        at
org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterat
or.java:94)

        at
org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.jav
a:427)

       at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:123)

        at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:93)

        at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

        at java.util.concurrent.FutureTask.run(FutureTask.java:138)

        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja
va:886)

        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
08)

        at java.lang.Thread.run(Thread.java:662)

Caused by: java.io.IOException: File
:/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4315-Data.db

        ... 21 more

Caused by: java.io.EOFException: attempted to skip 67113066 bytes but only
skipped 27852647

        at
org.apache.cassandra.io.sstable.IndexHelper.skipBloomFilter(IndexHelper.java
:51)

        at
org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdenti
tyIterator.java:69)

        ... 20 more

INFO [CompactionExecutor:1] 2011-02-10 22:51:49,030 CompactionManager.java
(line 373) Compacting
[org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data
/kikmetrics/ProcessingBuckets-e-4295-Data.db'),org.apache.cassandra.io.sstab
le.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-
f-4315-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/li
b/cassandra/data/kikmetrics/ProcessingBuckets-f-4320-Data.db'),org.apache.ca
ssandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/Pr
ocessingBuckets-f-4327-Data.db')]

... etc ...

 

 

 

Now when trying to export ProcessingBuckets-f-4315-Data.db using
sstable2json, I get the following: 

 

Exception in thread "main" java.io.IOError: java.io.EOFException

                at
org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.
java:246)

                at
org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:262)

                at
org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:223)

                at
java.util.concurrent.ConcurrentSkipListMap.buildFromSorted(ConcurrentSkipLis
tMap.java:1493)

                at
java.util.concurrent.ConcurrentSkipListMap.<init>(ConcurrentSkipListMap.java
:1443)

                at
org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:3
66)

                at
org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:3
14)

                at
org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetche
r.getNextBlock(IndexedSliceReader.java:181)

                at
org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(Indexe
dSliceReader.java:120)

                at
org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(Indexe
dSliceReader.java:48)

                at
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator
.java:136)

                at
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131
)

                at
org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableS
liceIterator.java:108)

                at
org.apache.cassandra.tools.SSTableExport.serializeRow(SSTableExport.java:178
)

                at
org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:353)

                at
org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:375)

                at
org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:388)

                at
org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:446)

Caused by: java.io.EOFException

                at
java.io.RandomAccessFile.readByte(RandomAccessFile.java:591)

                at
org.apache.cassandra.utils.ByteBufferUtil.readShortLength(ByteBufferUtil.jav
a:287)

                at
org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil
.java:298)

                at
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:6
6)

                at
org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.
java:242)

                ... 17 more

 

:(

 

Dan

 

From: Aaron Morton [mailto:aa...@thelastpickle.com] 
Sent: February-10-11 20:16
To: dev@cassandra.apache.org
Subject: Re: RE: SEVERE Data Corruption Problems

 

Looks like the bloom filter for the row is corrupted, does it happen for all
reads or just for reads on one row ? After the upgrade to 0.7 (assuming an
0.7 nightly build) did you run anything like nodetool repair ? 

 

Have you tried asking on the #cassandra IRC room to see if their are any
comitters around ? 

Aaron


On 11 Feb, 2011,at 01:18 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:

Upgraded one node to 0.7. Its logging exceptions like mad (thousands per
minute). All like below (which is fairly new to me):

ERROR [ReadStage:721] 2011-02-10 18:13:56,190 AbstractCassandraDaemon.java
(line 114) Fatal exception in thread Threa
d[ReadStage:721,5,main]
java.io.IOError: java.io.EOFException
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNa
mesIterator.java:75)
at
org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(Nam
esQueryFilter.java:59)
at
org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFil
ter.java:80)
at
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilySto
re.java:1275)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.
java:1167)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.
java:1095)
at org.apache.cassandra.db.Table.getRow(Table.java:384)
at
org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadComma
nd.java:60)
at
org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(Stor
ageProxy.java:473)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja
va:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
08)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at
org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSeri
alizer.java:48)
at
org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSeri
alizer.java:30)
at
org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.
java:108)
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableName
sIterator.java:106)
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNa
mesIterator.java:71)
... 12 more

Dan


-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: February-09-11 18:14
To: dev
Subject: Re: SEVERE Data Corruption Problems

Hi Dan,

it would be very useful to test with 0.7 branch instead of 0.7.0 so at
least you're not chasing known and fixed bugs like CASSANDRA-1992.

As you say, there's a lot of people who aren't seeing this, so it
would also be useful if you can provide some kind of test harness
where you can say "point this at a cluster and within a few hours

On Wed, Feb 9, 2011 at 4:31 PM, Dan Hendry <dan.hendry.j...@gmail.com>
wrote:
> I have been having SEVERE data corruption issues with SSTables in my
> cluster, for one CF it was happening almost daily (I have since shut down
> the service using that CF as it was too much work to manage the Cassandra
> errors). At this point, I can't see how it is anything but a Cassandra bug
> yet it's somewhat strange and very scary that I am the only one who seems
to
> be having such serious issues. Most of my data is indexed in two ways so I
> have been able to write a validator which goes through and back fills
> missing data but it's kind of defeating the whole point of Cassandra. The
> only way I have found to deal with issues when they crop up to prevent
nodes
> crashing from repeated failed compactions is delete the SSTable. My
cluster
> is running a slightly modified 0.7.0 version which logs what files errors
> for so that I can stop the node and delete them.
>
>
>
> The problem:
>
> -          Reads, compactions and hinted handoff fail with various
> exceptions (samples shown at the end of this email) which seem to indicate
> sstable corruption.
>
> -          I have seen failed reads/compactions/hinted handoff on 4 out of
4
> nodes (RF=2) for 3 different super column families and 1 standard column
> family (4 out of 11) and just now, the Hints system CF. (if it matters the
> ring has not changed since one CF which has been giving me trouble was
> created). I have check SMART disk info and run various diagnostics and
there
> does not seem to be any hardware issues, plus what are the chances of all
> four nodes having the same hardware problems at the same time when for all
> other purposes, they appear fine?
>
> -          I have added logging which outputs what sstable are causing
> exceptions to be thrown. The corrupt sstables have been both freshly
flushed
> memtables and the output of compaction (ie, 4 sstables which all seem to
be
> fine get compacted to 1 which is then corrupt). It seems that the majority
> of corrupt sstables are post-compacted (vs post-memtable flush).
>
> -          The one CF which was giving me the most problems was heavily
> written to (1000-1500 writes/second continually across the cluster). For
> that cf, was having to deleting 4-6 sstables a day across the cluster (and
> the number was going up, even the number of problems for remaining CFs is
> going up). The other CFs which have had corrupt sstables are also quite
> heavily written to (generally a few hundred writes a second across the
> cluster).
>
> -          Most of the time (5/6 attempts) when this problem occurs,
> sstable2json also fails. I have however, had one case where I was able to
> export the sstable to json, then re-import it at which point I was no
longer
> seeing exceptions.
>
> -          The cluster has been running for a little over 2 months now,
> problem seems to have sprung up in the last 3-4 weeks and seems to be
> steadily getting worse.
>
>
>
> Ultimately, I think I am hitting some subtle race condition somewhere. I
> have been starting to dig into the Cassandra code but I barely know where
to
> start looking. I realize I have not provided nearly enough information to
> easily debug the problem but PLEASE keep your eyes open for possibly racy
or
> buggy code which could cause these sorts of problems. I am willing to
> provided full Cassandra logs and a corrupt SSTable on an individual basis:
> please email me and let me know.
>
>
>
> Here is possibly relevant information and my theories on a possible root
> cause. Again, I know little about the Cassandra code base and have only
> moderate java experience so these theories may be way off base.
>
> -          Strictly speaking, I probably don't have enough memory for my
> workload. I see stop the world gc occurring ~30/day/node, often causing
> Cassandra to hang for 30+ seconds (according to the gc logs). Could there
be
> some java bug where a full gc in the middle of writing or flushing
> (compaction/memtable flush) or doing some other disk based activity causes
> some sort of data corruption?
>
> -          Writes are usually done at ConsistencyLevel ONE with additional
> client side retry logic. Given that I often see consecutive nodes in the
> ring down, could there be some edge condition where dying at just the
right
> time causes parts of mutations/messages to be lost?
>
> -          All of the CFs which have been causing me problems have large
> rows which are compacted incrementally. Could there be some problem with
the
> incremental compaction logic?
>
> -          My cluster has a fairly heavy write load (again, the most
> problematic CF is getting 1500 (w/s)/(RF=2) = 750 writes/second/node).
> Furthermore, it is highly probable that there are timestamp collisions.
> Could there be some issue with timestamp logic (ie, using > instead of >=
or
> some such) during flushes/compaction?
>
> -          Once a node
>
>
>
> Cluster/system information:
>
> -          4 nodes with RF=2
>
> -          Nodes have 8 cores with 24 GB of RAM a piece.
>
> -          2 HDs, 1 for commit log/system, 1 for /var/lib/cassandra/data
>
> -          OS is Ubuntu 10.04 (uname -r = 2.6.32-24-server)
>
> -          Java:
>
> o   java version "1.6.0_22"
>
> o   Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
>
> o   Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
>
> -          Slightly modified (file information in exceptions) version of
> 0.7.0
>
>
>
> The following non-standard cassandra.yaml properties have been changed:
>
> -          commitlog_sync_period_in_ms: 100 (with commitlog_sync:
periodic)
>
> -          disk_access_mode: mmap_index_only
>
> -          concurrent_reads: 12
>
> -          concurrent_writes: 2 (was 32, but I dropped it to 2 to try and
> eliminate any mutation race conditions - did not seem to help)
>
> -          sliced_buffer_size_in_kb: 128
>
> -          in_memory_compaction_limit_in_mb: 50
>
> -          rpc_timeout_in_ms: 15000
>
>
>
> Schema for most problematic CF:
>
> name: DeviceEventsByDevice
>
> column_type: Standard
>
> memtable_throughput_in_mb: 150
>
> memtable_operations_in_millions: 1.5
>
> gc_grace_seconds: 172800
>
> keys_cached: 1000000
>
> rows_cached: 0
>
>
>
> Dan Hendry
>
> (403) 660-2297
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
No virus found in this incoming message.
Checked by AVG - www.avg.com <http://www.avgcom>  
Version: 9.0872 / Virus Database: 271.1.1/3432 - Release Date: 02/09/11
02:34:00

No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 9.0.872 / Virus Database: 271.1.1/3436 - Release Date: 02/11/11
02:34:00

RE: RE: SEVERE Data Corruption Problems

Reply via email to