Here is my chronology:

- I turned on my current cluster in early December, starting with roughly 0.7 RC2. I ran into a number of problems but was able to get things more or less ironed out (upgrading to new versions pretty quickly). Once 0.7.0 was released, I had no problems for about two weeks. In that time (using 0.7.0), I ran frequent compactions but never a repair; repair is not really needed with my workload, since read repair works fine and I use expiring columns (see the sketch after this list).

- Sometime mid-January (~the 21st) I hit my first instance of sstable corruption. I was able to fix things by deleting EVERYTHING related to the problematic CF on the problematic node (sstables, hints column family, saved caches, the works). I then ran a repair, which failed (http://www.mail-archive.com/user@cassandra.apache.org/msg09154.html), possibly due to CASSANDRA-1992 or some such. I was able to delete sstables, run cleanup and compaction, and truncate unimportant CFs to get my service operating again.

- A week or two ago I created a new CF, which has turned out to be the most problematic in terms of sstable corruption. Since creating this CF via the CLI (again, all nodes running 0.7.0), there have been no repairs run and no ring changes (moving, bootstrapping or decommissioning).
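By "expiring columns" I mean that every write carries a TTL, so stale data ages out on its own rather than needing anti-entropy repair. Purely as an illustration, a 0.7 Thrift write with a TTL looks something like the following (written from memory of the 0.7 API; the keyspace and CF names are the real ones from this thread, the row/column names are made up, and details such as framed transport may not match our actual client setup):

    import java.nio.ByteBuffer;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class ExpiringWriteSketch
    {
        public static void main(String[] args) throws Exception
        {
            TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("kikmetrics");

            // Column(name, value, timestamp); timestamps are microseconds by convention
            Column column = new Column(ByteBuffer.wrap("event".getBytes("UTF-8")),
                                       ByteBuffer.wrap("payload".getBytes("UTF-8")),
                                       System.currentTimeMillis() * 1000);
            column.setTtl(7 * 24 * 60 * 60); // expire after a week, no explicit delete needed

            client.insert(ByteBuffer.wrap("some-device-id".getBytes("UTF-8")),
                          new ColumnParent("DeviceEventsByDevice"),
                          column,
                          ConsistencyLevel.ONE); // we write at CL.ONE with client-side retries
            transport.close();
        }
    }

Since everything is written this way, data simply expires, and anti-entropy repair buys us very little on top of read repair.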
Unless I misunderstand CASSANDRA-1992, I don't think it is related to this new CF's corruption. I have been seeing steadily increasing rates of Cassandra exceptions on multiple CFs: reads failing, compaction failing, hinted handoff failing, inability to export with sstable2json, etc. I can't even understand how any of the RC bugs I hit back in the day could still be around, given that I ran 0.7.0 for weeks with multiple full compactions and no issues; regardless, they could not have affected the then-new CF.

I have seen no improvement on the one node I put on the 0.7 branch (r1069600). I am seeing the same behaviour for multiple failed compactions (as per the bug I filed: https://issues.apache.org/jira/browse/CASSANDRA-2084) and am still investigating the read exceptions. Jonathan suggested a completely clean 0.7.1 install, which would be a lot of work given that I would have to migrate data at an application level, but I would be willing to do so if I had any confidence it would help.

Here is an example of one sstable on the node running the 0.7 branch. Notice that a set of seemingly fine sstables (no problems reading from them that I could determine, though my tests were not rigorous and this happened overnight), each of which had to have been created or compacted at least once on 0.7 given the -f- in their names (that is, with the CASSANDRA-1992 and CASSANDRA-1999 fixes in place), gets compacted into a new sstable that is completely f'ed.

INFO [CompactionExecutor:1] 2011-02-10 20:25:42,200 CompactionManager.java (line 373) Compacting [org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4311-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4312-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4313-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4314-Data.db')]
...
INFO [CompactionExecutor:1] 2011-02-10 20:25:49,908 CompactionIterator.java (line 135) Compacting large row 555345525f4556454e547c5345434f4e445f3130 (393599737 bytes) incrementally
...
INFO [CompactionExecutor:1] 2011-02-10 20:26:42,989 CompactionManager.java (line 458) Compacted to /var/lib/cassandra/data/kikmetrics/ProcessingBuckets-tmp-f-4315-Data.db. 487,085,029 to 477,567,892 (~98% of original) bytes for 12 keys. Time: 60,788ms.
...
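That "Compacting large row ... incrementally" line is the in-memory compaction limit kicking in: rows larger than in_memory_compaction_limit_in_mb (50 in our cassandra.yaml, dumped at the bottom of this thread) are not materialized in memory but merged piece by piece (LazilyCompactedRow, if I am reading the 0.7 source right). A trivial sketch of the decision, my paraphrase rather than Cassandra's code:

    public class IncrementalCompactionSketch
    {
        // in_memory_compaction_limit_in_mb: 50, from our cassandra.yaml
        static final long IN_MEMORY_LIMIT_BYTES = 50L * 1024 * 1024;

        // Rows over the limit take the incremental (lazy) compaction path,
        // which is the path every one of our corrupting CFs exercises.
        static boolean compactIncrementally(long rowSizeBytes)
        {
            return rowSizeBytes > IN_MEMORY_LIMIT_BYTES;
        }

        public static void main(String[] args)
        {
            long rowSize = 393599737L; // the ~375 MB row from the log above
            System.out.println("incremental? " + compactIncrementally(rowSize));
        }
    }

Every CF that has corrupted on us has rows over this limit, which is part of why I am suspicious of the incremental path (see my theory list further down).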
The new sstable (f-4315) does not appear in the logs again until:

INFO [CompactionExecutor:1] 2011-02-10 22:51:30,613 CompactionManager.java (line 373) Compacting [org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-e-4295-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4315-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4320-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4327-Data.db')]
...

(Note: the "File :" output below is a minor change I made, obviously.)

ERROR [CompactionExecutor:1] 2011-02-10 22:51:49,029 AbstractCassandraDaemon.java (line 114) Fatal exception in thread Thread[CompactionExecutor:1,1,main]
java.io.IOError: java.io.IOException: File :/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4315-Data.db
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:85)
        at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:179)
        at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:144)
        at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:136)
        at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:39)
        at org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:284)
        at org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326)
        at org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:68)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
        at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
        at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:427)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:123)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:93)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: File :/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4315-Data.db
        ... 21 more
Caused by: java.io.EOFException: attempted to skip 67113066 bytes but only skipped 27852647
        at org.apache.cassandra.io.sstable.IndexHelper.skipBloomFilter(IndexHelper.java:51)
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:69)
        ... 20 more
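To make that EOFException concrete: as far as I can tell from the 0.7 source, the row-level bloom filter is serialized with an int length prefix, and skipping it just means reading the prefix and skipping that many bytes. A simplified sketch of the logic, paraphrased from memory rather than copied from IndexHelper:

    import java.io.DataInput;
    import java.io.EOFException;
    import java.io.IOException;

    public class SkipBloomFilterSketch
    {
        // Paraphrase of IndexHelper.skipBloomFilter: the serialized filter is
        // preceded by a 4-byte length, so skipping it means reading that length
        // and asking the stream to skip that many bytes.
        public static void skipBloomFilter(DataInput in) throws IOException
        {
            int size = in.readInt();          // length prefix written at serialization time
            int skipped = in.skipBytes(size); // returns how many bytes were actually skipped
            if (skipped != size)
                throw new EOFException("attempted to skip " + size
                                       + " bytes but only skipped " + skipped);
        }
    }

So "attempted to skip 67113066 bytes" means the length prefix (or the data around it) is already garbage by the time the filter is reached: the reader asked to jump ~64 MB forward and ran out of bytes after ~27 MB. After the error, the node immediately kicks off the same compaction again: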
INFO [CompactionExecutor:1] 2011-02-10 22:51:49,030 CompactionManager.java (line 373) Compacting [org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-e-4295-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4315-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4320-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/kikmetrics/ProcessingBuckets-f-4327-Data.db')]
... etc ...

Now, when trying to export ProcessingBuckets-f-4315-Data.db using sstable2json, I get the following:

Exception in thread "main" java.io.IOError: java.io.EOFException
        at org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.java:246)
        at org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:262)
        at org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:223)
        at java.util.concurrent.ConcurrentSkipListMap.buildFromSorted(ConcurrentSkipListMap.java:1493)
        at java.util.concurrent.ConcurrentSkipListMap.<init>(ConcurrentSkipListMap.java:1443)
        at org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:366)
        at org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:314)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:181)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:120)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:48)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:108)
        at org.apache.cassandra.tools.SSTableExport.serializeRow(SSTableExport.java:178)
        at org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:353)
        at org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:375)
        at org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:388)
        at org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:446)
Caused by: java.io.EOFException
        at java.io.RandomAccessFile.readByte(RandomAccessFile.java:591)
        at org.apache.cassandra.utils.ByteBufferUtil.readShortLength(ByteBufferUtil.java:287)
        at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:298)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:66)
        at org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.java:242)
        ... 17 more
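The EOF at the bottom is the same length-prefix pattern again: columns are serialized with a two-byte length, and deserialization trusts it. Roughly what ByteBufferUtil.readWithShortLength amounts to (again my simplified paraphrase, not the actual 0.7 code):

    import java.io.DataInput;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    public class ReadWithShortLengthSketch
    {
        // Paraphrase of ByteBufferUtil.readShortLength/readWithShortLength:
        // read an unsigned 2-byte length, then that many bytes of payload.
        public static ByteBuffer readWithShortLength(DataInput in) throws IOException
        {
            int length = in.readUnsignedShort(); // bogus if the file is corrupt
            byte[] bytes = new byte[length];
            in.readFully(bytes); // hits EOF mid-column on a truncated/corrupt sstable
            return ByteBuffer.wrap(bytes);
        }
    }

Once a single length field in the row is wrong, every subsequent read is misaligned, which is consistent with both the compaction failure and the export failure on the same file.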
:(

Dan

From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: February-10-11 20:16
To: dev@cassandra.apache.org
Subject: Re: RE: SEVERE Data Corruption Problems

Looks like the bloom filter for the row is corrupted. Does it happen for all reads, or just for reads on one row? After the upgrade to 0.7 (assuming a 0.7 nightly build), did you run anything like nodetool repair? Have you tried asking on the #cassandra IRC channel to see if there are any committers around?

Aaron

On 11 Feb, 2011, at 01:18 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:

Upgraded one node to 0.7. It's logging exceptions like mad (thousands per minute), all like the one below (which is fairly new to me):

ERROR [ReadStage:721] 2011-02-10 18:13:56,190 AbstractCassandraDaemon.java (line 114) Fatal exception in thread Thread[ReadStage:721,5,main]
java.io.IOError: java.io.EOFException
        at org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:75)
        at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
        at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1275)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1167)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1095)
        at org.apache.cassandra.db.Table.getRow(Table.java:384)
        at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
        at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:473)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:48)
        at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:30)
        at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:108)
        at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:106)
        at org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:71)
        ... 12 more

Dan

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: February-09-11 18:14
To: dev
Subject: Re: SEVERE Data Corruption Problems

Hi Dan, it would be very useful to test with the 0.7 branch instead of 0.7.0, so at least you're not chasing known and fixed bugs like CASSANDRA-1992. As you say, there's a lot of people who aren't seeing this, so it would also be useful if you can provide some kind of test harness where you can say "point this at a cluster and within a few hours

On Wed, Feb 9, 2011 at 4:31 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:
> I have been having SEVERE data corruption issues with SSTables in my
> cluster; for one CF it was happening almost daily (I have since shut down
> the service using that CF, as it was too much work to manage the Cassandra
> errors). At this point, I can't see how it is anything but a Cassandra bug,
> yet it's somewhat strange and very scary that I am the only one who seems
> to be having such serious issues. Most of my data is indexed in two ways, so
> I have been able to write a validator which goes through and back-fills
> missing data, but it's kind of defeating the whole point of Cassandra. The
> only way I have found to deal with issues when they crop up, and to prevent
> nodes crashing from repeated failed compactions, is to delete the sstable.
> My cluster is running a slightly modified 0.7.0 which logs which file an
> error occurred in, so that I can stop the node and delete it.
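> The modification amounts to wrapping the low-level failure so the log names
> the offending sstable; illustratively, it has the following shape (the idea,
> not the literal patch):
>
>     import java.io.IOError;
>     import java.io.IOException;
>
>     public class NamedFileErrors
>     {
>         // Wrap a low-level failure so the log shows which sstable broke;
>         // this produces the "File :<path>" lines in the traces above.
>         public static IOError wrapWithFilename(String filename, IOException cause)
>         {
>             return new IOError(new IOException("File :" + filename, cause));
>         }
>     }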
>
> The problem:
>
> - Reads, compactions and hinted handoff fail with various exceptions
> (samples shown at the end of this email) which seem to indicate sstable
> corruption.
>
> - I have seen failed reads/compactions/hinted handoff on 4 out of 4 nodes
> (RF=2) for 3 different super column families and 1 standard column family
> (4 out of 11), and just now the Hints system CF. (If it matters, the ring
> has not changed since one CF which has been giving me trouble was created.)
> I have checked SMART disk info and run various diagnostics and there do not
> seem to be any hardware issues; besides, what are the chances of all four
> nodes having the same hardware problems at the same time when for all other
> purposes they appear fine?
>
> - I have added logging which outputs which sstables are causing exceptions
> to be thrown. The corrupt sstables have been both freshly flushed memtables
> and the output of compaction (i.e., 4 sstables which all seem to be fine get
> compacted to 1 which is then corrupt). It seems that the majority of corrupt
> sstables are post-compaction (vs. post-memtable-flush).
>
> - The one CF which was giving me the most problems was heavily written to
> (1000-1500 writes/second continually across the cluster). For that CF, I was
> having to delete 4-6 sstables a day across the cluster (and the number was
> going up; even the number of problems for the remaining CFs is going up).
> The other CFs which have had corrupt sstables are also quite heavily written
> to (generally a few hundred writes a second across the cluster).
>
> - Most of the time (5/6 attempts) when this problem occurs, sstable2json
> also fails. I have, however, had one case where I was able to export the
> sstable to JSON, then re-import it, at which point I was no longer seeing
> exceptions.
>
> - The cluster has been running for a little over 2 months now; the problem
> seems to have sprung up in the last 3-4 weeks and seems to be steadily
> getting worse.
>
> Ultimately, I think I am hitting some subtle race condition somewhere. I
> have been starting to dig into the Cassandra code but I barely know where
> to start looking. I realize I have not provided nearly enough information
> to easily debug the problem, but PLEASE keep your eyes open for possibly
> racy or buggy code which could cause these sorts of problems. I am willing
> to provide full Cassandra logs and a corrupt sstable on an individual
> basis: please email me and let me know.
>
> Here is possibly relevant information, along with my theories on a possible
> root cause. Again, I know little about the Cassandra code base and have only
> moderate Java experience, so these theories may be way off base.
>
> - Strictly speaking, I probably don't have enough memory for my workload.
> I see stop-the-world GC occurring ~30 times/day/node, often causing
> Cassandra to hang for 30+ seconds (according to the GC logs). Could there
> be some Java bug where a full GC in the middle of writing or flushing
> (compaction/memtable flush) or doing some other disk-based activity causes
> some sort of data corruption?
>
> - Writes are usually done at ConsistencyLevel ONE with additional
> client-side retry logic. Given that I often see consecutive nodes in the
> ring down, could there be some edge condition where dying at just the right
> time causes parts of mutations/messages to be lost?
>
> - All of the CFs which have been causing me problems have large rows which
> are compacted incrementally. Could there be some problem with the
> incremental compaction logic?
>
> - My cluster has a fairly heavy write load (again, the most problematic CF
> is getting 1500 (w/s)/(RF=2) = 750 writes/second/node). Furthermore, it is
> highly probable that there are timestamp collisions. Could there be some
> issue with timestamp logic (i.e., using > instead of >= or some such)
> during flushes/compaction? (See the sketch just after this list.)
>
> - Once a node
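> To make the timestamp worry concrete, here is a toy model of reconciliation
> (NOT Cassandra's actual code; if I read the 0.7 source right, timestamp ties
> are broken on the value bytes, and the point is only that ties must be
> broken deterministically on every replica and every compaction pass):
>
>     import java.nio.ByteBuffer;
>
>     public class ReconcileSketch
>     {
>         static final class Col
>         {
>             final ByteBuffer value;
>             final long timestamp;
>             Col(ByteBuffer value, long timestamp) { this.value = value; this.timestamp = timestamp; }
>         }
>
>         // With distinct timestamps the winner is unambiguous. On a tie, an
>         // inconsistent comparison (> on one code path, >= on another) would
>         // let two nodes, or two compactions, keep different columns.
>         static Col reconcile(Col left, Col right)
>         {
>             if (left.timestamp != right.timestamp)
>                 return left.timestamp > right.timestamp ? left : right;
>             // timestamps collide: fall back to a deterministic value comparison
>             return left.value.compareTo(right.value) >= 0 ? left : right;
>         }
>     }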
>
> Cluster/system information:
>
> - 4 nodes with RF=2
> - Nodes have 8 cores with 24 GB of RAM apiece
> - 2 HDs: 1 for commit log/system, 1 for /var/lib/cassandra/data
> - OS is Ubuntu 10.04 (uname -r = 2.6.32-24-server)
> - Java:
>   o java version "1.6.0_22"
>   o Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
>   o Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> - Slightly modified (file information in exceptions) version of 0.7.0
>
> The following non-standard cassandra.yaml properties have been changed:
>
> - commitlog_sync_period_in_ms: 100 (with commitlog_sync: periodic)
> - disk_access_mode: mmap_index_only
> - concurrent_reads: 12
> - concurrent_writes: 2 (was 32, but I dropped it to 2 to try and eliminate
>   any mutation race conditions; it did not seem to help)
> - sliced_buffer_size_in_kb: 128
> - in_memory_compaction_limit_in_mb: 50
> - rpc_timeout_in_ms: 15000
>
> Schema for the most problematic CF:
>
> name: DeviceEventsByDevice
> column_type: Standard
> memtable_throughput_in_mb: 150
> memtable_operations_in_millions: 1.5
> gc_grace_seconds: 172800
> keys_cached: 1000000
> rows_cached: 0
>
> Dan Hendry
> (403) 660-2297

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com