Hi Dan, it would be very useful to test with the 0.7 branch instead of 0.7.0, so at least you're not chasing known and fixed bugs like CASSANDRA-1992.

As you say, there are a lot of people who aren't seeing this, so it would also be useful if you can provide some kind of test harness where you can say "point this at a cluster and within a few hours you'll see the corruption."
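Even a rough harness along the lines of the sketch below would help. It is only an illustration of the idea, not tested code; the ClusterClient interface is a hypothetical stand-in for whatever Thrift or Hector wrapper the application already uses.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    /**
     * Bare-bones corruption harness: write values that can be recomputed from
     * their keys, then keep re-reading them and report any mismatch.
     */
    public class ReadBackHarness {

        /** Minimal client abstraction -- adapt to the client code already in use. */
        public interface ClusterClient {
            void put(String key, byte[] value) throws Exception;
            byte[] get(String key) throws Exception;
        }

        /** Value is derived from the key, so verification needs no extra state. */
        static byte[] expectedValue(String key) throws Exception {
            return ("value-for-" + key).getBytes("UTF-8");
        }

        public static void run(ClusterClient client, int keys, int passes) throws Exception {
            List<String> written = new ArrayList<String>();
            for (int i = 0; i < keys; i++) {
                String key = "harness-" + i;
                client.put(key, expectedValue(key));
                written.add(key);
            }
            // Re-read everything repeatedly; corruption that only shows up after a
            // memtable flush or compaction is caught on a later pass.
            for (int pass = 0; pass < passes; pass++) {
                int failures = 0;
                for (String key : written) {
                    byte[] actual = client.get(key);
                    if (actual == null || !Arrays.equals(expectedValue(key), actual)) {
                        failures++;
                        System.err.println("pass " + pass + ": mismatch for key " + key);
                    }
                }
                System.out.println("pass " + pass + ": " + failures + " failures out of " + keys);
            }
        }
    }

Running nodetool flush and nodetool compact between passes would exercise the same flush/compaction paths that seem to be producing the corrupt sstables.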
On Wed, Feb 9, 2011 at 4:31 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:
> I have been having SEVERE data corruption issues with SSTables in my cluster; for one CF it was happening almost daily (I have since shut down the service using that CF as it was too much work to manage the Cassandra errors). At this point, I can’t see how it is anything but a Cassandra bug, yet it’s somewhat strange and very scary that I am the only one who seems to be having such serious issues. Most of my data is indexed in two ways so I have been able to write a validator which goes through and back-fills missing data, but it’s kind of defeating the whole point of Cassandra. The only way I have found to deal with issues when they crop up, to prevent nodes crashing from repeated failed compactions, is to delete the SSTable. My cluster is running a slightly modified 0.7.0 version which logs which files the errors are for, so that I can stop the node and delete them.
>
> The problem:
>
> - Reads, compactions and hinted handoff fail with various exceptions (samples shown at the end of this email) which seem to indicate sstable corruption.
>
> - I have seen failed reads/compactions/hinted handoff on 4 out of 4 nodes (RF=2) for 3 different super column families and 1 standard column family (4 out of 11) and, just now, the Hints system CF (if it matters, the ring has not changed since one CF which has been giving me trouble was created). I have checked SMART disk info and run various diagnostics and there do not seem to be any hardware issues; plus, what are the chances of all four nodes having the same hardware problems at the same time when, for all other purposes, they appear fine?
>
> - I have added logging which outputs which sstables are causing exceptions to be thrown. The corrupt sstables have been both freshly flushed memtables and the output of compaction (ie, 4 sstables which all seem to be fine get compacted to 1 which is then corrupt). It seems that the majority of corrupt sstables are post-compaction (vs post-memtable flush).
>
> - The one CF which was giving me the most problems was heavily written to (1000-1500 writes/second continually across the cluster). For that CF, I was having to delete 4-6 sstables a day across the cluster (and the number was going up; even the number of problems for the remaining CFs is going up). The other CFs which have had corrupt sstables are also quite heavily written to (generally a few hundred writes a second across the cluster).
>
> - Most of the time (5/6 attempts) when this problem occurs, sstable2json also fails. I have, however, had one case where I was able to export the sstable to json, then re-import it, at which point I was no longer seeing exceptions.
>
> - The cluster has been running for a little over 2 months now; the problem seems to have sprung up in the last 3-4 weeks and seems to be steadily getting worse.
>
> Ultimately, I think I am hitting some subtle race condition somewhere. I have been starting to dig into the Cassandra code but I barely know where to start looking. I realize I have not provided nearly enough information to easily debug the problem but PLEASE keep your eyes open for possibly racy or buggy code which could cause these sorts of problems. I am willing to provide full Cassandra logs and a corrupt SSTable on an individual basis: please email me and let me know.
>
> Here is possibly relevant information and my theories on a possible root cause. Again, I know little about the Cassandra code base and have only moderate Java experience so these theories may be way off base.
>
> - Strictly speaking, I probably don’t have enough memory for my workload. I see stop-the-world GC occurring ~30/day/node, often causing Cassandra to hang for 30+ seconds (according to the GC logs). Could there be some Java bug where a full GC in the middle of writing or flushing (compaction/memtable flush) or doing some other disk-based activity causes some sort of data corruption?
>
> - Writes are usually done at ConsistencyLevel ONE with additional client-side retry logic. Given that I often see consecutive nodes in the ring down, could there be some edge condition where dying at just the right time causes parts of mutations/messages to be lost?
>
> - All of the CFs which have been causing me problems have large rows which are compacted incrementally. Could there be some problem with the incremental compaction logic?
>
> - My cluster has a fairly heavy write load (again, the most problematic CF is getting 1500 (w/s)/(RF=2) = 750 writes/second/node). Furthermore, it is highly probable that there are timestamp collisions. Could there be some issue with timestamp logic (ie, using > instead of >= or some such) during flushes/compaction?
>
> - Once a node
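On the timestamp-collision theory above: equal timestamps by themselves would be more likely to cause surprising overwrites than unreadable sstables, since last-write-wins reconciliation normally breaks ties deterministically. The sketch below is only an illustration of that idea, not Cassandra's actual reconcile code.

    /**
     * Illustration only (not Cassandra's code): last-write-wins reconciliation
     * where equal timestamps are broken deterministically by comparing values,
     * so two replicas reconciling the same pair always pick the same winner.
     */
    public final class ColumnReconciler {

        public static final class Column {
            final long timestamp;
            final byte[] value;

            public Column(long timestamp, byte[] value) {
                this.timestamp = timestamp;
                this.value = value;
            }
        }

        public static Column reconcile(Column left, Column right) {
            if (left.timestamp > right.timestamp) return left;
            if (left.timestamp < right.timestamp) return right;
            // Equal timestamps: without a deterministic tiebreaker the winner
            // would depend on comparison order and replicas could diverge.
            return compareUnsignedBytes(left.value, right.value) >= 0 ? left : right;
        }

        private static int compareUnsignedBytes(byte[] a, byte[] b) {
            int len = Math.min(a.length, b.length);
            for (int i = 0; i < len; i++) {
                int cmp = (a[i] & 0xff) - (b[i] & 0xff);
                if (cmp != 0) return cmp;
            }
            return a.length - b.length;
        }
    }

A collision handled this way loses one of the two writes, but it should not by itself produce an sstable that fails to deserialize.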
> Cluster/system information:
>
> - 4 nodes with RF=2
> - Nodes have 8 cores with 24 GB of RAM apiece.
> - 2 HDs: 1 for commit log/system, 1 for /var/lib/cassandra/data
> - OS is Ubuntu 10.04 (uname -r = 2.6.32-24-server)
> - Java:
>   o java version "1.6.0_22"
>   o Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
>   o Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> - Slightly modified (file information in exceptions) version of 0.7.0
>
> The following non-standard cassandra.yaml properties have been changed:
>
> - commitlog_sync_period_in_ms: 100 (with commitlog_sync: periodic)
> - disk_access_mode: mmap_index_only
> - concurrent_reads: 12
> - concurrent_writes: 2 (was 32, but I dropped it to 2 to try and eliminate any mutation race conditions – did not seem to help)
> - sliced_buffer_size_in_kb: 128
> - in_memory_compaction_limit_in_mb: 50
> - rpc_timeout_in_ms: 15000
>
> Schema for most problematic CF:
>
> name: DeviceEventsByDevice
> column_type: Standard
> memtable_throughput_in_mb: 150
> memtable_operations_in_millions: 1.5
> gc_grace_seconds: 172800
> keys_cached: 1000000
> rows_cached: 0
>
> Dan Hendry
> (403) 660-2297

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com