Hi Dan,

It would be very useful to test with the 0.7 branch instead of 0.7.0, so at
least you're not chasing known and fixed bugs like CASSANDRA-1992.

As you say, there are a lot of people who aren't seeing this, so it
would also be useful if you could provide some kind of test harness
where you can say "point this at a cluster and within a few hours
you'll see the corruption."

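Something along these lines would do; this is only a rough sketch using
pycassa, and the keyspace/CF names, column layout, and rates are
placeholders rather than anything from your actual schema:

import uuid

import pycassa

# Placeholder keyspace/CF: point this at a throwaway CF on the real cluster.
pool = pycassa.ConnectionPool('StressKS', ['node1:9160', 'node2:9160'])
cf = pycassa.ColumnFamily(pool, 'StressCF')   # pycassa defaults to CL.ONE

written = {}  # key -> columns we expect to be able to read back

def write_batch(n=1000):
    for _ in xrange(n):
        key = uuid.uuid1().hex
        cols = dict(('c%d' % i, 'v%d' % i) for i in xrange(20))
        cf.insert(key, cols)        # client-side retry logic could wrap this
        written[key] = cols

def verify():
    bad = 0
    for key, cols in written.iteritems():
        try:
            if dict(cf.get(key, column_count=len(cols))) != cols:
                bad += 1
        except Exception, e:        # missing row, timeout, corrupt-sstable read
            print 'read failed for %s: %s' % (key, e)
            bad += 1
    return bad

while True:
    write_batch()
    # Re-reads everything written so far; fine for a few hours of soak testing.
    print '%d rows written, %d bad reads' % (len(written), verify())

Run a copy of that per client box against a scratch keyspace and see whether
bad reads (or failed compactions in the server logs) start showing up within
a few hours.
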
On Wed, Feb 9, 2011 at 4:31 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:
> I have been having SEVERE data corruption issues with SSTables in my
> cluster; for one CF it was happening almost daily (I have since shut down
> the service using that CF, as it was too much work to manage the Cassandra
> errors). At this point, I can’t see how it is anything but a Cassandra bug,
> yet it’s somewhat strange and very scary that I seem to be the only one who
> is having such serious issues. Most of my data is indexed in two ways, so I
> have been able to write a validator which goes through and back-fills
> missing data, but that kind of defeats the whole point of Cassandra. The
> only way I have found to deal with issues when they crop up, and to prevent
> nodes crashing from repeated failed compactions, is to delete the offending
> SSTable. My cluster is running a slightly modified 0.7.0 which logs which
> files the errors occur for, so that I can stop the node and delete them.
>
>
>
> The problem:
>
> -          Reads, compactions and hinted handoff fail with various
> exceptions (samples shown at the end of this email) which seem to indicate
> sstable corruption.
>
> -          I have seen failed reads/compactions/hinted handoff on 4 out of 4
> nodes (RF=2) for 3 different super column families and 1 standard column
> family (4 out of 11), and just now the Hints system CF. (If it matters, the
> ring has not changed since one CF which has been giving me trouble was
> created.) I have checked SMART disk info and run various diagnostics and
> there do not seem to be any hardware issues; besides, what are the chances
> of all four nodes having the same hardware problem at the same time when,
> for all other purposes, they appear fine?
>
> -          I have added logging which outputs which sstables are causing
> exceptions to be thrown. The corrupt sstables have been both freshly flushed
> memtables and the output of compaction (i.e., 4 sstables which all seem to
> be fine get compacted into 1 which is then corrupt). It seems that the
> majority of corrupt sstables are post-compaction (vs post-memtable-flush).
>
> -          The one CF which was giving me the most problems was heavily
> written to (1000-1500 writes/second continually across the cluster). For
> that CF, I was having to delete 4-6 sstables a day across the cluster (and
> the number was going up; even the number of problems for the remaining CFs
> is going up). The other CFs which have had corrupt sstables are also quite
> heavily written to (generally a few hundred writes a second across the
> cluster).
>
> -          Most of the time (5 of 6 attempts) when this problem occurs,
> sstable2json also fails. I have, however, had one case where I was able to
> export the sstable to json and then re-import it, at which point I was no
> longer seeing exceptions (the export/re-import commands are sketched after
> this list).
>
> -          The cluster has been running for a little over 2 months now; the
> problem seems to have sprung up in the last 3-4 weeks and is steadily
> getting worse.
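>
> For reference, the export/re-import cycle mentioned above was roughly the
> following (the paths and sstable generation numbers are placeholders, not
> my actual file names), run with the node stopped:
>
> sstable2json /var/lib/cassandra/data/MyKeyspace/MyCF-f-123-Data.db > MyCF.json
> json2sstable -K MyKeyspace -c MyCF MyCF.json /var/lib/cassandra/data/MyKeyspace/MyCF-f-124-Data.db
>
> with the old -f-123- files moved aside before restarting the node.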
>
>
>
> Ultimately, I think I am hitting some subtle race condition somewhere. I
> have started to dig into the Cassandra code but I barely know where to
> start looking. I realize I have not provided nearly enough information to
> easily debug the problem, but PLEASE keep your eyes open for possibly racy
> or buggy code which could cause these sorts of problems. I am willing to
> provide full Cassandra logs and a corrupt SSTable on an individual basis:
> please email me and let me know.
>
>
>
> Here is some possibly relevant information and my theories on a possible
> root cause. Again, I know little about the Cassandra code base and have
> only moderate Java experience, so these theories may be way off base.
>
> -          Strictly speaking, I probably don’t have enough memory for my
> workload. I see stop-the-world GC occurring ~30 times/day/node, often
> causing Cassandra to hang for 30+ seconds according to the gc logs (the GC
> logging setup is sketched after this list of theories). Could there be some
> Java bug where a full GC in the middle of writing or flushing
> (compaction/memtable flush), or some other disk-based activity, causes some
> sort of data corruption?
>
> -          Writes are usually done at ConsistencyLevel ONE with additional
> client-side retry logic. Given that I often see consecutive nodes in the
> ring down, could there be some edge case where a node dying at just the
> right time causes parts of mutations/messages to be lost?
>
> -          All of the CFs which have been causing me problems have large
> rows which are compacted incrementally. Could there be some problem with the
> incremental compaction logic?
>
> -          My cluster has a fairly heavy write load (again, the most
> problematic CF is getting 1500 (w/s)/(RF=2) = 750 writes/second/node).
> Furthermore, it is highly probable that there are timestamp collisions.
> Could there be some issue with timestamp logic (i.e., using > instead of >=
> or some such) during flushes/compaction?
>
> -          Once a node
>
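>
> (For reference, the GC logging mentioned in the first point is just the
> standard HotSpot flags, along the lines of the following added to JVM_OPTS
> in conf/cassandra-env.sh; the log path is arbitrary:
>
> -verbose:gc
> -Xloggc:/var/log/cassandra/gc.log
> -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps
> -XX:+PrintGCApplicationStoppedTime
>
> PrintGCApplicationStoppedTime records the total time application threads
> were stopped, which makes the long pauses easy to spot.)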
>
>
> Cluster/system information:
>
> -          4 nodes with RF=2
>
> -          Nodes have 8 cores and 24 GB of RAM apiece.
>
> -          2 HDs, 1 for commit log/system, 1 for /var/lib/cassandra/data
>
> -          OS is Ubuntu 10.04 (uname -r = 2.6.32-24-server)
>
> -          Java:
>
> o   java version "1.6.0_22"
>
> o   Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
>
> o   Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
>
> -          Slightly modified version of 0.7.0 (modified to include file
> information in exceptions)
>
>
>
> The following cassandra.yaml properties have been changed from their defaults:
>
> -          commitlog_sync_period_in_ms: 100 (with commitlog_sync: periodic)
>
> -          disk_access_mode: mmap_index_only
>
> -          concurrent_reads: 12
>
> -          concurrent_writes: 2 (was 32, but I dropped it to 2 to try to
> eliminate any mutation race conditions; it did not seem to help)
>
> -          sliced_buffer_size_in_kb: 128
>
> -          in_memory_compaction_limit_in_mb: 50
>
> -          rpc_timeout_in_ms: 15000
>
>
>
> Schema for most problematic CF:
>
> name: DeviceEventsByDevice
>
> column_type: Standard
>
> memtable_throughput_in_mb: 150
>
> memtable_operations_in_millions: 1.5
>
> gc_grace_seconds: 172800
>
> keys_cached: 1000000
>
> rows_cached: 0
>
>
>
> Dan Hendry
>
> (403) 660-2297
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
