> We are starting to use Cassandra (version 2.0.1), are doing system-level
> tests, and have been running into a few issues with SSTables being
> corrupted.
>
> The supposition is that these are caused by:
>
> https://issues.apache.org/jira/browse/CASSANDRA-5202
>
> One example is a corrupted SSTable (note that the full stack trace showed a
> read path for a counter column, based on the mask variable, and we don't
> have any counter columns):
>
> Caused by: java.io.EOFException
>         at java.io.RandomAccessFile.readFully(RandomAccessFile.java:446)
>         at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
>         at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:348)
>         at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392)
>         at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:355)
>         at org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:110)
>         at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:85)
>         at org.apache.cassandra.db.Column$1.computeNext(Column.java:75)
>
> We did have one problem that looked like it was caused by compaction during
> a CF delete; however, I do not have that stack trace available at the moment
> (though I will follow up with it, because it caused an assertion error on
> node restart). That would probably be the only case we'd see in production,
> because otherwise we'd never reuse CFs; we'd delete old ones containing
> previous time windows.
>
> My more general question is what Cassandra's philosophy is toward this
> (IOError). Currently these cause timeouts. From the code, it looks like
> certain code paths throw FSError, which at least tries to deal with disk
> failures; however, it is not unreasonable for C* not to go out of its way
> to make illegal states nice.
>
> Still, when this does happen, we've seen really bad behavior. If it is a
> corrupted SSTable, it is unlikely that either dead-node detection or the
> dynamic snitch will help. Clearly we need to configure our server timeouts
> lower and let the client retry more aggressively.
>
> I guess my question is: what are people doing in the real world (we're using
> Astyanax in this case)? Are you implementing app-level stuff like falling
> back from (LOCAL_)QUORUM to ONE reads, making sure that ERROR level on the
> server hits 24x7 support immediately, or whatever? (A rough sketch of the
> fallback idea follows below.)
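For reference, here is a minimal sketch of the consistency-level fallback mentioned above. It is deliberately not tied to Astyanax's API: readAtQuorum and readAtOne are hypothetical placeholders for whatever client calls you actually issue at LOCAL_QUORUM and ONE. It is a sketch of the pattern, not a definitive implementation, and it assumes that a timeout at quorum may be caused by one bad replica (for example, one holding a corrupt SSTable).

```java
import java.util.concurrent.Callable;

/**
 * Sketch: attempt a read at LOCAL_QUORUM and, if that fails (for example,
 * times out because one replica is stuck on a corrupt SSTable), retry the
 * same read at consistency level ONE. The callables are placeholders for
 * the real client calls; nothing here is Astyanax-specific.
 */
public final class FallbackRead {

    private FallbackRead() {}

    public static <T> T readWithFallback(Callable<T> readAtQuorum,
                                         Callable<T> readAtOne) throws Exception {
        try {
            // Normal path: read at LOCAL_QUORUM.
            return readAtQuorum.call();
        } catch (Exception quorumFailure) {
            // Assumption: the failure is replica-local, so a ONE read gives
            // the coordinator a chance to answer from a healthy replica,
            // at the cost of weaker consistency for this request.
            return readAtOne.call();
        }
    }
}
```

The obvious trade-off is that the fallback read may return stale data, so it only makes sense for reads where availability matters more than strict consistency; pairing it with aggressive server-side alerting on ERROR-level logs (as the question suggests) keeps the degraded mode visible rather than silent.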