> We are starting to use Cassandra (version 2.0.1) and are doing system-level 
> tests, and we have run into a few issues with SSTables being corrupted.
> 
> The supposition is that these are caused by:
> 
> https://issues.apache.org/jira/browse/CASSANDRA-5202
> 
> One example is a corrupted SSTable (note that the full stack trace showed a 
> read path for a counter column, based on the mask variable, even though we 
> don't have any counter columns):
>  
> Caused by: java.io.EOFException
>         at java.io.RandomAccessFile.readFully(RandomAccessFile.java:446)
>         at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
>         at 
> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:348)
>         at 
> org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392)
>         at 
> org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:355)
>         at 
> org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:110)
>         at 
> org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:85)
>         at org.apache.cassandra.db.Column$1.computeNext(Column.java:75)
> 
> We did have one problem that looked like it was caused by compaction during a 
> CF delete; however, I do not have that stack trace available at the moment 
> (though I will follow up with it, because it caused an assertion error on node 
> restart). That would probably be the only case we'd see in production, 
> because otherwise we'd never reuse CFs; we'd just delete old ones containing 
> previous time windows.
> 
> My more general question is what Cassandra's philosophy is toward this kind 
> of IOError. Currently these cause timeouts. It looks from the code that 
> certain code paths throw FSError, which at least tries to deal with disk 
> failures; however, it is not unreasonable for C* not to go out of its way 
> to make illegal states nice.
> 
> That said, when this does happen, we've seen really bad behavior. If it is a 
> corrupted SSTable, it is unlikely that either dead-node detection or the 
> dynamic snitch will help. Clearly we need to configure our server timeouts 
> lower and let the client retry more aggressively.
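> 
> On the client side, something along these lines is what we have in mind for 
> the Astyanax connection pool (untested sketch; the cluster name, keyspace, 
> seed list and timeout values below are all placeholders):
> 
>   import com.netflix.astyanax.AstyanaxContext;
>   import com.netflix.astyanax.Keyspace;
>   import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
>   import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
>   import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
>   import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
>   import com.netflix.astyanax.thrift.ThriftFamilyFactory;
> 
>   AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
>       .forCluster("TestCluster")                  // placeholder cluster name
>       .forKeyspace("my_keyspace")                 // placeholder keyspace
>       .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
>           .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
>       .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
>           .setPort(9160)
>           .setMaxConnsPerHost(10)
>           .setSocketTimeout(2000)                 // fail fast rather than wait out a sick node
>           .setConnectTimeout(2000)
>           .setSeeds("127.0.0.1:9160"))            // placeholder seed list
>       .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
>       .buildKeyspace(ThriftFamilyFactory.getInstance());
>   context.start();
>   Keyspace keyspace = context.getClient();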
> 
> I guess my question is: what are people doing in the real world (we're using 
> Astyanax in this case)… are you implementing app-level stuff like falling 
> back from (LOCAL_)QUORUM to ONE reads, making sure that ERROR level on the 
> server hits 24x7 support immediately, or whatever?
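> 
> For the QUORUM-to-ONE fallback, something like the sketch below is what I'm 
> picturing (untested; CF, keyspace and rowKey are placeholders, and the caller 
> has to tolerate potentially stale data on the CL_ONE path):
> 
>   import com.netflix.astyanax.Keyspace;
>   import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
>   import com.netflix.astyanax.model.ColumnFamily;
>   import com.netflix.astyanax.model.ColumnList;
>   import com.netflix.astyanax.model.ConsistencyLevel;
>   import com.netflix.astyanax.retry.ExponentialBackoff;
>   import com.netflix.astyanax.serializers.StringSerializer;
> 
>   static final ColumnFamily<String, String> CF = ColumnFamily.newColumnFamily(
>       "my_cf", StringSerializer.get(), StringSerializer.get());   // placeholder CF
> 
>   static ColumnList<String> readWithFallback(Keyspace keyspace, String rowKey)
>           throws ConnectionException {
>       try {
>           // Normal path: LOCAL_QUORUM with a bounded retry before giving up.
>           return keyspace.prepareQuery(CF)
>               .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
>               .withRetryPolicy(new ExponentialBackoff(250, 2))
>               .getKey(rowKey)
>               .execute()
>               .getResult();
>       } catch (ConnectionException e) {
>           // Degraded path: any single healthy replica can answer.
>           return keyspace.prepareQuery(CF)
>               .setConsistencyLevel(ConsistencyLevel.CL_ONE)
>               .getKey(rowKey)
>               .execute()
>               .getResult();
>       }
>   }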
