We are running a 30 node 1.0.5 cassandra cluster running RHEL 5.6 x86_64 virtualized on ESXi 5.0. We are seeing Decorated Key assertion error during compactions and at this point we are suspecting anything from OS/ESXi/HBA/iSCSI RAID. Please correct me i am wrong, once a node gets into this state I don't see any way to recover unless I remove the corrupted data file and restart cassandra. I am running tests with replication factor 3 and all reads and writes are done with QUORUM. So i believe there will not be data loss if i do this.
If this is a correct way to recover I would like to know how to gracefully do this in production environment.. - Disable thrift - Disable gossip - Drain the node - kill the cassandra java process ( send a sigterm and or sigkill ) - do a filesystem sync - remove the corrupted file from the /var/lib/cassandra/data directory - start cassandra - enable gossip so all pending hintedhandoff occurs - enable thrift. Thanks Ramesh