Hello All,

On our fork of Cassandra, we've implemented some custom behavior for
handling CommitLog and SSTable Corruption errors. Specifically, if a node
detects one of those errors, we want the node to stop itself, and if the
node is restarted, we want initialization to fail. This is useful in
Kubernetes, where nodes are expected to restart frequently, and it makes
our corruption remediation workflows less error-prone. I think we could
make this behavior pluggable by letting users supply, via config, custom
implementations of the FSErrorHandler and of the commit log error handler
currently implemented at
org.apache.cassandra.db.commitlog.CommitLog#handleCommitError, in the same
way one can provide custom Partitioners and Authenticators/Authorizers.
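
As a rough illustration of what config-driven handler selection could look
like, here is a minimal, self-contained sketch. It assumes the handler
class name arrives as a string from config. CorruptionHandler,
StopNodeHandler, and HandlerLoader are hypothetical names used only for
this example; they are not Cassandra classes, and the real FSErrorHandler
interface has a different shape.

```java
// Hypothetical stand-in for an error-handler interface (not Cassandra's API).
interface CorruptionHandler {
    String handle(Throwable error);
}

// Hypothetical implementation that a user might name in config.
final class StopNodeHandler implements CorruptionHandler {
    @Override
    public String handle(Throwable error) {
        // A real handler would stop the node; here we only report the decision.
        return "stop: " + error.getMessage();
    }
}

// Hypothetical loader: instantiates a handler from a class name via reflection.
final class HandlerLoader {
    private HandlerLoader() {}

    static CorruptionHandler load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            // Reject classes that do not implement the expected interface.
            return cls.asSubclass(CorruptionHandler.class)
                      .getDeclaredConstructor()
                      .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Invalid handler class: " + className, e);
        }
    }
}
```

Failing fast with a clear message when the configured class is missing or
of the wrong type keeps a misconfiguration from surfacing later as an
unhandled corruption error.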

Would you accept either or both of the following as a contribution?
1. User-provided implementations of FSErrorHandler and
CommitLogErrorHandler, set via config
2. New commit failure and disk failure policies that write a poison pill
file to disk and fail on startup if that file exists

The poison pill implementation is what we currently use - we call this a
"Non Transient Error" and we want these states to always require manual
intervention to resolve, including manual action to clear the error. I'd be
happy to contribute this if other users would find it beneficial. I had
initially shared this question in Slack, but I'm now sharing it here for
broader visibility.
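
To make the poison pill idea concrete, here is a minimal sketch of the
mechanism as described above: write a marker file when a non-transient
error is detected, and refuse to start while that file exists. The class
name PoisonPill, the file name "non_transient_error", and the method names
are all hypothetical and chosen for illustration; this is not the code
from our fork and not an existing Cassandra API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper illustrating the poison pill pattern.
final class PoisonPill {
    // Hypothetical marker file name.
    static final String FILE_NAME = "non_transient_error";

    private PoisonPill() {}

    /** Records a non-transient error by writing the marker file into dir. */
    static Path write(Path dir, String reason) throws IOException {
        Files.createDirectories(dir);
        Path marker = dir.resolve(FILE_NAME);
        Files.write(marker, reason.getBytes(StandardCharsets.UTF_8));
        return marker;
    }

    /** Startup check: throws while the marker file is still present. */
    static void checkOnStartup(Path dir) {
        Path marker = dir.resolve(FILE_NAME);
        if (Files.exists(marker)) {
            throw new IllegalStateException(
                "Non-transient error marker found at " + marker
                + "; remove it manually after remediation");
        }
    }
}
```

Because the marker lives on disk, the failed state survives restarts, and
only an explicit manual deletion of the file clears it.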

-Raymond Huffman
