Hello All,

On our fork of Cassandra, we've implemented some custom behavior for handling CommitLog and SSTable corruption errors. Specifically, if a node detects one of those errors, we want the node to stop itself, and if the node is restarted, we want initialization to fail. This is useful in Kubernetes, where nodes are expected to be restarted frequently, and it makes our corruption remediation workflows less error-prone.

I think we could make this behavior more pluggable by allowing users to provide custom implementations of the FSErrorHandler and of the commit log error handler that is currently implemented at org.apache.cassandra.db.commitlog.CommitLog#handleCommitError. These would be set via config, in the same way one can provide custom Partitioners and Authenticators/Authorizers.
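
As a rough illustration only, here is a minimal Java sketch of a handler selected by class name (as a config value would be) that records a corruption error in a marker file and refuses startup while that file exists. Every name in it (CorruptionHandler, MarkerFileHandler, the non_transient_error file name) is hypothetical. It is not the FSErrorHandler interface or any existing Cassandra API, and a real handler would also stop the node after writing the marker.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; none of these types exist in Cassandra.
public class MarkerFileSketch {

    /** Hypothetical pluggable hook, standing in for a user-provided error handler. */
    interface CorruptionHandler {
        void onCorruption(Throwable cause); // called when corruption is detected
        void checkOnStartup();              // called early during initialization
    }

    /** Writes a marker file on corruption and refuses startup while it exists. */
    static final class MarkerFileHandler implements CorruptionHandler {
        static final String MARKER_NAME = "non_transient_error"; // assumed file name
        private final Path marker;

        MarkerFileHandler(Path dataDir) {
            this.marker = dataDir.resolve(MARKER_NAME);
        }

        @Override
        public void onCorruption(Throwable cause) {
            try {
                // Persist the reason so the state survives a restart.
                Files.write(marker, String.valueOf(cause).getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // A real node would stop itself here; omitted in this sketch.
        }

        @Override
        public void checkOnStartup() {
            if (Files.exists(marker)) {
                throw new IllegalStateException(
                    "Marker file present, manual intervention required: " + marker);
            }
        }
    }

    /** Instantiates a handler from a class name, as a config-driven loader might. */
    static CorruptionHandler load(String className, Path dataDir)
            throws ReflectiveOperationException {
        return Class.forName(className)
                    .asSubclass(CorruptionHandler.class)
                    .getDeclaredConstructor(Path.class)
                    .newInstance(dataDir);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("marker-sketch");
        CorruptionHandler handler = load(MarkerFileHandler.class.getName(), dir);

        handler.checkOnStartup(); // clean directory: startup proceeds
        handler.onCorruption(new IOException("simulated corrupt sstable"));

        try {
            handler.checkOnStartup(); // simulated restart after the error
        } catch (IllegalStateException e) {
            System.out.println("startup refused: " + e.getMessage());
        }

        // Manual remediation: an operator removes the marker explicitly.
        Files.delete(dir.resolve(MarkerFileHandler.MARKER_NAME));
        handler.checkOnStartup();
        System.out.println("startup allowed after manual clear");
    }
}
```

The point of the marker file is that the failure state lives on disk, not in process memory, so an automatic restart by an orchestrator cannot clear it.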
Would you take one or both of the following as a contribution?

1. User-provided implementations of FSErrorHandler and CommitLogErrorHandler, set via config
2. New commit failure and disk failure policies that write a poison pill file to disk and fail on startup if that file exists

The poison pill implementation is what we currently use. We call this state a "Non Transient Error", and we want it to always require manual intervention to resolve, including manual action to clear the error.

I'd be happy to contribute this if other users would find it beneficial. I had initially shared this question in Slack, but I'm now sharing it here for broader visibility.

-Raymond Huffman