Hello Raymond,

Do you have draft changes to look at?
I'd suggest a more general approach, as some of the existing interfaces seem to
overlap. There are the FSErrorHandler and the JVMStabilityInspector, neither of
which can currently be replaced via user configuration. I think it would be
possible to have a public interface that lets users plug in their own handlers
via configuration:

    public interface FailureHandler
    {
        public boolean onFailure(Component type, FailureHandlerContext context);
    }

It seems to me that the JVMStabilityInspector is a good candidate for the
default implementation of the FailureHandler API, as it already handles OOM,
CommitLog errors, and disk errors as far as I can see.

On Sat, 16 Dec 2023 at 03:43, Josh McKenzie <jmcken...@apache.org> wrote:
>
> Adding a poison-pill error option on finding of corrupt data makes sense to
> me. Not sure if there's enough demand / other customization being done in
> this space to justify the user customizable aspect; any immediate other
> approaches come to mind? If not, this isn't an area of the code that's
> changed all that much, so just adding a new option seems surgical and minimal
> to me.
>
> On Tue, Dec 12, 2023, at 4:21 AM, Claude Warren, Jr via dev wrote:
>
> I can see this as a strong improvement in Cassandra management and support it.
>
> +1 non binding
>
> On Mon, Dec 11, 2023 at 8:28 PM Raymond Huffman <raymondmhuff...@gmail.com>
> wrote:
>
> Hello All,
>
> On our fork of Cassandra, we've implemented some custom behavior for handling
> CommitLog and SSTable Corruption errors. Specifically, if a node detects one
> of those errors, we want the node to stop itself, and if the node is
> restarted, we want initialization to fail. This is useful in Kubernetes when
> you expect nodes to be restarted frequently and makes our corruption
> remediation workflows less error-prone. I think we could make this behavior
> more pluggable by allowing users to provide custom implementations of the
> FSErrorHandler, and the error handler that's currently implemented at
> org.apache.cassandra.db.commitlog.CommitLog#handleCommitError via config in
> the same way one can provide custom Partitioners and
> Authenticators/Authorizers.
>
> Would you take as a contribution one of the following?
> 1. user provided implementations of FSErrorHandler and CommitLogErrorHandler,
> set via config; and/or
> 2. new commit failure and disk failure policies that write a poison pill file
> to disk and fail on startup if that file exists
>
> The poison pill implementation is what we currently use - we call this a "Non
> Transient Error" and we want these states to always require manual
> intervention to resolve, including manual action to clear the error. I'd be
> happy to contribute this if other users would find it beneficial. I had
> initially shared this question in Slack, but I'm now sharing it here for
> broader visibility.
>
> -Raymond Huffman
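
To make the FailureHandler idea and the poison pill policy from this thread a
bit more concrete, here is a rough, standalone sketch of how the two could fit
together. It is not Cassandra code and everything beyond the interface
signature is an assumption for illustration: the Component values, the fields
of FailureHandlerContext, the marker file name, the reading of the boolean
return value as "handled", and the checkOnStartup helper are all hypothetical.
Stopping the node itself is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: none of these types exist in Cassandra under these names.
final class PoisonPillSketch
{
    /** Subsystem that reported the failure (assumed values). */
    enum Component { COMMIT_LOG, SSTABLE, DISK }

    /** Minimal stand-in for the proposed FailureHandlerContext (assumed fields). */
    static final class FailureHandlerContext
    {
        final Throwable cause;
        final Path dataDirectory;

        FailureHandlerContext(Throwable cause, Path dataDirectory)
        {
            this.cause = cause;
            this.dataDirectory = dataDirectory;
        }
    }

    /** Interface from the proposal; the return value is assumed to mean "handled". */
    interface FailureHandler
    {
        boolean onFailure(Component type, FailureHandlerContext context);
    }

    /** Records a non-transient error on disk so that later startups refuse to proceed. */
    static final class PoisonPillFailureHandler implements FailureHandler
    {
        static final String MARKER = "non_transient_error"; // assumed file name

        @Override
        public boolean onFailure(Component type, FailureHandlerContext context)
        {
            Path marker = context.dataDirectory.resolve(MARKER);
            String message = type + ": " + context.cause + System.lineSeparator();
            try
            {
                Files.write(marker, message.getBytes(StandardCharsets.UTF_8));
                return true;
            }
            catch (IOException e)
            {
                return false; // marker could not be persisted; the caller decides what to do
            }
        }

        /** Startup check: throws if a marker from an earlier failure is still present. */
        static void checkOnStartup(Path dataDirectory)
        {
            Path marker = dataDirectory.resolve(MARKER);
            if (Files.exists(marker))
                throw new IllegalStateException("Non-transient error recorded at " + marker
                                                + "; remove the file manually after remediation");
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path dir = Files.createTempDirectory("poison-pill-demo");
        PoisonPillFailureHandler.checkOnStartup(dir); // clean directory: passes

        FailureHandler handler = new PoisonPillFailureHandler();
        handler.onFailure(Component.COMMIT_LOG,
                          new FailureHandlerContext(new IOException("bad checksum"), dir));
        try
        {
            PoisonPillFailureHandler.checkOnStartup(dir);
        }
        catch (IllegalStateException e)
        {
            System.out.println("startup refused: " + e.getMessage());
        }
    }
}
```

The marker is only ever removed by hand, which matches the "manual action to
clear the error" requirement described in the original message. A real
implementation would presumably be selected by class name in the configuration,
the same way custom Partitioners and Authenticators are.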