Hello Raymond,

Do you have draft changes to look at?

I'd suggest a more general approach, as some of the existing
interfaces seem to overlap. There are the FSErrorHandler and the
JVMStabilityInspector, neither of which can currently be set via
user configuration. I think it would be possible to have a single
public interface, so that users could plug in their own handlers via
configuration:

public interface FailureHandler
{
    public boolean onFailure(Component type, FailureHandlerContext context);
}

It seems to me that the JVMStabilityInspector is a good candidate for
the default implementation of the FailureHandler API as it already
handles OOM, CommitLog errors, and disk errors as far as I can see.
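
To make the idea concrete, here is a rough sketch of how the poison
pill behaviour from the original proposal might sit behind such an
interface. Everything apart from the FailureHandler signature is an
assumption for illustration only: the Component values, the fields of
FailureHandlerContext, the marker file name, the meaning of the
boolean return value, and the startupBlocked helper are all
hypothetical and not existing Cassandra APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical: which subsystem reported the failure.
enum Component { COMMIT_LOG, SSTABLE, DISK }

// Hypothetical context; a real one would likely carry more state.
final class FailureHandlerContext
{
    final Throwable cause;
    final Path dataDirectory;

    FailureHandlerContext(Throwable cause, Path dataDirectory)
    {
        this.cause = cause;
        this.dataDirectory = dataDirectory;
    }
}

interface FailureHandler
{
    // Assumption: true means the handler fully dealt with the failure.
    boolean onFailure(Component type, FailureHandlerContext context);
}

// Writes a marker file on failure so that a later startup check can
// refuse to proceed. Stopping the node itself is omitted here.
final class PoisonPillFailureHandler implements FailureHandler
{
    // Hypothetical marker file name.
    static final String MARKER = "non_transient_error";

    @Override
    public boolean onFailure(Component type, FailureHandlerContext context)
    {
        Path marker = context.dataDirectory.resolve(MARKER);
        try
        {
            // Record what failed so an operator can inspect it later.
            Files.writeString(marker, type + ": " + context.cause);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    // Hypothetical startup check: true while the marker file is present,
    // i.e. until someone removes it by hand.
    static boolean startupBlocked(Path dataDirectory)
    {
        return Files.exists(dataDirectory.resolve(MARKER));
    }
}
```

With something like this, the existing failure policies and a poison
pill policy would simply be different FailureHandler implementations,
and which one is used becomes a configuration choice.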

On Sat, 16 Dec 2023 at 03:43, Josh McKenzie <jmcken...@apache.org> wrote:
>
> Adding a poison-pill error option on finding of corrupt data makes sense to 
> me. Not sure if there's enough demand / other customization being done in 
> this space to justify the user customizable aspect; any immediate other 
> approaches come to mind? If not, this isn't an area of the code that's 
> changed all that much, so just adding a new option seems surgical and minimal 
> to me.
>
> On Tue, Dec 12, 2023, at 4:21 AM, Claude Warren, Jr via dev wrote:
>
> I can see this as a strong improvement in Cassandra management and support it.
>
> +1 non binding
>
> On Mon, Dec 11, 2023 at 8:28 PM Raymond Huffman <raymondmhuff...@gmail.com> 
> wrote:
>
> Hello All,
>
> On our fork of Cassandra, we've implemented some custom behavior for handling 
> CommitLog and SSTable Corruption errors. Specifically, if a node detects one 
> of those errors, we want the node to stop itself, and if the node is 
> restarted, we want initialization to fail. This is useful in Kubernetes when 
> you expect nodes to be restarted frequently and makes our corruption 
> remediation workflows less error-prone. I think we could make this behavior 
> more pluggable by allowing users to provide custom implementations of the 
> FSErrorHandler, and the error handler that's currently implemented at 
> org.apache.cassandra.db.commitlog.CommitLog#handleCommitError via config in 
> the same way one can provide custom Partitioners and 
> Authenticators/Authorizers.
>
> Would you take as a contribution one of the following?
> 1. user provided implementations of FSErrorHandler and CommitLogErrorHandler, 
> set via config; and/or
> 2. new commit failure and disk failure policies that write a poison pill file 
> to disk and fail on startup if that file exists
>
> The poison pill implementation is what we currently use - we call this a "Non 
> Transient Error" and we want these states to always require manual 
> intervention to resolve, including manual action to clear the error. I'd be 
> happy to contribute this if other users would find it beneficial. I had 
> initially shared this question in Slack, but I'm now sharing it here for 
> broader visibility.
>
> -Raymond Huffman
>
>
