[ https://issues.apache.org/jira/browse/CASSANDRA-19958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941945#comment-17941945 ]
Jaydeepkumar Chovatia commented on CASSANDRA-19958:
---------------------------------------------------

There are no ordering guarantees between mutations and hints, so the proposed solution should be fine from a functional point of view. Additionally, we have been using the proposed solution in our fleet for quite some time without any issues.

> Local Hints are stepping on local mutations
> -------------------------------------------
>
>                 Key: CASSANDRA-19958
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19958
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jaydeepkumar Chovatia
>            Assignee: Jaydeepkumar Chovatia
>            Priority: Normal
>         Attachments: image-2024-09-26-15-28-20-435.png
>
>
> Cassandra uses the same queue (Stage.MUTATION) to process local mutations as well as local hint writes. CASSANDRA-19534 enhanced local mutations by adding timeouts, but local hint writing does not honor that timeout by design, because it honors a different timeout, i.e. _max_hint_window_in_ms_.
>
> *The Problem*
> Consider a five-node Cassandra cluster N1, N2, N3, N4, N5 with the following configuration:
> * concurrent_writes: 10
> * native_transport_timeout: 5s
> * write_request_timeout_in_ms: 2000 // 2 seconds
>
> +StorageProxy.java snippet...+
>
> !image-2024-09-26-15-28-20-435.png|width=600,height=200!
>
> Assume N4 and N5 are slow, flapping, or down, and N1 receives a flurry of mutations. This is what happens on N1:
> # Line no 1542: Append 100 hints to the Stage.MUTATION queue
> # Line no 1547: Append 100 local mutations to the Stage.MUTATION queue
>
> The Stage.MUTATION queue on N1 would look as follows:
> {code:java}
> hint1,hint2,hint3,....hint100,mutation1,mutation2,....mutation100 {code}
> If each hint runnable takes 1 second, then with concurrent_writes: 10 it takes about 10 seconds (100 hints across 10 threads) to process the 100 hints, and only after that will the local mutations be processed.
>
> So, in production, N1 appears inactive for almost 10 seconds: it is just writing hints locally and is not participating in any quorum, etc.
>
> The problem becomes severe under high load; if hints pile up to 1M, N1 will choke. The only remedy at that point is for an operator to restart N1 in order to drain all the piled-up hints from the Stage.MUTATION queue.
>
> The above problem happens because local hint writing and local mutation writing both use the same queue, i.e. Stage.MUTATION. Local mutation writing is on the hot path, whereas a slight delay in local hint writing does not cause much trouble.
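>
> To make the head-of-line blocking concrete, here is a minimal, self-contained Java sketch (illustrative only, not Cassandra code; the pool size, task counts, and timings are made up) showing how slow hint tasks queued ahead of mutation tasks on a single shared executor delay the mutations:
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> public class SharedStageDemo
> {
>     public static void main(String[] args) throws Exception
>     {
>         // Stand-in for Stage.MUTATION: one shared, bounded thread pool.
>         ExecutorService mutationStage = Executors.newFixedThreadPool(2);
>         long start = System.nanoTime();
>
>         // Hints are enqueued first (as in the StorageProxy snippet), each taking ~1s.
>         for (int i = 1; i <= 10; i++)
>         {
>             final int id = i;
>             mutationStage.execute(() -> {
>                 sleep(1000); // simulated slow local hint write
>                 log(start, "hint" + id + " written");
>             });
>         }
>
>         // Local mutations are enqueued afterwards; they are cheap but must wait
>         // behind every queued hint (head-of-line blocking).
>         for (int i = 1; i <= 5; i++)
>         {
>             final int id = i;
>             mutationStage.execute(() -> log(start, "mutation" + id + " applied"));
>         }
>
>         mutationStage.shutdown();
>         mutationStage.awaitTermination(1, TimeUnit.MINUTES);
>     }
>
>     private static void log(long start, String msg)
>     {
>         long elapsedMs = (System.nanoTime() - start) / 1_000_000;
>         System.out.printf("t=%4dms  %s%n", elapsedMs, msg);
>     }
>
>     private static void sleep(long ms)
>     {
>         try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
>     }
> } {code}
> With 2 worker threads and 10 one-second hints, the first mutation does not run until roughly t=5000ms, mirroring how N1 looks inactive while it drains queued hints.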
>
> *Reproducible steps*
> # Pull the latest 4.1.x release
> # Create a 5-node cluster
> # Set the following configuration
> {code:java}
> native_transport_timeout: 10s
> write_request_timeout_in_ms: 2000
> enforce_native_deadline_for_hints: true{code}
> # Inject 1s of latency inside the following API in _StorageProxy.java_ on all five nodes
> {code:java}
> private static void performLocally(Stage stage, Replica localReplica, final Runnable runnable,
>                                    final RequestCallback<?> handler, Object description,
>                                    Dispatcher.RequestTime requestTime)
> {
>     stage.maybeExecuteImmediately(new LocalMutationRunnable(localReplica, requestTime)
>     {
>         public void runMayThrow()
>         {
>             try
>             {
>                 Thread.sleep(1000); // Inject latency here
>                 runnable.run();
>                 handler.onResponse(null);
>             }
>             catch (Exception ex)
>             {
>                 if (!(ex instanceof WriteTimeoutException))
>                     logger.error("Failed to apply mutation locally : ", ex);
>                 handler.onFailure(FBUtilities.getBroadcastAddressAndPort(),
>                                   RequestFailureReason.forException(ex));
>             }
>         }
>
>         @Override
>         public String description()
>         {
>             // description is an Object and toString() is called so we do not
>             // have to evaluate Mutation.toString() unless explicitly checked
>             return description.toString();
>         }
>
>         @Override
>         protected Verb verb()
>         {
>             return Verb.MUTATION_REQ;
>         }
>     });
> } {code}
> # Run a write-only stress workload for an hour or so
> # You will see the Stage.MUTATION queue pile up to more than 1 million entries
> # Stop the load
> # The Stage.MUTATION queue will not clear immediately, and you cannot perform new writes. At this point the cluster has become inoperable from a new-mutation point of view; only reads will be served
>
> *Solution*
> The solution is to segregate the local mutation queue and the local hint writing queue to address the problem above. Here is the PR:
> [https://github.com/apache/cassandra/pull/3580]
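>
> The actual change is in the linked PR; as a rough, hypothetical sketch of the idea only (the executor names and pool sizes below are illustrative and not necessarily what the PR uses), local hint writes are submitted to a dedicated executor instead of Stage.MUTATION, so a backlog of hints can no longer delay local mutations:
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Illustrative only: keep local hint writes on their own executor, separate from
> // the executor that applies local mutations, so a hint backlog cannot block mutations.
> public class SegregatedStages
> {
>     // Hypothetical stand-ins for Stage.MUTATION and a dedicated hint-write stage.
>     private static final ExecutorService MUTATION_STAGE   = Executors.newFixedThreadPool(10);
>     private static final ExecutorService HINT_WRITE_STAGE = Executors.newFixedThreadPool(2);
>
>     public static void submitLocalMutation(Runnable applyMutation)
>     {
>         MUTATION_STAGE.execute(applyMutation);   // hot path: never queued behind hints
>     }
>
>     public static void submitLocalHintWrite(Runnable writeHint)
>     {
>         HINT_WRITE_STAGE.execute(writeHint);     // hints tolerate delay; backlog stays isolated
>     }
> } {code}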