[ https://issues.apache.org/jira/browse/IGNITE-21062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Lapin updated IGNITE-21062: ------------------------------------- Reviewer: Ivan Bessonov > Safe time reordering in partitions > ---------------------------------- > > Key: IGNITE-21062 > URL: https://issues.apache.org/jira/browse/IGNITE-21062 > Project: Ignite > Issue Type: Bug > Reporter: Ivan Bessonov > Assignee: Alexander Lapin > Priority: Major > Labels: ignite-3 > Time Spent: 1h 10m > Remaining Estimate: 0h > > In the scenario of creating a lot of table and having slow system > (presumably), it's possible to notice {{Safe time reordering detected > [current=...}} assertion error in logs. > It happens with safe-time sync commands, in the absence of transactional load. > h3. UPD #1 > Following steps will bring us to the "Safe time reordering detected" problem. > 0. PartitionReplicaListener and PartitionListener are located on the same > Ignite node. No serialization/deserialization within raft itself. > maxObservableSafeTime = -1, maxObservableSafeTimeVerifier = -1. > 1. SafeTimePropagatingCommand with safeTime = 10 successfully passes > maxObservableSafeTime gateway, messaging thread is free, raft thread handles > the command and hangs before stateMachine.onWrite() -> maxObservableSafeTime > = 10, maxObservableSafeTimeVerifier = -1 > 2. Client for some reason (e.g. TimeoutException) re-send same command that > will be rolled back because both command.safeTime and maxObservableSafeTime > == 10. > 3. Client updates the command with new safeTime (e.g. 20) which also will > modify "initial command" from the step 1. > 4. Raft thread from the step 1 apply unintentionally updated command to a > state machine -> maxObservableSafeTime = 10, maxObservableSafeTimeVerifier = > 20 (not expected, should be the same as in maxObservableSafeTime). > 5. Retry command from step 3 successfully passes the gateway and face an > assertion on maxObservableSafeTimeVerifier == command.safeTime > Rather often there was exact safeTime matching in logs, that proves given > explanation. > {code:java} > [2023-11-29T14:27:05,893][ERROR][%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-_stripe_7-0][StripedDisruptor] > Handle disruptor event error > [name=%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-, > event=org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTask@2e6cd30d, > hasHandler=false] > java.lang.AssertionError: Safe time reordering detected > [current=111494301331423233, proposed=111494301331423233] > All in all, that means, that messages should be immutable.{code} > * Initially, to verify that aforementioned explanation is correct, it's > possible to clone the message and update the cloned copy. > * However, it's better to exclude the ability to call set<> on message > itself. Meaning that some sort of factory method is much more reliable here. > That will be covered in a separate ticket. > -- This message was sent by Atlassian Jira (v8.20.10#820010)