Forgot to mention: we also tried IgniteLock::tryLock (with a small timeout) instead of IgniteLock::lock, and we observed no difference. Even with tryLock, Ignite gets stuck forever in the same busy-wait snippet ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401 ).
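For context, a possible client-side guard (which does not fix the underlying Ignite problem, and which we have not fully validated) is to run the blocking acquisition on a helper thread and cap it with Future.get. This is only a sketch: the hung Ignite call is simulated here by a task that never completes, and interrupting it would not clean up Ignite's internal lock state.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedLockAcquire {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Stand-in for the Ignite call that hangs in the busy-wait;
        // in real code this would be ignite.reentrantLock(...).lock().
        Runnable hangingLock = () -> {
            try {
                new CountDownLatch(1).await(); // the ack never arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // give up on cancellation
            }
        };

        Future<?> acquisition = pool.submit(hangingLock);
        try {
            acquisition.get(2, TimeUnit.SECONDS); // bound the wait ourselves
            System.out.println("lock acquired");
        } catch (TimeoutException e) {
            acquisition.cancel(true); // interrupt the stuck worker thread
            System.out.println("lock acquisition timed out");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caveat is that the worker thread may stay parked inside Ignite's spin loop even after cancellation, so this only protects the calling thread, not the node as a whole.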
On Sun, Feb 22, 2026 at 17:38, Felipe Kersting <[email protected]> wrote:

> Hello devs!
>
> We are trying to adopt Ignite 2.17.0 in our Kubernetes application (to replace another technology we were using), but we have been facing some challenges. Our architecture is composed of a set of "backend pods," each being an Ignite node, that together form our Ignite cluster. Alongside those, we have a set of "frontend pods" that join the Ignite cluster as thick clients and serve customer requests. To handle these requests, the frontend pods extensively leverage Ignite cluster features (such as caches, distributed locks, affinity calls, services, etc.).
>
> When pods are stable, our system generally behaves fine. However, we are facing problems when we perform rolling upgrades in the Ignite cluster (i.e., our "backend pods"). In this scenario, pods (and hence the JVM) are terminated and new pods are created, one after another, in a rolling fashion. We want to implement this in a way that is transparent to the frontend pods, i.e., we don't want the frontend pods to face errors or increased latency while the backend pods are being rolled.
>
> We have identified several issues, but I want to focus on only one of them in this thread, as it is the one we have investigated most deeply. In this case, a frontend pod faces a "deadlock" when trying to acquire a simple lock if a specific backend pod is being terminated at the same time.
>
> Under normal behavior, we believe the lock acquisition flow works as follows:
>
> 1. Node A calls ignite.reentrantLock(lockName, /* failoverSafe */ true, /* fair */ false, /* create */ true).lock().
> 2. Node A commits a pessimistic transaction to acquire ownership of the lock ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L549 ). The transaction is successful.
> 3. Node A enters a busy wait, waiting for an "ack" message from another node ("node B"; presumably the node that "controls" the lock) indicating that the lock is acquired ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401 ).
> 4. Node B sends this message through GridIoManager::send ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/managers/communication/GridIoManager.java#L2011-L2022 ). It is a bit hard for me to pinpoint the exact message type responsible for acknowledging the lock acquisition, but from our logs it looks like it is a message on TOPIC_CONTINUOUS.
> 5. Node A receives the message from node B through the onUpdate handler ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L1094 ).
> 6. Once processed, node A stops busy-waiting and the program continues normally (with the lock successfully acquired).
>
> However, if a rolling upgrade is happening, node B might be terminated in the middle of this flow, leading to a deadlock. In this case, we observe the following behavior:
>
> 1. Node A calls ignite.reentrantLock(lockName, /* failoverSafe */ true, /* fair */ false, /* create */ true).lock().
> 2. Node A commits a pessimistic transaction to acquire ownership of the lock ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L549 ). The transaction is successful.
> 3. Node A enters a busy wait, waiting for the "ack" message from node B ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401 ).
> 4. Node B is stopped (via Ignition.stop(igniteName, /* cancel */ true)) before it sends the message.
> 5. The message is never sent, so node A remains in the busy wait forever. A distributed deadlock occurs.
>
> In this scenario, we can see through debugging that the TOPIC_CONTINUOUS message that usually unblocks node A is never received by node A. This happens because it is never emitted by node B, since node B was stopped (via Ignition.stop) before it could send it.
>
> Additional information:
> - Our lock is always created with failoverSafe = true, but the issue occurs nonetheless;
> - This same scenario may also happen when unlocking the lock (i.e., in the lock.unlock() call);
> - This issue causes not only the thread that acquired the lock to be stuck, but also all other threads in node A that try to acquire that same lock, leading to complete system degradation;
> - Ignite never recovers from this scenario; it stays in the busy wait forever;
> - We have tried using ShutdownPolicy IMMEDIATE and GRACEFUL; we saw no change in behavior;
> - Our cluster is deployed on Kubernetes, using a headless service for communication and discovery.
>
> We are, and will remain, studying these issues to try to circumvent them. We are wondering if anyone has a hypothesis for why this is happening and what we can do to avoid it. In an ideal scenario, we would expect node B to be capable of leaving the grid gracefully, completely transparent to node A. Even if that ideal is not possible, it would be much less worrying if Ignite threw an exception or otherwise failed fast; having it deadlock indefinitely is very concerning.
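To make the failure mode concrete, the ack handshake described in steps 3-5 can be modeled with plain java.util.concurrent primitives. Everything below is our own sketch, not Ignite API: a CountDownLatch stands in for the TOPIC_CONTINUOUS ack, and the two methods contrast the unbounded spin that hangs forever with the bounded wait we would like to see instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AckHandshake {
    // Stand-in for the ack message that node B would normally send.
    static final CountDownLatch ack = new CountDownLatch(1);

    // Roughly what Ignite does today: spin until the ack arrives.
    // If the sender node died first, this loop never exits.
    static void waitForAckUnbounded() throws InterruptedException {
        while (ack.getCount() > 0)
            Thread.sleep(10);
    }

    // What we would prefer: the same wait with a deadline, so a dead
    // node B surfaces as an exception instead of an infinite hang.
    static void waitForAckBounded(long timeoutMs) throws InterruptedException {
        if (!ack.await(timeoutMs, TimeUnit.MILLISECONDS))
            throw new IllegalStateException("lock ack never arrived (owner node gone?)");
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate node B being stopped before sending the ack:
        // nobody ever calls ack.countDown().
        try {
            waitForAckBounded(500);
        } catch (IllegalStateException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

Calling waitForAckUnbounded() in the same setup would hang forever, which mirrors the behavior we observe at GridCacheLockImpl.java#L400-L401.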
> We are more than willing to gather extra logs or run any tests that might help with the investigation. Any help is appreciated.
>
> Thank you!
