Forgot to mention: we also tried IgniteLock::tryLock (with a small
timeout) instead of IgniteLock::lock and observed no difference. Even
with tryLock, Ignite gets stuck forever in the same busy-wait snippet (
https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401
).
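For contrast, this is the timeout semantics we expected from tryLock. A plain java.util.concurrent lock honors the deadline even when the holder never releases; the JDK-only sketch below (no Ignite involved) illustrates the behavior we were hoping for:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockTimeoutDemo {

    // tryLock(timeout) on a j.u.c. ReentrantLock gives up and returns false
    // once the deadline passes. This is the behavior we expected from
    // IgniteLock::tryLock, instead of the unbounded busy wait we observed.
    public static boolean acquireWithTimeout(ReentrantLock lock, long millis)
            throws InterruptedException {
        return lock.tryLock(millis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);

        // A daemon thread grabs the lock and never releases it.
        Thread holder = new Thread(() -> {
            lock.lock();
            held.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException ignored) {
                // exits when the JVM shuts down
            }
        });
        holder.setDaemon(true);
        holder.start();
        held.await();

        // Fails fast after ~100 ms instead of blocking forever.
        System.out.println(acquireWithTimeout(lock, 100)); // prints "false"
    }
}
```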

On Sun, Feb 22, 2026 at 5:38 PM Felipe Kersting <
[email protected]> wrote:

> Hello devs!
>
> We are trying to adopt Ignite 2.17.0 in our Kubernetes application (to
> replace another technology we were using), but we have been facing some
> challenges. Our architecture is composed of a set of "backend pods," each
> being an Ignite node, that together form our Ignite cluster. Alongside
> those, we have a set of "frontend pods" that join the Ignite cluster as
> thick clients and serve customer requests. To handle these requests, the
> frontend pods extensively leverage Ignite cluster features (such as caches,
> distributed locks, affinity calls, services, etc.).
>
> When pods are stable, our system generally behaves fine. However, we are
> facing problems when we perform rolling upgrades in the Ignite cluster
> (i.e., our "backend pods"). In this scenario, pods (and hence the JVM) are
> terminated and new pods are created, one after another, in a rolling
> fashion. We want to implement this in a way that is transparent to the
> frontend pods—i.e., we don't want the frontend pods to face errors or
> increased latency while the backend pods are being rolled.
>
> We have identified several issues, but I want to focus on only one of them
> in this thread, as it is the one we have investigated most deeply. In this
> case, a frontend pod faces a "deadlock" when trying to acquire a simple
> lock if a specific backend pod is being terminated at the same time.
>
> Under normal behavior, we believe the lock acquisition flow works as
> follows:
>
> 1. Node A calls ignite.reentrantLock(lockName, /* failoverSafe */ true, /*
> fair */ false, /* create */ true).lock().
> 2. Node A commits a pessimistic transaction to acquire ownership of the
> lock (
> https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L549).
> The transaction is successful.
> 3. Node A enters a busy wait, waiting for an "ack" message from another
> node ("node B"; presumably the node that "controls" the lock) indicating
> that the lock is acquired (
> https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401
> ).
> 4. Node B sends this message through GridIoManager::send (
> https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/managers/communication/GridIoManager.java#L2011-L2022).
> It is a bit hard for me to pinpoint the exact message type responsible for
> acknowledging the lock acquisition, but from our logs it looks like it is a
> message on TOPIC_CONTINUOUS.
> 5. Node A receives the message from node B through the onUpdate handler (
> https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L1094
> ).
> 6. Once processed, node A stops busy-waiting and the program continues
> normally (with the lock successfully acquired).
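To make the failure mode concrete, here is a minimal JDK-only model of steps 3-6 (class and method names are ours, not Ignite's): the acquirer spins on a flag that only the ack callback sets, so a lost ack means an unbounded spin.

```java
public class AckWaitModel {

    // Set when the ack from "node B" arrives (the analogue of
    // GridCacheLockImpl's onUpdate handler in step 5).
    private volatile boolean acknowledged;

    public void onAck() {
        acknowledged = true;
    }

    // Steps 3 and 6: spin until the ack arrives. Nothing bounds this loop,
    // so if the ack is never sent (e.g., node B was stopped first), it
    // spins forever.
    public void awaitAck() {
        while (!acknowledged) {
            Thread.onSpinWait();
        }
    }
}
```

In the real flow, the ack apparently travels as a TOPIC_CONTINUOUS message; here onAck() simply stands in for that delivery.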
>
> However, if a rolling upgrade is happening, node B might be terminated in
> the middle of this flow, leading to a deadlock. In this case, we observe
> the following behavior:
>
> 1. Node A calls ignite.reentrantLock(lockName, /* failoverSafe */ true, /*
> fair */ false, /* create */ true).lock().
> 2. Node A commits a pessimistic transaction to acquire ownership of the
> lock (
> https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L549).
> The transaction is successful.
> 3. Node A enters a busy wait, waiting for the "ack" message from node B (
> https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401
> ).
> 4. Node B is stopped (via Ignition.stop(igniteName, /* cancel */ true))
> before it sends the message.
> 5. The message is never sent, so node A remains in the busy wait forever.
> A distributed deadlock occurs.
>
> In this scenario, we can see through debugging that the TOPIC_CONTINUOUS
> message that usually unblocks node A is never received by node A. This
> happens because it is never emitted by node B, since it was stopped before
> it could send it (via Ignition.stop).
>
> Additional information:
> - Our lock is always created with failoverSafe = true, but the issue
> occurs nonetheless;
> - This same scenario may also happen when unlocking the lock (i.e., in the
> lock.unlock() call);
> - This issue causes not only the thread that acquired the lock to be
> stuck, but also all other threads in node A that try to acquire that same
> lock, leading to complete system degradation;
> - Ignite never recovers from this scenario; it stays in the busy wait
> forever;
> - We have tried using ShutdownPolicy IMMEDIATE and GRACEFUL; we saw no
> change in behavior;
> - Our cluster is deployed on Kubernetes, using a headless service for
> communication and discovery.
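As a stopgap while we investigate, we are experimenting with bounding the blocking lock() call from the outside: run it on a disposable daemon worker and give up after a deadline. This is only damage control, not a fix (the abandoned worker likely stays stuck in the busy wait, and whether that loop honors interruption is an open question). The sketch below is written against the plain java.util.concurrent.locks.Lock interface, which IgniteLock extends:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Lock;

public class BoundedLock {

    // Runs the (potentially unbounded) lock() call on a throwaway daemon
    // worker so the caller can fail fast after the deadline instead of
    // hanging. On timeout, the worker may remain stuck inside the busy
    // wait; cancel(true) is best effort only.
    public static boolean lockWithDeadline(Lock lock, long millis)
            throws InterruptedException {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-lock-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        try {
            Future<Void> f = exec.submit(() -> {
                lock.lock();
                return null;
            });
            try {
                f.get(millis, TimeUnit.MILLISECONDS);
                return true; // the worker thread now owns the lock
            } catch (TimeoutException e) {
                f.cancel(true); // may be ignored by a non-interruptible wait
                return false;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        } finally {
            exec.shutdownNow();
        }
    }
}
```

One caveat with this pattern: lock ownership is per-thread, so on success the worker (not the caller) owns the lock, and the matching unlock() would have to run on that same worker. A real implementation would pin all lock/unlock calls for a given lock to one long-lived worker thread.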
>
> We will keep studying these issues and looking for ways to work around
> them.
> We are wondering if anyone has a hypothesis for why this is happening and
> what we can do to avoid it. Ideally, node B would be able to leave the
> grid gracefully, in a way that is completely transparent to node A. Even
> if that is not possible, it would be much less worrying if
> Ignite threw an exception or otherwise failed fast—having it deadlock
> indefinitely is very concerning.
>
> We are more than willing to gather extra logs or run any tests that might
> help with the investigation. Any help is appreciated.
>
> Thank you!
>
