Hello devs! We are trying to adopt Ignite 2.17.0 in our Kubernetes application (to replace another technology we were using), but we have been facing some challenges. Our architecture is composed of a set of "backend pods," each being an Ignite node, that together form our Ignite cluster. Alongside those, we have a set of "frontend pods" that join the Ignite cluster as thick clients and serve customer requests. To handle these requests, the frontend pods extensively leverage Ignite cluster features (such as caches, distributed locks, affinity calls, services, etc.).
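For context, our frontend pods join the cluster with a configuration along these lines (a simplified sketch, not our exact deployment: the namespace and service name are illustrative placeholders):

```xml
<!-- Simplified sketch of a frontend ("thick client") node configuration.
     The namespace and serviceName values are illustrative. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Frontend pods join as clients, not as data-holding server nodes. -->
    <property name="clientMode" value="true"/>
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <property name="ipFinder">
                <!-- Discovery through the Kubernetes API, pointed at the
                     headless service fronting the backend pods. -->
                <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
                    <constructor-arg>
                        <bean class="org.apache.ignite.kubernetes.configuration.KubernetesConnectionConfiguration">
                            <property name="namespace" value="default"/>
                            <property name="serviceName" value="ignite-backend"/>
                        </bean>
                    </constructor-arg>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```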
When pods are stable, our system generally behaves fine. However, we are facing problems when we perform rolling upgrades of the Ignite cluster (i.e., our "backend pods"). In this scenario, pods (and hence their JVMs) are terminated and new pods are created, one after another, in a rolling fashion. We want to implement this in a way that is transparent to the frontend pods—i.e., we don't want the frontend pods to face errors or increased latency while the backend pods are being rolled. We have identified several issues, but I want to focus on only one of them in this thread, as it is the one we have investigated most deeply: a frontend pod faces a "deadlock" when trying to acquire a simple lock if a specific backend pod is being terminated at the same time.

Under normal behavior, we believe the lock acquisition flow works as follows:

1. Node A calls ignite.reentrantLock(lockName, /* failoverSafe */ true, /* fair */ false, /* create */ true).lock().
2. Node A commits a pessimistic transaction to acquire ownership of the lock ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L549 ). The transaction is successful.
3. Node A enters a busy wait, waiting for an "ack" message from another node ("node B"; presumably the node that "controls" the lock) indicating that the lock is acquired ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401 ).
4. Node B sends this message through GridIoManager::send ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/managers/communication/GridIoManager.java#L2011-L2022 ). It is a bit hard for me to pinpoint the exact message type responsible for acknowledging the lock acquisition, but from our logs it looks like it is a message on TOPIC_CONTINUOUS.
5. Node A receives the message from node B through the onUpdate handler ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L1094 ).
6. Once processed, node A stops busy-waiting and the program continues normally, with the lock successfully acquired.

However, if a rolling upgrade is happening, node B might be terminated in the middle of this flow, leading to a deadlock. In that case, we observe the following behavior:

1. Node A calls ignite.reentrantLock(lockName, /* failoverSafe */ true, /* fair */ false, /* create */ true).lock().
2. Node A commits a pessimistic transaction to acquire ownership of the lock ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L549 ). The transaction is successful.
3. Node A enters a busy wait, waiting for the "ack" message from node B ( https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401 ).
4. Node B is stopped (via Ignition.stop(igniteName, /* cancel */ true)) before it sends the message.
5. The message is never sent, so node A remains in the busy wait forever. A distributed deadlock occurs.

Through debugging, we can see that the TOPIC_CONTINUOUS message that usually unblocks node A is never received by node A. It is never emitted by node B, because node B was stopped (via Ignition.stop) before it could send it.
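To make the failure mode concrete: the busy wait in step 3 is, in effect, an unbounded wait for a one-shot signal from a remote peer. The stdlib-only sketch below is our mental model of the hang, not Ignite's actual code; it shows why a lost ack blocks the acquirer forever, while a bounded wait would at least fail fast:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LostAckSketch {
    public static void main(String[] args) throws InterruptedException {
        // The "ack" that node A waits for in step 3, modeled as a one-shot signal.
        CountDownLatch ack = new CountDownLatch(1);

        // Node B is stopped before sending the ack, so countDown() never runs.
        // An unbounded wait -- ack.await() -- would therefore hang forever,
        // which is the deadlock we observe.

        // A bounded wait instead fails fast: it returns false when the
        // signal does not arrive within the timeout.
        boolean acked = ack.await(100, TimeUnit.MILLISECONDS);
        System.out.println("ack received: " + acked); // prints "ack received: false"
    }
}
```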
Additional information:

- Our lock is always created with failoverSafe = true, but the issue occurs nonetheless.
- The same scenario may also happen when releasing the lock (i.e., in the lock.unlock() call).
- This issue causes not only the thread that acquired the lock to be stuck, but also all other threads in node A that try to acquire that same lock, leading to complete system degradation.
- Ignite never recovers from this scenario; it stays in the busy wait forever.
- We have tried both the IMMEDIATE and GRACEFUL ShutdownPolicy values; we saw no change in behavior.
- Our cluster is deployed on Kubernetes, using a headless service for communication and discovery.

We are continuing to study these issues to try to work around them, and we are wondering if anyone has a hypothesis for why this is happening and what we can do to avoid it. Ideally, node B would leave the grid gracefully, completely transparently to node A. Even if that is not possible, it would be much less worrying if Ignite threw an exception or otherwise failed fast—having it deadlock indefinitely is very concerning. We are more than willing to gather extra logs or run any tests that might help with the investigation. Any help is appreciated. Thank you!
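In the meantime, a mitigation we are evaluating is replacing the unbounded lock() call with tryLock plus a timeout. Since IgniteLock implements java.util.concurrent.locks.Lock, the wrapper below is written against the Lock interface; the helper name, timeout value, and exception type are ours, and a plain ReentrantLock stands in for the Ignite lock so the sketch is self-contained. We have not yet verified that the bounded wait is honored in the failure scenario described above.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class BoundedLocking {
    /**
     * Acquires the lock or fails fast. With Ignite this would be called as
     * acquireOrFail(ignite.reentrantLock(lockName, true, false, true), 30),
     * turning an indefinite hang into an exception we can handle and retry.
     */
    static void acquireOrFail(Lock lock, long timeoutSeconds) throws InterruptedException {
        if (!lock.tryLock(timeoutSeconds, TimeUnit.SECONDS))
            throw new IllegalStateException(
                "Could not acquire lock within " + timeoutSeconds + "s");
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the Ignite lock so the sketch is runnable locally.
        Lock lock = new ReentrantLock();

        acquireOrFail(lock, 1); // uncontended: acquires immediately
        try {
            System.out.println("acquired");
        } finally {
            lock.unlock();
        }
    }
}
```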
