Hello devs!

We are trying to adopt Ignite 2.17.0 in our Kubernetes application (to
replace another technology we were using), but we have been facing some
challenges. Our architecture is composed of a set of "backend pods," each
being an Ignite node, that together form our Ignite cluster. Alongside
those, we have a set of "frontend pods" that join the Ignite cluster as
thick clients and serve customer requests. To handle these requests, the
frontend pods extensively leverage Ignite cluster features (such as caches,
distributed locks, affinity calls, services, etc.).
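For context, a frontend pod joins the grid roughly as sketched below. This is a simplified illustration, not our exact configuration; the namespace and service name are placeholders, and the Kubernetes IP finder setup is abbreviated:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.kubernetes.configuration.KubernetesConnectionConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder;

public class FrontendNode {
    public static Ignite start() {
        // Discovery via the headless Kubernetes service (ignite-kubernetes module).
        KubernetesConnectionConfiguration k8sCfg = new KubernetesConnectionConfiguration();
        k8sCfg.setNamespace("our-namespace");      // placeholder
        k8sCfg.setServiceName("ignite-headless");  // placeholder

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true) // thick client: joins the topology but holds no data
            .setDiscoverySpi(new TcpDiscoverySpi()
                .setIpFinder(new TcpDiscoveryKubernetesIpFinder(k8sCfg)));

        return Ignition.start(cfg);
    }
}
```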

When pods are stable, our system generally behaves fine. However, we are
facing problems when we perform rolling upgrades in the Ignite cluster
(i.e., our "backend pods"). In this scenario, pods (and hence the JVM) are
terminated and new pods are created, one after another, in a rolling
fashion. We want to implement this in a way that is transparent to the
frontend pods—i.e., we don't want the frontend pods to face errors or
increased latency while the backend pods are being rolled.

We have identified several issues, but I want to focus on only one of them
in this thread, as it is the one we have investigated most deeply. In this
case, a frontend pod faces a "deadlock" when trying to acquire a simple
lock if a specific backend pod is being terminated at the same time.

Under normal behavior, we believe the lock acquisition flow works as
follows:

1. Node A calls ignite.reentrantLock(lockName, /* failoverSafe */ true, /*
fair */ false, /* create */ true).lock().
2. Node A commits a pessimistic transaction to acquire ownership of the
lock (
https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L549).
The transaction is successful.
3. Node A enters a busy wait, waiting for an "ack" message from another
node ("node B"; presumably the node that "controls" the lock) indicating
that the lock is acquired (
https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401
).
4. Node B sends this message through GridIoManager::send (
https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/managers/communication/GridIoManager.java#L2011-L2022).
It is a bit hard for me to pinpoint the exact message type responsible for
acknowledging the lock acquisition, but from our logs it looks like it is a
message on TOPIC_CONTINUOUS.
5. Node A receives the message from node B through the onUpdate handler (
https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L1094
).
6. Once processed, node A stops busy-waiting and the program continues
normally (with the lock successfully acquired).
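Concretely, the code on node A that triggers this flow is nothing more exotic than the following (the lock name and critical section are placeholders):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteLock;

public class LockExample {
    static void doWithLock(Ignite ignite) {
        IgniteLock lock = ignite.reentrantLock(
            "myLock",                // placeholder lock name
            /* failoverSafe */ true,
            /* fair */ false,
            /* create */ true);

        lock.lock(); // step 2 commits the tx; step 3 busy-waits for the ack
        try {
            // critical section
        }
        finally {
            lock.unlock(); // the unlock path can exhibit the same busy-wait
        }
    }
}
```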

However, if a rolling upgrade is happening, node B might be terminated in
the middle of this flow, leading to a deadlock. In this case, we observe
the following behavior:

1. Node A calls ignite.reentrantLock(lockName, /* failoverSafe */ true, /*
fair */ false, /* create */ true).lock().
2. Node A commits a pessimistic transaction to acquire ownership of the
lock (
https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L549).
The transaction is successful.
3. Node A enters a busy wait, waiting for the "ack" message from node B (
https://github.com/apache/ignite/blob/d53d4540b64bb4a25f1a5a1003c4338727628c9f/modules/core/src/main/java/org/apache/ignite/internal/processors/datastructures/GridCacheLockImpl.java#L400-L401
).
4. Node B is stopped (via Ignition.stop(igniteName, /* cancel */ true))
before it sends the message.
5. The message is never sent, so node A remains in the busy wait forever. A
distributed deadlock occurs.

In this scenario, we can see through debugging that the TOPIC_CONTINUOUS
message that usually unblocks node A is never received by node A. This
happens because it is never emitted by node B, since it was stopped before
it could send it (via Ignition.stop).
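For completeness, the stop on node B is a plain Ignition.stop with cancellation, invoked from our pod-termination handling; roughly:

```java
import org.apache.ignite.Ignition;

public class BackendShutdownHook {
    // Called from our pod-termination handler (e.g., on SIGTERM / preStop).
    static void shutdown(String igniteInstanceName) {
        // cancel = true: do not wait for in-progress jobs to complete.
        Ignition.stop(igniteInstanceName, /* cancel */ true);
    }
}
```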

Additional information:
- Our lock is always created with failoverSafe = true, but the issue occurs
nonetheless;
- This same scenario may also happen when unlocking the lock (i.e., in the
lock.unlock() call);
- This issue causes not only the thread that acquired the lock to be stuck,
but also all other threads in node A that try to acquire that same lock,
leading to complete system degradation;
- Ignite never recovers from this scenario; it stays in the busy wait
forever;
- We have tried using ShutdownPolicy IMMEDIATE and GRACEFUL; we saw no
change in behavior;
- Our cluster is deployed on Kubernetes, using a headless service for
communication and discovery.
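As a stop-gap, we are considering bounding the wait on our side using the standard java.util.concurrent.locks.Lock timeout API that IgniteLock implements. This does not address the root cause, and we have not yet verified that tryLock honors its timeout in this failure mode (it may enter the same busy wait), but if it does, it would at least fail fast instead of hanging. The timeout value below is arbitrary:

```java
import java.util.concurrent.TimeUnit;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteLock;

public class FailFastLocking {
    static boolean runGuarded(Ignite ignite, Runnable task) throws InterruptedException {
        IgniteLock lock = ignite.reentrantLock("myLock", true, false, true);

        // Bound the wait instead of blocking indefinitely in lock().
        if (!lock.tryLock(30, TimeUnit.SECONDS))
            return false; // surface the failure to the caller instead of deadlocking

        try {
            task.run();
        }
        finally {
            lock.unlock();
        }
        return true;
    }
}
```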

We are continuing to study these issues and to look for workarounds.
We are wondering if anyone has a hypothesis for why this is happening and
what we can do to avoid it. In an ideal scenario, we would expect node B to
be capable of leaving the grid gracefully, completely transparent to node
A. Even if that ideal is not possible, it would be much less worrying if
Ignite threw an exception or otherwise failed fast—having it deadlock
indefinitely is very concerning.

We are more than willing to gather extra logs or run any tests that might
help with the investigation. Any help is appreciated.

Thank you!