Look for the code that generates the rebalancing status messages in the logs.
Once you are familiar with that code path, the solution should become clear.
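As a starting point, one way to wait for the topology to settle is to listen for cache rebalancing events. Below is a minimal sketch, not a definitive implementation: it assumes rebalance events have been enabled on the node via `IgniteConfiguration.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED)`, and the `awaitRebalance` helper itself is illustrative, not part of the Ignite API. It cannot detect rebalances that have not started yet, so in practice you would combine it with your own rollout bookkeeping.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignite;
import org.apache.ignite.events.CacheRebalancingEvent;
import org.apache.ignite.events.EventType;

public final class RebalanceAwait {
    /**
     * Blocks until a rebalance-stopped event is observed for the given
     * cache on the local node, or until the timeout elapses.
     *
     * Requires EVT_CACHE_REBALANCE_STOPPED to be included in the node's
     * configured event types; otherwise the listener never fires.
     *
     * @return true if the event arrived within the timeout.
     */
    public static boolean awaitRebalance(Ignite ignite, String cacheName,
            long timeoutMs) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);

        ignite.events().localListen(evt -> {
            CacheRebalancingEvent e = (CacheRebalancingEvent) evt;

            // Only react to the cache we are waiting on.
            if (cacheName.equals(e.cacheName()))
                latch.countDown();

            return true; // keep the listener registered
        }, EventType.EVT_CACHE_REBALANCE_STOPPED);

        return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

A rollout controller could call this after each node join/leave before allowing Kubernetes to proceed to the next pod.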

On Fri, Mar 6, 2026, 14:08 Felipe Kersting <[email protected]> wrote:

> Hi Jeremy,
>
> Thanks for the reply!
>
> We have full control over the rolling upgrade process. We roll only one
> pod at a time. A pod is only allowed to shut down after it has successfully
> left the Ignite grid. Likewise, a new pod is only marked as ready, allowing
> the rollout to proceed, once it has successfully joined the grid.
>
> During the bootstrap of a new pod, we simply call `Ignition.start(cfg)`
> and wait for it to complete. The rollout only continues after this call
> finishes successfully.
>
> When the service is started from scratch, we also have additional logic to
> ensure that we only activate the cluster
> (`igniteClient.cluster().state(ClusterState.ACTIVE)`) after all members
> have joined the grid. That said, I believe this is orthogonal to the
> current discussion, since during rolling upgrades the cluster is already in
> the `ACTIVE` state.
>
> During pod shutdown, we rely on `Ignition.stop(cancel=true)`. We invoke it
> synchronously and wait for it to complete before allowing the pod to be
> deleted.
>
> In addition, all of our caches are configured with backups. By ensuring
> that only one pod is deleted at a time, we try to guarantee that there is
> always a backup available to take over as the new primary. This seems to
> work in general, as we can verify that when backups are not configured, the
> rollout consistently results in loss of state.
>
> Please also note that, although we do observe transient
> ClusterTopologyException errors during the rollout, we do not actually lose
> cache data. Once the rollout settles, the data stored in the affected
> caches is always still available.
>
> Even though we do control the full rollout process, we do not explicitly
> wait for the topology to become "settled," as you suggested. Do you have
> any examples or guidance on which Ignite APIs we could use during pod
> startup or shutdown to determine when it is safe to proceed?
>
> Thank you!
> Felipe
>
> On Fri, Mar 6, 2026 at 1:17 PM, Jeremy McMillan <[email protected]>
> wrote:
>
>> A) If there is never any partition loss, then we assume all of the data
>> is intact.
>> B) Topology changes are disruptive. These messages are a warning that you
>> are pushing your cluster's ability to maintain the topology and flirting
>> with partition loss.
>>
>> If you have decided to accept these kinds of warnings, you have left the
>> world where guarantees mean anything. Maybe you should slow down your
>> rolling restart. Try the operator pattern so that Kubernetes isn't taking
>> the next node out of the topology before the topology has settled from the
>> prior step. Maybe implement a thin client that performs the rolling
>> restart itself: execute each Kubernetes operation while listening for
>> remote Ignite events to confirm the previous step has succeeded before
>> moving on. Please share your code!
>>
>> On Thu, Mar 5, 2026 at 10:20 AM Felipe Kersting <[email protected]>
>> wrote:
>>
>>> Hello Ignite devs,
>>>
>>> We are in the process of introducing Apache Ignite into our application
>>> (replacing another technology) and are currently testing our rollout
>>> strategy.
>>>
>>> During a rollout, Ignite server nodes are terminated and new nodes are
>>> started one after another (Kubernetes-style rolling update). As a result,
>>> nodes leave and join the cluster continuously. At the moment we are testing
>>> a pure in-memory deployment (no persistence / no baseline topology
>>> configured).
>>>
>>> While running these tests, we noticed that thick clients commonly hit
>>> `ClusterTopologyException` during the rollout—most often when interacting
>>> with caches (typically wrapped in `CacheException`). We have also seen
>>> other rollout-related issues (including the deadlock previously discussed
>>> in this thread), but this email focuses specifically on
>>> `ClusterTopologyException`.
>>>
>>> The documentation suggests that callers should "wait on the future and
>>> use retry logic":
>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions
>>>
>>> In our case, the future embedded in the exception is frequently `null`,
>>> so we implemented a retry layer that retries cache operations with backoff
>>> whenever `ClusterTopologyException` is thrown. This seems to keep the
>>> client stable during rollouts, though at the cost of extra latency.
>>>
>>> Our question is about correctness / idempotency: is it safe to blindly
>>> retry cache operations when `ClusterTopologyException` occurs?
>>>
>>> In particular, we are concerned about the following operations:
>>>
>>> * `IgniteCache::putAll`
>>> * `IgniteCache::clear`
>>> * `IgniteCache::removeAll`
>>> * `IgniteCache::forEach`
>>> * `IgniteCache::invoke`
>>> * `IgniteCache::invokeAll`
>>>
>>> For example:
>>>
>>> * If `ClusterTopologyException` is thrown from `IgniteCache::forEach`,
>>> is it guaranteed that the operation was not executed for any key, or can it
>>> be partially executed for a subset of keys?
>>> * Likewise for `invoke` / `invokeAll`: is it guaranteed that the
>>> `EntryProcessor` was not executed at all, or could it have been executed
>>> (fully or partially) before the exception was surfaced to the client?
>>>
>>> If partial execution is possible, then a blind retry could result in
>>> duplicate effects for an arbitrary subset of keys, which could be
>>> problematic depending on the operation semantics.
>>>
>>> Any guidance on the expected guarantees here (or best practices for
>>> designing a safe retry strategy in this scenario) would be greatly
>>> appreciated.
>>>
>>> Thank you,
>>> Felipe
>>>
>>
