Look for the code generating the rebalancing status messages in the logs. Once you become familiar with that, the solution will be clear.
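As a rough starting point (a sketch, not a recommendation): on a newly started node you could hold the readiness signal until a rebalance-stopped event has been observed for a cache of interest. Rebalance events are disabled by default and must be enabled via `IgniteConfiguration#setIncludeEventTypes`; the method name, cache name, and timeout below are placeholders.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignite;
import org.apache.ignite.events.CacheRebalancingEvent;
import org.apache.ignite.events.Event;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public final class RebalanceAwait {
    /**
     * Blocks until an EVT_CACHE_REBALANCE_STOPPED event is observed locally
     * for the given cache, or until the timeout elapses. The node must have
     * been started with EventType.EVT_CACHE_REBALANCE_STOPPED included via
     * IgniteConfiguration#setIncludeEventTypes, since rebalance events are
     * disabled by default.
     */
    public static boolean awaitRebalanced(Ignite ignite, String cacheName,
        long timeout, TimeUnit unit) throws InterruptedException {
        CountDownLatch settled = new CountDownLatch(1);

        IgnitePredicate<Event> lsnr = evt -> {
            if (cacheName.equals(((CacheRebalancingEvent)evt).cacheName()))
                settled.countDown();

            return true; // keep the listener registered
        };

        ignite.events().localListen(lsnr, EventType.EVT_CACHE_REBALANCE_STOPPED);

        // Note: this is racy if rebalancing already finished before the
        // listener was registered; a real implementation would also inspect
        // current rebalance state instead of relying on the event alone.
        return settled.await(timeout, unit);
    }
}
```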
On Fri, Mar 6, 2026, 14:08 Felipe Kersting <[email protected]> wrote:

> Hi Jeremy,
>
> Thanks for the reply!
>
> We have full control over the rolling upgrade process. We roll only one
> pod at a time. A pod is only allowed to shut down after it has successfully
> left the Ignite grid. Likewise, a new pod is only marked as ready, allowing
> the rollout to proceed, once it has successfully joined the grid.
>
> During the bootstrap of a new pod, we simply call `Ignition.start(cfg)`
> and wait for it to complete. The rollout only continues after this call
> finishes successfully.
>
> When the service is started from scratch, we also have additional logic to
> ensure that we only activate the cluster
> (`igniteClient.cluster().state(ClusterState.ACTIVE)`) after all members
> have joined the grid. That said, I believe this is orthogonal to the
> current discussion, since during rolling upgrades the cluster is already in
> the `ACTIVE` state.
>
> During pod shutdown, we rely on `Ignition.stop(cancel=true)`. We invoke it
> synchronously and wait for it to complete before allowing the pod to be
> deleted.
>
> In addition, all of our caches are configured with backups. By ensuring
> that only one pod is deleted at a time, we try to guarantee that there is
> always a backup available to take over as the new primary. This seems to
> work in general, as we can verify that when backups are not configured, the
> rollout consistently results in loss of state.
>
> Please also note that, although we do observe transient
> ClusterTopologyException errors during the rollout, we do not actually lose
> cache data. Once the rollout settles, the data stored in the affected
> caches is always still available.
>
> Even though we do control the full rollout process, we do not explicitly
> wait for the topology to become "settled," as you suggested.
> Do you have any examples or guidance on which Ignite APIs we could use
> during pod startup or shutdown to determine when it is safe to proceed?
>
> Thank you!
> Felipe
>
> On Fri, Mar 6, 2026 at 13:17, Jeremy McMillan <[email protected]>
> wrote:
>
>> A) If there is never any partition loss, then we assume all of the data
>> is intact.
>> B) Topology changes are disruptive. These messages are a warning that you
>> are pushing your cluster's ability to maintain the topology and flirting
>> with partition loss.
>>
>> If you have decided to accept these kinds of warnings, you have left the
>> world where guarantees mean anything. Maybe you should slow down your
>> rolling restart. Try the operator pattern so that Kubernetes isn't taking
>> the next node out of the topology before the topology has settled from the
>> prior step. Maybe implement a thin client that executes a Kubernetes
>> operation while listening for remote Ignite events to confirm the
>> operation has succeeded to perform the rolling restart. Please share your
>> code!
>>
>> On Thu, Mar 5, 2026 at 10:20 AM Felipe Kersting <[email protected]>
>> wrote:
>>
>>> Hello Ignite devs,
>>>
>>> We are in the process of introducing Apache Ignite into our application
>>> (replacing another technology) and are currently testing our rollout
>>> strategy.
>>>
>>> During a rollout, Ignite server nodes are terminated and new nodes are
>>> started one after another (Kubernetes-style rolling update). As a result,
>>> nodes leave and join the cluster continuously. At the moment we are
>>> testing a pure in-memory deployment (no persistence / no baseline
>>> topology configured).
>>>
>>> While running these tests, we noticed that thick clients commonly hit
>>> `ClusterTopologyException` during the rollout, most often when
>>> interacting with caches (typically wrapped in `CacheException`).
>>> We have also seen other rollout-related issues (including the deadlock
>>> previously discussed in this thread), but this email focuses
>>> specifically on `ClusterTopologyException`.
>>>
>>> The documentation suggests that callers should "wait on the future and
>>> use retry logic":
>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions
>>>
>>> In our case, the future embedded in the exception is frequently `null`,
>>> so we implemented a retry layer that retries cache operations with
>>> backoff whenever `ClusterTopologyException` is thrown. This seems to
>>> keep the client stable during rollouts, though at the cost of extra
>>> latency.
>>>
>>> Our question is about correctness / idempotency: is it safe to blindly
>>> retry cache operations when `ClusterTopologyException` occurs?
>>>
>>> In particular, we are concerned about the following operations:
>>>
>>> * `IgniteCache::putAll`
>>> * `IgniteCache::clear`
>>> * `IgniteCache::removeAll`
>>> * `IgniteCache::forEach`
>>> * `IgniteCache::invoke`
>>> * `IgniteCache::invokeAll`
>>>
>>> For example:
>>>
>>> * If `ClusterTopologyException` is thrown from `IgniteCache::forEach`,
>>> is it guaranteed that the operation was not executed for any key, or can
>>> it be partially executed for a subset of keys?
>>> * Likewise for `invoke` / `invokeAll`: is it guaranteed that the
>>> `EntryProcessor` was not executed at all, or could it have been executed
>>> (fully or partially) before the exception was surfaced to the client?
>>>
>>> If partial execution is possible, then a blind retry could result in
>>> duplicate effects for an arbitrary subset of keys, which could be
>>> problematic depending on the operation semantics.
>>>
>>> Any guidance on the expected guarantees here (or best practices for
>>> designing a safe retry strategy in this scenario) would be greatly
>>> appreciated.
>>>
>>> Thank you,
>>> Felipe
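For reference, the retry layer described above could be sketched roughly as follows. All names and parameters here are illustrative, and the exception is matched by class name only so the sketch has no Ignite dependency. Note this does not answer the idempotency question raised in the thread: if an operation such as `invoke` can be partially applied before the exception surfaces, a blind retry like this may duplicate effects.

```java
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch of a retry layer: retry an operation with exponential
 * backoff while a topology-related exception is observed anywhere in the
 * cause chain.
 */
public final class TopologyRetry {
    public static <T> T withRetry(Callable<T> op, int maxAttempts,
        long initialBackoffMs) throws Exception {
        long backoff = initialBackoffMs;

        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            }
            catch (Exception e) {
                // Give up on the last attempt, or when the failure is
                // unrelated to topology changes.
                if (attempt >= maxAttempts || !isTopologyError(e))
                    throw e;

                Thread.sleep(backoff);
                backoff *= 2; // exponential backoff between attempts
            }
        }
    }

    /** True if ClusterTopologyException appears anywhere in the cause chain
     *  (it is often wrapped in CacheException, as noted in the thread). */
    public static boolean isTopologyError(Throwable e) {
        for (Throwable t = e; t != null; t = t.getCause())
            if ("ClusterTopologyException".equals(t.getClass().getSimpleName()))
                return true;

        return false;
    }
}
```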
