Sorry, my morning coffee isn't working yet. You aren't making permanent topology changes, so ignore rebalancing.
For rolling restarts, follow the discovery events:
https://www.gridgain.com/docs/gridgain8/latest/developers-guide/events/events#discovery-events

On Tue, Mar 10, 2026, 08:31 Jeremy McMillan <[email protected]> wrote:

> Look for the code generating the rebalancing status messages in the logs.
> Once you become familiar with that, the solution will be clear.
>
> On Fri, Mar 6, 2026, 14:08 Felipe Kersting <[email protected]> wrote:
>
>> Hi Jeremy,
>>
>> Thanks for the reply!
>>
>> We have full control over the rolling upgrade process. We roll only one
>> pod at a time. A pod is only allowed to shut down after it has
>> successfully left the Ignite grid. Likewise, a new pod is only marked as
>> ready, allowing the rollout to proceed, once it has successfully joined
>> the grid.
>>
>> During the bootstrap of a new pod, we simply call `Ignition.start(cfg)`
>> and wait for it to complete. The rollout only continues after this call
>> finishes successfully.
>>
>> When the service is started from scratch, we also have additional logic
>> to ensure that we only activate the cluster
>> (`igniteClient.cluster().state(ClusterState.ACTIVE)`) after all members
>> have joined the grid. That said, I believe this is orthogonal to the
>> current discussion, since during rolling upgrades the cluster is already
>> in the `ACTIVE` state.
>>
>> During pod shutdown, we rely on `Ignition.stop(cancel=true)`. We invoke
>> it synchronously and wait for it to complete before allowing the pod to
>> be deleted.
>>
>> In addition, all of our caches are configured with backups. By ensuring
>> that only one pod is deleted at a time, we try to guarantee that there
>> is always a backup available to take over as the new primary. This seems
>> to work in general, as we can verify that when backups are not
>> configured, the rollout consistently results in loss of state.
>>
>> Please also note that, although we do observe transient
>> ClusterTopologyException errors during the rollout, we do not actually
>> lose cache data. Once the rollout settles, the data stored in the
>> affected caches is always still available.
>>
>> Even though we do control the full rollout process, we do not explicitly
>> wait for the topology to become "settled," as you suggested. Do you have
>> any examples or guidance on which Ignite APIs we could use during pod
>> startup or shutdown to determine when it is safe to proceed?
>>
>> Thank you!
>> Felipe
>>
>> On Fri, Mar 6, 2026 at 13:17, Jeremy McMillan <[email protected]> wrote:
>>
>>> A) If there is never any partition loss, then we assume all of the data
>>> is intact.
>>> B) Topology changes are disruptive. These messages are a warning that
>>> you are pushing your cluster's ability to maintain the topology and
>>> flirting with partition loss.
>>>
>>> If you have decided to accept these kinds of warnings, you have left
>>> the world where guarantees mean anything. Maybe you should slow down
>>> your rolling restart. Try the operator pattern so that Kubernetes isn't
>>> taking the next node out of the topology before the topology has
>>> settled from the prior step. Maybe implement a thin client that
>>> executes a Kubernetes operation while listening for remote Ignite
>>> events to confirm the operation has succeeded to perform the rolling
>>> restart. Please share your code!
>>>
>>> On Thu, Mar 5, 2026 at 10:20 AM Felipe Kersting <[email protected]> wrote:
>>>
>>>> Hello Ignite devs,
>>>>
>>>> We are in the process of introducing Apache Ignite into our
>>>> application (replacing another technology) and are currently testing
>>>> our rollout strategy.
>>>>
>>>> During a rollout, Ignite server nodes are terminated and new nodes
>>>> are started one after another (Kubernetes-style rolling update). As a
>>>> result, nodes leave and join the cluster continuously.
>>>> At the moment we are testing a pure in-memory deployment (no
>>>> persistence / no baseline topology configured).
>>>>
>>>> While running these tests, we noticed that thick clients commonly hit
>>>> `ClusterTopologyException` during the rollout, most often when
>>>> interacting with caches (typically wrapped in `CacheException`). We
>>>> have also seen other rollout-related issues (including the deadlock
>>>> previously discussed in this thread), but this email focuses
>>>> specifically on `ClusterTopologyException`.
>>>>
>>>> The documentation suggests that callers should "wait on the future
>>>> and use retry logic":
>>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/handling-exceptions
>>>>
>>>> In our case, the future embedded in the exception is frequently
>>>> `null`, so we implemented a retry layer that retries cache operations
>>>> with backoff whenever `ClusterTopologyException` is thrown. This
>>>> seems to keep the client stable during rollouts, though at the cost
>>>> of extra latency.
>>>>
>>>> Our question is about correctness / idempotency: is it safe to
>>>> blindly retry cache operations when `ClusterTopologyException`
>>>> occurs?
>>>>
>>>> In particular, we are concerned about the following operations:
>>>>
>>>> * `IgniteCache::putAll`
>>>> * `IgniteCache::clear`
>>>> * `IgniteCache::removeAll`
>>>> * `IgniteCache::forEach`
>>>> * `IgniteCache::invoke`
>>>> * `IgniteCache::invokeAll`
>>>>
>>>> For example:
>>>>
>>>> * If `ClusterTopologyException` is thrown from `IgniteCache::forEach`,
>>>> is it guaranteed that the operation was not executed for any key, or
>>>> can it be partially executed for a subset of keys?
>>>> * Likewise for `invoke` / `invokeAll`: is it guaranteed that the
>>>> `EntryProcessor` was not executed at all, or could it have been
>>>> executed (fully or partially) before the exception was surfaced to
>>>> the client?
>>>>
>>>> If partial execution is possible, then a blind retry could result in
>>>> duplicate effects for an arbitrary subset of keys, which could be
>>>> problematic depending on the operation semantics.
>>>>
>>>> Any guidance on the expected guarantees here (or best practices for
>>>> designing a safe retry strategy in this scenario) would be greatly
>>>> appreciated.
>>>>
>>>> Thank you,
>>>> Felipe
>>>
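A minimal sketch of the retry-with-backoff layer Felipe describes. The class and method names (`RetryOnTopologyChange`, `call`) are illustrative, and the retriable-exception check is passed in as a predicate so the sketch compiles without Ignite on the classpath; in the real client the predicate would test for `ClusterTopologyException`, walking the cause chain because Ignite usually surfaces it wrapped in a `CacheException`. Note this wrapper does nothing to make non-idempotent operations (`invoke`, `invokeAll`, `forEach`) safe to retry; it should only wrap operations known to be idempotent.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Sketch: retry an operation with capped exponential backoff when a
 *  topology-change-style exception is detected anywhere in the cause chain. */
public class RetryOnTopologyChange {
    public static <T> T call(Callable<T> op,
                             Predicate<Throwable> isRetriable,
                             int maxAttempts,
                             long initialBackoffMs) throws Exception {
        long backoff = initialBackoffMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            }
            catch (Exception e) {
                // Walk the cause chain: the retriable exception is often
                // wrapped (e.g. ClusterTopologyException inside CacheException).
                boolean retriable = false;
                for (Throwable t = e; t != null; t = t.getCause()) {
                    if (isRetriable.test(t)) {
                        retriable = true;
                        break;
                    }
                }
                if (!retriable || attempt >= maxAttempts)
                    throw e;
                Thread.sleep(backoff);
                backoff = Math.min(backoff * 2, 5_000); // cap the backoff
            }
        }
    }
}
```

With Ignite on the classpath, a wrapped call would look like `RetryOnTopologyChange.call(() -> cache.putAll(batch), t -> t instanceof ClusterTopologyException, 5, 100)`.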

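The discovery-events suggestion at the top of the thread could look roughly like the sketch below: block each rollout step until the observed topology contains the expected number of server nodes. The class and method names (`TopologyGate`, `awaitServerCount`) are illustrative, the sketch assumes `ignite-core` on the classpath, and discovery events must be enabled on the node (they are not recorded by default), e.g. via `cfg.setIncludeEventTypes(EventType.EVT_NODE_JOINED, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED)`.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.ignite.Ignite;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

/** Sketch: wait until the cluster contains the expected number of servers. */
public class TopologyGate {
    public static void awaitServerCount(Ignite ignite, int expectedServers)
        throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);

        IgnitePredicate<DiscoveryEvent> lsnr = evt -> {
            // topologyNodes() is the topology snapshot this event belongs to.
            long servers = evt.topologyNodes().stream()
                .filter(n -> !n.isClient())
                .count();
            if (servers == expectedServers)
                latch.countDown();
            return true; // keep the listener registered
        };

        ignite.events().localListen(lsnr, EventType.EVT_NODE_JOINED,
            EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);

        // The topology may already match before any event fires.
        if (ignite.cluster().forServers().nodes().size() == expectedServers)
            latch.countDown();

        latch.await();
    }
}
```

A readiness probe could call `awaitServerCount(ignite, n)` after `Ignition.start(cfg)` returns, and a pre-stop hook could wait for `n - 1` after `Ignition.stop(...)` on the departing pod, so Kubernetes does not take the next node out before the topology has settled.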