Hello Felipe,

Thank you very much for your detailed bug report and investigation. Reports like this help us improve Ignite and make it more reliable. Please feel free to share any other issues you encounter - your feedback is very valuable for the community.
We have reviewed your findings and the code references you provided, and we agree that this looks like a real liveness issue in IgniteLock. We have created a JIRA ticket to track it: https://issues.apache.org/jira/browse/IGNITE-27962 We will continue investigating the root cause and possible fixes.

If possible, could you also share debug logs around the topology change and lock acquisition (from both the client and the server node that was stopped)? In particular, logs covering:
- the transaction commit,
- continuous query processing,
- node left / topology change events.
This may help us better understand the race condition and validate a fix.

As a temporary workaround, you may try using IgniteSemaphore(1, failoverSafe=true) instead of IgniteLock, if reentrancy is not required in your use case.

In addition, we are currently working on improved Rolling Upgrade functionality (IEP-132):
https://cwiki.apache.org/confluence/display/IGNITE/IEP-132+Rolling+Upgrade
This feature is under active development, and we plan to finalize it in upcoming releases. We expect it to improve stability and behavior during node restarts and cluster upgrades.

Thank you again for your contribution and detailed analysis.

--
Best regards,
Aleksandr Chesnokov
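P.S. In case it is useful, here is a rough sketch of the semaphore-based workaround mentioned above. It assumes a running Ignite node; the semaphore name and the critical section are illustrative only, and the snippet requires the Ignite dependency on the classpath:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteSemaphore;
import org.apache.ignite.Ignition;

public class SemaphoreLockWorkaround {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // A single-permit semaphore gives mutual exclusion.
            // failoverSafe = true: if the node holding the permit leaves
            // the topology, the permit is released rather than the
            // semaphore becoming broken, so other nodes are not blocked.
            IgniteSemaphore sem = ignite.semaphore(
                "myLock", // distributed semaphore name (illustrative)
                1,        // one permit -> behaves like a non-reentrant lock
                true,     // failoverSafe
                true);    // create if it does not yet exist

            sem.acquire();
            try {
                // critical section goes here; note that, unlike
                // IgniteLock, this is NOT reentrant.
            } finally {
                sem.release();
            }
        }
    }
}
```

Note that acquire()/release() must be paired in a try/finally, as above, so the permit is returned even if the critical section throws.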
