Hi all,

I would like to make applications that use the Cassandra Java Driver, particularly those built with Spring Boot, Quarkus or similar frameworks, work with the OpenJDK CRaC project [1]. I've already created a patch for Spring Boot [2], but the Spring folks think these changes depend too much on driver internals and suggested contributing the support to the Cassandra driver directly.

The patch closes all connections before the checkpoint and re-establishes them after restore. I have implemented that through sending a `NodeStateEvent -> FORCED_DOWN` on the bus for all connected nodes. As a follow-up, I could develop some way to inform the session about a new topology, e.g. if the cluster addresses change.

Before jumping into implementing a PR, I would like to ask what you think is the best approach. I can think of two ways:

1) Native CRaC support

The driver would have a dependency on `org.crac:crac` [3]; this is a small (13 kB) library that provides the interfaces and a dummy no-op implementation if the target JVM does not support CRaC. Then `DefaultSession` would register an `org.crac.Resource` implementation that would handle the checkpoint. This has the advantage of providing the best fan-out into any project consuming the driver without any further work.
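As a rough illustration of option 1, the sketch below shows how a session-owned resource could close connections before a checkpoint and reconnect after restore. `CracResource` and `CracContext` are local stand-ins that only mirror the shape of `org.crac.Resource` and `org.crac.Context` so the example compiles without the dependency; `SketchSession`, `FakeContext` and their methods are hypothetical and not driver API. With the real library, registration would go through `org.crac.Core.getGlobalContext().register(...)`.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in mirroring the shape of org.crac.Context (assumption, not the real type).
interface CracContext<R> {
    void register(R resource);
}

// Stand-in mirroring the shape of org.crac.Resource (assumption, not the real type).
interface CracResource {
    void beforeCheckpoint(CracContext<? extends CracResource> context) throws Exception;
    void afterRestore(CracContext<? extends CracResource> context) throws Exception;
}

// Hypothetical stand-in for the part of DefaultSession that owns node connections.
class SketchSession {
    final List<String> log = new ArrayList<>();
    private boolean connected = true;

    // Held in a field so the hook stays strongly reachable
    // (assumption: the real context may reference resources only weakly).
    private final CracResource cracHook = new CracResource() {
        @Override
        public void beforeCheckpoint(CracContext<? extends CracResource> context) {
            closeAllConnections();
        }

        @Override
        public void afterRestore(CracContext<? extends CracResource> context) {
            reconnect();
        }
    };

    SketchSession(CracContext<CracResource> context) {
        context.register(cracHook);
    }

    boolean isConnected() {
        return connected;
    }

    // Placeholder for forcing all connected nodes down.
    private void closeAllConnections() {
        connected = false;
        log.add("closed");
    }

    // Placeholder for re-establishing the connections.
    private void reconnect() {
        connected = true;
        log.add("reconnected");
    }
}

// Test double that plays the role of the JVM driving a checkpoint/restore cycle.
class FakeContext implements CracContext<CracResource> {
    private final List<CracResource> resources = new ArrayList<>();

    @Override
    public void register(CracResource resource) {
        resources.add(resource);
    }

    void simulateCheckpointRestore() {
        try {
            for (CracResource r : resources) {
                r.beforeCheckpoint(this);
            }
            for (CracResource r : resources) {
                r.afterRestore(this);
            }
        } catch (Exception e) {
            throw new IllegalStateException("checkpoint/restore failed", e);
        }
    }
}

class CracSketchDemo {
    public static void main(String[] args) {
        FakeContext context = new FakeContext();
        SketchSession session = new SketchSession(context);
        context.simulateCheckpointRestore();
        System.out.println(session.log + " connected=" + session.isConnected());
    }
}
```

The point of this shape is that the application never touches the hook: upgrading the driver would be enough for the session to take part in checkpoint and restore.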

2) Exposing neutral methods

To save frameworks from relying on internals, `DefaultSession` would expose `.suspend()` and `.resume()` methods that would implement the connection cut-off without importing any dependency. After upgrading to the latest release, frameworks could use these methods in a way that suits them. I wouldn't add those methods to the `CqlSession` interface (as that would be a breaking change) but only to `DefaultSession`.
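A minimal sketch of how a framework might consume option 2 follows. Everything here is hypothetical: `SketchCqlSession` and `SketchDefaultSession` stand in for `CqlSession` and `DefaultSession`, the bodies of `suspend()` and `resume()` are placeholders, and `FrameworkLifecycle` represents whatever checkpoint hook the framework already has. It also shows one consequence of keeping the methods off the interface: the framework has to check the concrete type.

```java
// Hypothetical stand-in for the public CqlSession interface (no suspend/resume here).
interface SketchCqlSession {
    String execute(String query);
}

// Hypothetical stand-in for DefaultSession carrying the two proposed methods.
class SketchDefaultSession implements SketchCqlSession {
    private boolean suspended;
    private int reconnects;

    // Placeholder for cutting all node connections; calling it twice is harmless.
    public synchronized void suspend() {
        suspended = true;
    }

    // Placeholder for re-establishing connections; a no-op unless suspended.
    public synchronized void resume() {
        if (suspended) {
            suspended = false;
            reconnects++;
        }
    }

    @Override
    public synchronized String execute(String query) {
        if (suspended) {
            throw new IllegalStateException("session is suspended");
        }
        return "ok: " + query;
    }

    synchronized int reconnectCount() {
        return reconnects;
    }
}

// Hypothetical framework-side adapter invoked around checkpoint and restore.
class FrameworkLifecycle {
    private final SketchCqlSession session;

    FrameworkLifecycle(SketchCqlSession session) {
        this.session = session;
    }

    void onBeforeCheckpoint() {
        // The methods are not on the interface, so the concrete type is checked.
        if (session instanceof SketchDefaultSession) {
            ((SketchDefaultSession) session).suspend();
        }
    }

    void onAfterRestore() {
        if (session instanceof SketchDefaultSession) {
            ((SketchDefaultSession) session).resume();
        }
    }
}

class SuspendResumeDemo {
    public static void main(String[] args) {
        SketchDefaultSession session = new SketchDefaultSession();
        FrameworkLifecycle lifecycle = new FrameworkLifecycle(session);
        lifecycle.onBeforeCheckpoint();
        lifecycle.onAfterRestore();
        System.out.println(session.execute("SELECT release_version FROM system.local"));
    }
}
```

Whether a suspended session should reject requests, queue them, or block is an open design question; the sketch rejects them only to keep the example small.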

Would Cassandra accept either of these, to let people checkpoint (snapshot) their applications and restore them within tens of milliseconds? Naturally, it is possible to close the session object completely and create a new one, but the ideal solution would require no application changes beyond a dependency upgrade.

By the way, I am aware that there is an inherent race between a possible topology change and the shutdown of current nodes (and I am open to hints that would let us prevent that), but it is reasonable to expect that users will checkpoint the application in a quiescent state. And if a topology update breaks the checkpoint, it is always possible to try again.

Thank you for your opinions and ideas!

Radim Vansa


[1] https://wiki.openjdk.org/display/crac

[2] https://github.com/spring-projects/spring-boot/pull/44505

[3] https://mvnrepository.com/artifact/org.crac/crac/1.5.0
