Hi all,
I would like to make applications that use the Cassandra Java Driver,
particularly those built with Spring Boot, Quarkus or similar
frameworks, work with the OpenJDK CRaC project [1]. I've already created a
patch for Spring Boot [2], but the Spring folks think that these changes are
too dependent on driver internals and suggested contributing the support to
Cassandra directly.
The patch involves closing all connections before checkpoint, and
re-establishing them after restore. I have implemented that through
sending a `NodeStateEvent -> FORCED_DOWN` on the bus for all connected
nodes. As a follow-up I could develop some way to inform the session
about a new topology, e.g. if the cluster addresses change.
Before jumping into implementing a PR I would like to ask what you think
is the best approach to do this. I can think of two ways:
1) Native CRaC support
The driver would have a dependency on `org.crac:crac` [3]; this is a
small (13kB) library that provides the interfaces and a dummy noop
implementation if the target JVM does not support CRaC. Then
`DefaultSession` would register a `org.crac.Resource` implementation
that would handle the checkpoint. This has the advantage of providing
the best fan-out into any project consuming the driver without any further work.
2) Exposing neutral methods
To save frameworks from relying on internals, `DefaultSession` would
expose `.suspend()` and `.resume()` methods that would implement the
connection cut-off without importing any dependency. After upgrading to
the latest release, frameworks could use these methods in a way that suits
them. I wouldn't add those methods to the `CqlSession` interface (as
that would be a breaking change) but only to `DefaultSession`.
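To make the two options easier to compare, here is a rough sketch of how
they could relate. Everything in it is a toy: `ToySession`,
`ToySessionResource` and `CheckpointResource` are hypothetical names, not
driver classes, and `CheckpointResource` is only a local stand-in that
mirrors the two callbacks of `org.crac.Resource` (the real callbacks also
take a `Context` argument) so that the sketch compiles without the
dependency.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local stand-in for org.crac.Resource (assumption: simplified, no Context argument).
interface CheckpointResource {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical toy session: tracks which known nodes currently have an open connection.
class ToySession {
    private final Map<String, Boolean> connected = new LinkedHashMap<>();

    ToySession(List<String> nodes) {
        for (String node : nodes) {
            connected.put(node, true);
        }
    }

    // Option 2: drop every connection but keep the session object itself alive.
    void suspend() {
        connected.replaceAll((node, open) -> false);
    }

    // Option 2: re-establish a connection to every known node.
    void resume() {
        connected.replaceAll((node, open) -> true);
    }

    long openConnections() {
        return connected.values().stream().filter(open -> open).count();
    }
}

// Option 1: a checkpoint resource that simply delegates to suspend/resume.
class ToySessionResource implements CheckpointResource {
    private final ToySession session;

    ToySessionResource(ToySession session) {
        this.session = session;
    }

    @Override
    public void beforeCheckpoint() {
        session.suspend();
    }

    @Override
    public void afterRestore() {
        session.resume();
    }
}
```

With the real library, the adapter would implement `org.crac.Resource` and
be registered through `Core.getGlobalContext().register(...)`. Under
option 2, a framework would instead call `suspend()` and `resume()` from
its own lifecycle hooks. Either way the connection handling itself would
be the same code.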
Would Cassandra accept either of these, to let people checkpoint
(snapshot) their applications and restore them within tens of
milliseconds? Naturally it is possible to close the session object
completely and create a new one, but the ideal solution would require no
application changes beyond a dependency upgrade.
Btw. I am aware that there is an inherent race between a possible topology
change and the shutdown of current nodes (and I would welcome hints that
would let us prevent that), but it is reasonable to expect that users
will checkpoint the application in a quiescent state. And if the
topology update breaks the checkpoint, it is always possible to try it
again.
Thank you for your opinions and ideas!
Radim Vansa
[1] https://wiki.openjdk.org/display/crac
[2] https://github.com/spring-projects/spring-boot/pull/44505
[3] https://mvnrepository.com/artifact/org.crac/crac/1.5.0