Just speaking up as a supporter for considering this change. From a
userland perspective, I've been reading up on CRaC, and I see this
attacking some of the same requirements that Graal and Quarkus are trying
to solve. This is a worthy direction to pursue.

The CqlSession will need to re-connect, and I think that's worth testing.
Topology information shouldn't be assumed, especially with something like
Token-Aware Routing. Some shortcuts could speed it up, but I can't think of
any right now. I like the idea of making it optional and putting it through
some scenarios.
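
To make one such scenario concrete, here is a rough sketch of the
suspend/resume idea (option 2 in the quoted proposal). Everything in it is
hypothetical: `SuspendableSession` is a toy stand-in, not the driver's
`DefaultSession`, and the `topologySource` supplier is an assumption standing
in for whatever node discovery the driver would actually run. The only point
it illustrates is that `resume()` looks the topology up again instead of
trusting the pre-checkpoint view.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical toy session; NOT the real DefaultSession from the driver.
final class SuspendableSession {
    // Assumption: called on every resume so topology is never taken for granted.
    private final Supplier<List<String>> topologySource;
    private final List<String> connectedNodes = new ArrayList<>();
    private boolean suspended = true;

    SuspendableSession(Supplier<List<String>> topologySource) {
        this.topologySource = topologySource;
        resume(); // the initial connect uses the same path as a restore
    }

    // Drop every connection so nothing is open when the checkpoint is taken.
    synchronized void suspend() {
        connectedNodes.clear();
        suspended = true;
    }

    // Reconnect from a fresh topology lookup, not the pre-checkpoint node list.
    synchronized void resume() {
        connectedNodes.clear();
        connectedNodes.addAll(topologySource.get());
        suspended = false;
    }

    synchronized List<String> connectedNodes() {
        return List.copyOf(connectedNodes);
    }

    synchronized boolean isSuspended() {
        return suspended;
    }
}
```

A test along these lines would connect, call `suspend()`, change the node
list behind the supplier, call `resume()`, and then check that only the new
nodes are connected. A framework or a CRaC resource could call the same two
methods around checkpoint and restore.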

Patrick

On Mon, Mar 10, 2025 at 8:03 AM Radim Vansa <rva...@azul.com> wrote:

> Hello Josh,
> thanks for reaching back; answers inline:
>
> On 10. 03. 25 13:03, Josh McKenzie wrote:
>
>
> From skimming the PR on the Spring side and the conversation there, it
> looks like the argument is to have this live inside the java driver for
> Cassandra instead of in the spring-boot lib which I can see the argument
> for.
>
>
> Yes; for us it does not really matter where the fix lives as long as it's
> available for the end users. Pushing it towards Cassandra has the advantage
> of providing the greatest fan-out to users, even those not consuming it
> through frameworks.
>
>
> If we distill this to speak to precisely the problem we're trying to
> address or improvement we're going for here, how would you phrase that?
> i.e. "Take application startup from Nms down to Mms"?
>
>
> Yes, optimizing startup time is the most common use-case for CRaC. It's
> rather hard to provide such general numbers: it should be order(s) of
> magnitude. If we speak about hello-world style Spring Boot application
> booting, CRaC improves the startup from seconds to tens of milliseconds.
> That shouldn't differ too much from the expected times for a small
> micro-service, improving latency in scale-from-zero situations. This is not
> limited to microservices, though; we've been experimenting with real
> applications consuming hundreds of GB of memory. In that case the
> application boot can be rather complex, loading and pre-processing data
> from DB etc. where the boot takes minutes or more. CRaC can restore such
> an instance in a few seconds.
>
>
> I ask because that's the "pro" we'll need to weigh against updating the
> driver's topology map of the cluster, resource handling and potential leaks
> on shutdown/startup, and the complexity of taking an implementation like
> this into the driver code. Nothing insurmountable of course, just worth
> weighing the two.
>
>
> Can you elaborate on other use cases where the nodes are forced down,
> and what risk that brings to the overall stability? Is there a
> difference between marking only a subset of nodes down and taking all of
> the nodes down? When we force-close the control connection (as the first
> step), is it possible to get a topology update at all and race on the
> cluster members?
>
> Thank you!
>
> Radim
>
>
>
> On Thu, Mar 6, 2025, at 3:34 PM, Radim Vansa wrote:
>
> Hi all,
>
> I would like to make applications using Cassandra Java Driver,
> particularly those built with Spring Boot, Quarkus or similar
> frameworks, work with the OpenJDK CRaC project [1]. I've already created a
> patch for Spring Boot [2] but the Spring folks think that these changes are
> too dependent on driver internals, and suggested contributing the support
> to Cassandra directly.
>
> The patch involves closing all connections before checkpoint, and
> re-establishing these after restore. I have implemented that by
> sending a `NodeStateEvent -> FORCED_DOWN` on the bus for all connected
> nodes. As a follow-up I could develop some way to inform the session
> about a new topology e.g. if the cluster addresses change.
>
> Before jumping into implementing a PR I would like to ask what you think
> is the best approach to do this. I can think of two ways:
>
> 1) Native CRaC support
>
> The driver would have a dependency on `org.crac:crac` [3]; this is a
> small (13kB) library that provides the interfaces and a dummy noop
> implementation if the target JVM does not support CRaC. Then
> `DefaultSession` would register a `org.crac.Resource` implementation
> that would handle the checkpoint. This has the advantage of providing
> the best fan-out into any project consuming the driver without any further
> work.
>
> 2) Exposing neutral methods
>
> To save frameworks from relying on internals, `DefaultSession` would
> expose `.suspend()` and `.resume()` methods that would implement the
> connection cut-off without importing any dependency. After upgrading to
> the latest release, frameworks could use these methods in a way that suits
> them. I wouldn't add those methods to the `CqlSession` interface (as
> that would be a breaking change) but only to `DefaultSession`.
>
> Would Cassandra accept either of these, to let people checkpoint
> (snapshot) their applications and restore them within tens of
> milliseconds? Naturally it is possible to close the session object
> completely and create a new one, but the ideal solution would require no
> application changes beyond a dependency upgrade.
>
> Btw. I am aware that there is an inherent race between possible topology
> change and shutdown of current nodes (and I am listening for hints that
> would let us prevent that), but it is reasonable to expect that users
> will checkpoint the application in a quiescent state. And if the
> topology update breaks the checkpoint, it is always possible to try it
> again.
>
> Thank you for your opinions and ideas!
>
> Radim Vansa
>
>
> [1] https://wiki.openjdk.org/display/crac
>
> [2] https://github.com/spring-projects/spring-boot/pull/44505
>
> [3] https://mvnrepository.com/artifact/org.crac/crac/1.5.0
>
>
>
>
