Just speaking up as a supporter of considering this change. From a userland perspective, I've been reading up on CRaC, and I see it attacking some of the same problems that Graal and Quarkus are trying to solve. This is a worthy direction to pursue.
The CqlSession will need to re-connect, and I think that's worth testing. Topology information shouldn't be assumed, especially with something like Token-Aware Routing. Some shortcuts could speed it up, but I can't think of any right now.

I like the idea of making it optional and putting it through some scenarios.

Patrick

On Mon, Mar 10, 2025 at 8:03 AM Radim Vansa <rva...@azul.com> wrote:

> Hello Josh,
>
> thanks for reaching back; answers inline:
>
> On 10. 03. 25 13:03, Josh McKenzie wrote:
>
> > From skimming the PR on the Spring side and the conversation there,
> > it looks like the argument is to have this live inside the java
> > driver for Cassandra instead of in the spring-boot lib which I can
> > see the argument for.
>
> Yes; for us it does not really matter where the fix lives as long as
> it's available for the end users. Pushing it towards Cassandra has the
> advantage of providing the greatest fan-out to users, even those not
> consuming through frameworks.
>
> > If we distill this to speak to precisely the problem we're trying to
> > address or improvement we're going for here, how would you phrase
> > that? i.e. "Take application startup from Nms down to Mms"?
>
> Yes, optimizing startup time is the most common use-case for CRaC.
> It's rather hard to provide such general numbers: it should be
> order(s) of magnitude. If we speak about hello-world style Spring Boot
> application booting, CRaC improves the startup from seconds to tens of
> milliseconds. That shouldn't differ too much from the expected times
> for a small micro-service, improving latency in scale-from-zero
> situations. This is not limited to microservices, though; we've been
> experimenting with real applications consuming hundreds of GB of
> memory. In that case the application boot can be rather complex,
> loading and pre-processing data from DB etc. where the boot takes
> minutes or more. CRaC can restore such an instance in a few seconds.
>
> > I ask because that's the "pro" we'll need to weigh against updating
> > the driver's topology map of the cluster, resource handling and
> > potential leaks on shutdown/startup, and the complexity of taking an
> > implementation like this into the driver code. Nothing
> > insurmountable of course, just worth weighing the two.
>
> Can you elaborate about other use cases where the nodes are forced
> down, and what risk that brings to the overall stability? Is there a
> difference between marking only a subset of nodes down and taking all
> of the nodes down? When we force-close the control connection (as the
> first step), is it possible to get a topology update at all and race
> on the cluster members?
>
> Thank you!
>
> Radim
>
> > On Thu, Mar 6, 2025, at 3:34 PM, Radim Vansa wrote:
> >
> > > Hi all,
> > >
> > > I would like to make applications using Cassandra Java Driver,
> > > particularly those built with Spring Boot, Quarkus or similar
> > > frameworks, work with OpenJDK CRaC project [1]. I've already
> > > created a patch for Spring Boot [2] but Spring folks think that
> > > these changes are too dependent on driver internals, suggesting
> > > to contribute the support to Cassandra directly.
> > >
> > > The patch involves closing all connections before checkpoint, and
> > > re-establishing these after restore. I have implemented that
> > > through sending a `NodeStateEvent -> FORCED_DOWN` on the bus for
> > > all connected nodes. As a follow-up I could develop some way to
> > > inform the session about a new topology e.g. if the cluster
> > > addresses change.
> > >
> > > Before jumping onto implementing a PR I would like to ask what
> > > you think is the best approach to do this. I can think of two
> > > ways:
> > >
> > > 1) Native CRaC support
> > >
> > > The driver would have a dependency on `org.crac:crac` [3]; this
> > > is a small (13kB) library that provides the interfaces and a
> > > dummy noop implementation if the target JVM does not support
> > > CRaC. Then `DefaultSession` would register a `org.crac.Resource`
> > > implementation that would handle the checkpoint. This has the
> > > advantage of providing best fan-out into any project consuming
> > > the driver without any further work.
> > >
> > > 2) Exposing neutral methods
> > >
> > > To save frameworks from relying on internals, `DefaultSession`
> > > would expose `.suspend()` and `.resume()` methods that would
> > > implement the connection cut-off without importing any
> > > dependency. After upgrade to latest release, frameworks could use
> > > these methods in a way that suits them. I wouldn't add those
> > > methods to the `CqlSession` interface (as that would be a
> > > breaking change) but only to `DefaultSession`.
> > >
> > > Would Cassandra accept either of these, to let people checkpoint
> > > (snapshot) their applications and restore them within tens of
> > > milliseconds? Naturally it is possible to close the session
> > > object completely and create a new one, but the ideal solution
> > > would require no application changes beyond dependency upgrade.
> > >
> > > Btw. I am aware that there is an inherent race between possible
> > > topology change and shutdown of current nodes (and I am listening
> > > for hints that would let us prevent that), but it is reasonable
> > > to expect that users will checkpoint the application in a
> > > quiescent state. And if the topology update breaks the
> > > checkpoint, it is always possible to try it again.
> > >
> > > Thank you for your opinions and ideas!
> > >
> > > Radim Vansa
> > >
> > > [1] https://wiki.openjdk.org/display/crac
> > > [2] https://github.com/spring-projects/spring-boot/pull/44505
> > > [3] https://mvnrepository.com/artifact/org.crac/crac/1.5.0
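
For anyone trying to picture how the two options in Radim's original mail relate, here is a rough sketch. It is not driver code and not taken from the Spring Boot PR. `SketchSession`, `NodeConnection`, `CheckpointHook` and `SessionCheckpointHook` are all hypothetical names invented for illustration. `CheckpointHook` is a simplified local stand-in for `org.crac.Resource` (the real interface passes a context argument to its callbacks), declared here only so the example compiles without the `org.crac:crac` dependency. The idea shown is that option 1 could be a thin adapter over the neutral `suspend()`/`resume()` methods of option 2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.crac.Resource. The real callbacks take a
// Context argument; it is omitted here to keep the sketch dependency-free.
interface CheckpointHook {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical connection to one node; not a class from the driver.
final class NodeConnection {
    private final String address;
    private boolean open = true;

    NodeConnection(String address) { this.address = address; }
    String address() { return address; }
    boolean isOpen() { return open; }
    void close() { open = false; }
}

// Hypothetical session illustrating option 2: neutral suspend()/resume().
final class SketchSession {
    private final List<String> contactPoints;
    private final List<NodeConnection> connections = new ArrayList<>();
    private boolean suspended = false;

    SketchSession(List<String> contactPoints) {
        this.contactPoints = new ArrayList<>(contactPoints);
        connect();
    }

    // Opens one connection per known address. A real driver would also have
    // to re-discover topology here instead of trusting the old address list.
    private void connect() {
        for (String address : contactPoints) {
            connections.add(new NodeConnection(address));
        }
    }

    // Closes every connection so no sockets are open at checkpoint time.
    synchronized void suspend() {
        if (suspended) return;          // idempotent
        for (NodeConnection c : connections) c.close();
        connections.clear();
        suspended = true;
    }

    // Re-establishes connections after restore.
    synchronized void resume() {
        if (!suspended) return;         // idempotent
        connect();
        suspended = false;
    }

    synchronized int openConnections() {
        int n = 0;
        for (NodeConnection c : connections) if (c.isOpen()) n++;
        return n;
    }

    synchronized boolean isSuspended() { return suspended; }
}

// Option 1 expressed as a thin adapter over option 2.
final class SessionCheckpointHook implements CheckpointHook {
    private final SketchSession session;

    SessionCheckpointHook(SketchSession session) { this.session = session; }

    @Override public void beforeCheckpoint() { session.suspend(); }
    @Override public void afterRestore() { session.resume(); }
}

final class SuspendResumeSketch {
    public static void main(String[] args) {
        List<String> addresses = new ArrayList<>();
        addresses.add("10.0.0.1");
        addresses.add("10.0.0.2");
        SketchSession session = new SketchSession(addresses);
        SessionCheckpointHook hook = new SessionCheckpointHook(session);

        System.out.println(session.openConnections()); // 2
        hook.beforeCheckpoint();
        System.out.println(session.openConnections()); // 0
        hook.afterRestore();
        System.out.println(session.openConnections()); // 2
    }
}
```

The sketch reconnects to the same addresses on resume, which is exactly the assumption I'd want tested: after a restore the topology may have changed, so a real implementation would presumably need to refresh it rather than reuse what it knew at checkpoint time.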