I can only speak for myself, but the overhead of managing the accord submodule has been fairly low. It does mean opening two PRs when a change touches both projects (which is often the case for accord), but I think for utility classes this would be infrequent anyway.
I think any modules within Cassandra proper (including, e.g., APIs) should live in the same repository though, since they are already necessarily coupled.
On 17 Mar 2025, at 03:08, Dinesh Joshi <djo...@apache.org> wrote:
Definitely supportive of modularizing code but from a developer productivity standpoint we should discuss the overhead of managing changes across multiple repos. I want to break out at least one or two shared library projects. Both accord and in-jvm-dtest-api should share code with the Cassandra main project, particularly executors/futures/collections/concurrency utilities. This is something that has caused me some recurring friction over the past few years, so if there’s appetite I may try to pursue it in the near future.
I also like the idea of defining our public APIs in a separate jar/folder/source tree. This helpfully also solves the never-ending discussion topic of how we define what our public APIs are. I don’t have any cycles for this, but I doubt it would be controversial.
I am less sure about how we might go about breaking up the internals of Cassandra itself, but the accord project is perhaps a step in this direction.
That all said, plugin dependencies are a much easier problem than this. We don’t need to run the plugins on their own threads; they just need their own class loader - which is anyway probably a good idea. We can perhaps even reuse the logic we already have for loading UDFs, but relax some of the restrictions.
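To make the class-loader idea concrete, here is a minimal sketch of a child-first loader: the plugin's own jars are searched first, while JDK classes and a configurable list of shared API packages are always delegated to the parent. The names `PluginClassLoaders`, `ChildFirstClassLoader`, and `forPlugin` are hypothetical and chosen for illustration, as is the `org.apache.cassandra.auth.` prefix used as the shared package. This is not the existing UDF loading logic, only one plausible shape such isolation could take.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.List;

public final class PluginClassLoaders {

    /** Hypothetical loader: plugin jars first, except for JDK and shared API packages. */
    static final class ChildFirstClassLoader extends URLClassLoader {
        private final List<String> sharedPrefixes;

        ChildFirstClassLoader(URL[] urls, ClassLoader parent, List<String> sharedPrefixes) {
            super(urls, parent);
            this.sharedPrefixes = List.copyOf(sharedPrefixes);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    if (isShared(name)) {
                        // Shared types must come from the host so both sides agree on them.
                        c = super.loadClass(name, false);
                    } else {
                        try {
                            // Prefer the plugin's own copy of a dependency.
                            c = findClass(name);
                        } catch (ClassNotFoundException e) {
                            // Not in the plugin's jars: fall back to normal delegation.
                            c = super.loadClass(name, false);
                        }
                    }
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        private boolean isShared(String name) {
            if (name.startsWith("java.")) {
                return true;
            }
            for (String prefix : sharedPrefixes) {
                if (name.startsWith(prefix)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Builds an isolated loader over the given plugin jars. */
    public static URLClassLoader forPlugin(List<Path> jars, ClassLoader parent, List<String> sharedPrefixes) {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < urls.length; i++) {
            try {
                urls[i] = jars.get(i).toUri().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Bad plugin jar path: " + jars.get(i), e);
            }
        }
        return new ChildFirstClassLoader(urls, parent, sharedPrefixes);
    }

    public static void main(String[] args) throws Exception {
        // No plugin jars here, so every lookup falls through to the parent loader.
        try (URLClassLoader loader = forPlugin(List.of(),
                PluginClassLoaders.class.getClassLoader(),
                List.of("org.apache.cassandra.auth."))) {
            System.out.println(loader.loadClass("java.lang.String") == String.class);
        }
    }
}
```

With a real plugin jar on the list, a library such as a YAML parser bundled in that jar would resolve from the plugin's copy rather than the host's, which is the collision this approach is meant to avoid. The sketch only covers class loading; resource lookup would still be parent-first.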
I've gotten the impression that there's not a lot of enthusiasm for breaking apart the main Cassandra module, but I have wondered if it'd be worth making an exception for the interfaces plugins are supposed to code against.
Oh, there's plenty of enthusiasm. There's been a shortage of consensus however. For now. :D
I think breaking out the interfaces first makes a lot of sense as that'd allow us to focus almost purely on build dependency and environmental factors w/out having to reason through implementation code movements and encapsulation breakage. I believe there's folks working on exploring the current build system through the lens of requirements to break out shared deps; I'll see if I can't rustle them up.
On Thu, Mar 6, 2025, at 4:06 PM, Joel Shepherd wrote:
Splitting this out from the CEP-36 thread.
I agree: dependency collisions at run-time are a problem. It's made even worse by the possibility of users using multiple plugins (authn, authz, compression, storage, etc.).
It also cuts two ways. E.g. the interfaces that plugin authenticators need to implement are defined in org.apache.cassandra.auth, so as far as I know the plugin has to take a build-time dependency on the main Cassandra module itself, and pull in all of its dependencies. (I'd love to be told that I'm mistaken.) In addition to the risk of version conflicts, it increases the risk of a change to Cassandra's own dependencies inadvertently breaking a plugin that's taken a transitive dependency. Might be bad form on the plugin's part, but certainly possible.
I've gotten the impression that there's not a lot of enthusiasm for breaking apart the main Cassandra module, but I have wondered if it'd be worth making an exception for the interfaces plugins are supposed to code against. It'd be nice to depend on those without pulling in the rest of the project, and it'd be another step towards reducing the risk of plugins breaking because of dependency changes in the main project.
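As a rough illustration of what depending only on the interfaces could look like, the sketch below uses a made-up `AuthPlugin` interface standing in for whatever would live in a small API-only artifact; it is not the real `org.apache.cassandra.auth` contract. `AllowListPlugin` and `load` are likewise hypothetical. The point is that the plugin compiles against one interface, and the host instantiates it by class name without the plugin ever seeing the host's other dependencies.

```java
import java.util.Map;

public final class PluginApiSketch {

    /** Hypothetical: the only type a plugin compiles against (would ship in a small API jar). */
    public interface AuthPlugin {
        boolean authenticate(Map<String, String> credentials);
    }

    /** Hypothetical plugin implementation; in practice it would live in its own jar. */
    public static final class AllowListPlugin implements AuthPlugin {
        @Override
        public boolean authenticate(Map<String, String> credentials) {
            return "admin".equals(credentials.get("username"));
        }
    }

    /** Instantiates a plugin by class name, as a config-driven host might. */
    public static AuthPlugin load(String className, ClassLoader loader) throws ReflectiveOperationException {
        return Class.forName(className, true, loader)
                .asSubclass(AuthPlugin.class)   // fails fast if the class does not implement the API
                .getDeclaredConstructor()
                .newInstance();
    }

    public static void main(String[] args) throws Exception {
        AuthPlugin plugin = load(AllowListPlugin.class.getName(), PluginApiSketch.class.getClassLoader());
        System.out.println(plugin.authenticate(Map.of("username", "admin")));
    }
}
```

If the API types were published separately, a change to the main project's own dependencies would not reach the plugin's build at all, since nothing transitive would be pulled in.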
-- Joel.
On 3/6/2025 10:52 AM, Jon Haddad wrote:
Hey Joel, thanks for chiming in!
Regarding dependencies - while it's possible to provide pluggable interfaces, the issue I'm concerned about is conflicting versions of transitive dependencies at runtime. For example, I used a Java agent that had a different version of snakeyaml, and it ended up breaking C*'s startup sequence [1]. I suggest putting external modules on separate threads with their own classpath to avoid this issue.
I think there's quite a bit of overlap between the two desires expressed in this thread, even though they achieve very different results. I personally can't see myself using something that treats an object store as cold storage where SSTables are moved (implying they weren't there before), and I've expressed my concerns with this, but other folks seem to want it and that's OK. I feel very strongly that treating local storage as a cache with the full dataset on object store is a better approach, but ultimately different people have different priorities. Either way, stuff is moved to object store at some point, and pulled to the local disk on demand.
I am *firmly* of the position that this CEP should not exclude the local-storage-as-cache option, and that this option should be accounted for in the design.
Jon
On 3/6/2025 7:16 AM, Jon Haddad wrote:
Assuming everything else is identical, might not matter for S3. However, not every object store has a filesystem mount.
Regarding sprawling dependencies, we can always make the provider-specific libraries available as a separate download and put them on their own thread with a separate class path. I think the in-JVM dtest does this already. Someone just started asking about IAM for login; it sounds like a similar problem.
That was me. :-) Cassandra's auth already has fairly well defined interfaces and a plug-in mechanism, so it's easy to vend alternative auth solutions without polluting the main project's dependency graph, at build-time anyway. A similar approach could be beneficial for CEP-36, particularly (IMO) for cold-storage purposes. I suspect decoupling pluggable alternate channel proxies for cold storage from configurable alternate channel proxies for redirecting data locally to free up space, migrate to a different storage device, etc., would make both easier. The CEP seems to be trying to do both, but they smell like pretty different goals to me.
Thanks -- Joel.
I think another way of saying what Stefan may be getting at is: what does a library give us that an appropriately configured mount dir doesn’t?
We don’t want to treat S3 the same as local disk, but this can be achieved easily with config. Is there some other benefit of direct integration? Well defined exceptions if we need to distinguish cases is one that maybe springs to mind but perhaps there are others?
That is cool, but this still does not show or explain what it would look like when it comes to the dependencies needed for actually talking to storage like S3.
Maybe I am missing something here, and please explain if I am mistaken, but if I understand correctly, for talking to S3 we would need to use a library like this, right? (1) So that would be added among Cassandra's dependencies? Hence Cassandra starts to be biased towards S3? Why S3? Every time somebody comes up with support for a new remote storage, would that be added to the classpath as well? How are these dependencies going to play with each other and with Cassandra in general? Will all these storage provider libraries for arbitrary clouds even be compatible with Cassandra licence-wise?
I am sorry I keep repeating these questions, but this is the part I just don't get at all.
We can indeed add an API for this, sure sure, why not. But for people who do not want to deal with this at all and are OK with a mounted FS, why would we block them from doing that?
It’s not an area where I can currently dedicate engineering effort. But if others are interested in contributing a feature like this, I’d see it as valuable for the project and would be happy to collaborate on design/architecture/goals.
Jake mentioned 17 months ago a custom FileSystemProvider we could offer.
None of us at DataStax has gotten around to providing that, but to quickly throw something over the wall this is it:
(with a few friend classes under o.a.c.io.util)
We then have a RemoteStorageProvider, private in another repo, that implements that and also provides the RemoteFileSystemProvider that Jake refers to.
Hopefully that's a start to get people thinking about CEP level details, while we get a cleaned abstract of RemoteStorageProvider and friends to offer.
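For anyone unfamiliar with the JDK mechanism a custom FileSystemProvider would plug into, here is a small sketch using only standard `java.nio.file.spi` calls. It looks up an installed provider by URI scheme, which is how `Path.of(URI)` decides who handles a given path. The class name `ProviderLookup` and the method `forScheme` are made up for illustration; nothing here reflects the RemoteStorageProvider code mentioned above.

```java
import java.nio.file.spi.FileSystemProvider;
import java.util.Optional;

public final class ProviderLookup {

    /** Returns the installed provider that handles the given URI scheme, if any. */
    public static Optional<FileSystemProvider> forScheme(String scheme) {
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            if (provider.getScheme().equalsIgnoreCase(scheme)) {
                return Optional.of(provider);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The default "file" provider is always installed; others appear only when
        // a jar on the class path registers them as a FileSystemProvider service.
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            System.out.println(provider.getScheme() + " -> " + provider.getClass().getName());
        }
        System.out.println("s3 provider present: " + forScheme("s3").isPresent());
    }
}
```

A remote storage provider packaged this way would stay out of the main project's dependency graph: it is present only when an operator drops the provider jar in.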