Splitting this out from the CEP-36 thread.
I agree: dependency collisions at run-time are a problem. It's made even
worse by the possibility of users using multiple plugins (authn, authz,
compression, storage, etc.).
It also cuts two ways. E.g. the interfaces that plugin authenticators
need to implement are defined in org.apache.cassandra.auth, so as far as
I know the plugin has to take a build-time dependency on the main
Cassandra module itself, and pull in all of its dependencies. (I'd love
to be told that I'm mistaken.) In addition to the risk of version
conflicts, it increases the risk of a change to Cassandra's own
dependencies inadvertently breaking a plugin that's taken a transitive
dependency. Might be bad form on the plugin's part, but certainly possible.
I've gotten the impression that there's not a lot of enthusiasm for
breaking apart the main Cassandra module, but I have wondered if it'd be
worth making an exception for the interfaces plugins are supposed to
code against. It'd be nice to depend on those without pulling in the
rest of the project, and it'd be another step towards reducing the risk
of plugins breaking because of dependency changes in the main project.
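To make the idea concrete, here is a rough sketch of what coding against a narrow interface could look like. Every name in it (PluginApiSketch, Authenticator, StaticTokenAuthenticator, load) is hypothetical and made up for illustration; it is not the actual org.apache.cassandra.auth API, and it only assumes that the host instantiates the plugin from a configured class name.

```java
import java.util.Map;

public final class PluginApiSketch {
    /** Hypothetical stand-in for an interface shipped in a small, dependency-free API artifact. */
    public interface Authenticator {
        boolean authenticate(Map<String, String> credentials);
    }

    /** Hypothetical plugin: it compiles against the interface above and nothing else. */
    public static final class StaticTokenAuthenticator implements Authenticator {
        public StaticTokenAuthenticator() {}

        @Override
        public boolean authenticate(Map<String, String> credentials) {
            // Illustration only: compare against a hard-coded token.
            return "s3cret".equals(credentials.get("token"));
        }
    }

    /** Host side (assumed): instantiate a plugin from a configured class name. */
    public static Authenticator load(String className, ClassLoader loader) {
        try {
            Class<?> cls = Class.forName(className, true, loader);
            return cls.asSubclass(Authenticator.class)
                      .getDeclaredConstructor()
                      .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load plugin " + className, e);
        }
    }

    public static void main(String[] args) {
        Authenticator auth = load(StaticTokenAuthenticator.class.getName(),
                                  PluginApiSketch.class.getClassLoader());
        System.out.println(auth.authenticate(Map.of("token", "s3cret")));
    }
}
```

If the interface lived in its own artifact, the plugin's build would only need that artifact, and the host's other dependencies would never show up on the plugin's compile classpath.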
-- Joel.
On 3/6/2025 10:52 AM, Jon Haddad wrote:
Hey Joel, thanks for chiming in!
Regarding dependencies - while it's possible to provide pluggable
interfaces, the issue I'm concerned about is conflicting versions of
transitive dependencies at runtime. For example, I used a java agent
that had a different version of snakeyaml, and it ended up breaking
C*'s startup sequence [1]. I suggest putting external modules on
separate threads with their own classpath to avoid this issue.
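A minimal sketch of that kind of isolation, using only the JDK (Java 9+ for getPlatformClassLoader). The class and method names are hypothetical, and this is not how Cassandra or in-JVM dtest actually does it. One simplification to be aware of: with the platform loader as parent, the plugin cannot see the host's classes at all, so a real setup would need a parent that exposes just the shared plugin interfaces.

```java
import java.net.URL;
import java.net.URLClassLoader;

public final class IsolatedPluginLoader {
    /**
     * Builds a loader for the plugin's jars whose parent is the platform loader,
     * so classes on the application classpath (e.g. the host's snakeyaml) are not visible.
     */
    public static URLClassLoader forJars(URL... jars) {
        return new URLClassLoader(jars, ClassLoader.getPlatformClassLoader());
    }

    /** Returns true if the given loader can resolve the named class. */
    public static boolean canSee(ClassLoader loader, String className) {
        try {
            loader.loadClass(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Starts the work on a thread whose context class loader is the isolated one. */
    public static Thread startIsolated(String name, ClassLoader loader, Runnable work) {
        Thread t = new Thread(work, name);
        t.setContextClassLoader(loader);
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        try (URLClassLoader loader = forJars()) {
            // JDK classes resolve through the platform/bootstrap loaders.
            System.out.println(canSee(loader, "java.lang.String"));
            // Application classes are hidden from the isolated loader.
            System.out.println(canSee(loader, IsolatedPluginLoader.class.getName()));
            Thread t = startIsolated("plugin-worker", loader, () ->
                System.out.println(Thread.currentThread().getContextClassLoader() == loader));
            t.join();
        }
    }
}
```

Setting the context class loader only helps libraries that consult it; the real isolation comes from loading the plugin's classes through the separate loader.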
I think there's quite a bit of overlap between the two desires
expressed in this thread, even though they achieve very different
results. I personally can't see myself using something that treats an
object store as cold storage where SSTables are moved (implying they
weren't there before), and I've expressed my concerns with this, but
other folks seem to want it and that's OK. I feel very strongly that
treating local storage as a cache with the full dataset on object
store is a better approach, but ultimately different people have
different priorities. Either way, stuff is moved to object store at
some point, and pulled to the local disk on demand.
I am *firmly* of the position that this CEP should not exclude the
local-storage-as-cache option, and that the option should be accounted
for in the design.
Jon
[1] https://issues.apache.org/jira/browse/CASSANDRA-19663
On Thu, Mar 6, 2025 at 10:31 AM Joel Shepherd <sheph...@amazon.com> wrote:
On 3/6/2025 7:16 AM, Jon Haddad wrote:
Assuming everything else is identical, might not matter for S3.
However, not every object store has a filesystem mount.
Regarding sprawling dependencies, we can always make the provider
specific libraries available as a separate download and put them
on their own thread with a separate class path. I think in-JVM
dtest does this already. Someone just started asking about IAM
for login, it sounds like a similar problem.
That was me. :-) Cassandra's auth already has fairly well defined
interfaces and a plug-in mechanism, so it's easy to vend
alternative auth solutions without polluting the main project's
dependency graph, at build-time anyway. A similar approach could
be beneficial for CEP-36, particularly (IMO) for cold-storage
purposes. I suspect decoupling pluggable alternate channel proxies
for cold storage from configurable alternate channel proxies for
redirecting data locally to free up space, migrate to a different
storage device, etc., would make both easier. The CEP seems to be
trying to do both, but they smell like pretty different goals to me.
Thanks -- Joel.
On Thu, Mar 6, 2025 at 12:53 AM Benedict <bened...@apache.org> wrote:
I think another way of saying what Stefan may be getting at
is what does a library give us that an appropriately
configured mount dir doesn’t?
We don’t want to treat S3 the same as local disk, but this
can be achieved easily with config. Is there some other
benefit of direct integration? Well defined exceptions if we
need to distinguish cases is one that maybe springs to mind
but perhaps there are others?
On 6 Mar 2025, at 08:39, Štefan Miklošovič
<smikloso...@apache.org> wrote:
That is cool, but this still does not show / explain what it
would look like when it comes to the dependencies needed for
actually talking to storage like S3.
Maybe I am missing something here, and please correct me if I
am mistaken, but if I understand correctly, for talking
to S3 we would need to use a library like this, right? (1).
So that would be added among Cassandra's dependencies? Hence
Cassandra starts to be biased towards S3? Why S3? Every time
somebody comes up with a new remote storage support, that
would be added to classpath as well? How are these
dependencies going to play with each other and with
Cassandra in general? Will all these storage
provider libraries for arbitrary clouds be even compatible
with Cassandra licence-wise?
I am sorry I keep repeating these questions, but this is the part
that I just don't get at all.
We can indeed add an API for this, sure, why not. But
for people who do not want to deal with this at all and are
OK with a mounted FS, why would we block them from doing that?
(1)
https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/pom.xml
On Wed, Mar 5, 2025 at 3:28 PM Mick Semb Wever
<m...@apache.org> wrote:
It’s not an area where I can currently dedicate
engineering effort. But if others are interested in
contributing a feature like this, I’d see it as
valuable for the project and would be happy to
collaborate on design/architecture/goals.
Jake mentioned 17 months ago a custom FileSystemProvider
we could offer.
None of us at DataStax has gotten around to providing
that, but to quickly throw something over the wall this
is it:
https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/io/storage/StorageProvider.java
(with a few friend classes under o.a.c.io.util)
We then have a RemoteStorageProvider, private in another
repo, that implements that and also provides the
RemoteFileSystemProvider that Jake refers to.
Hopefully that's a start to get people thinking about
CEP-level details, while we get a cleaned-up abstraction of
RemoteStorageProvider and friends ready to offer.
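For anyone unfamiliar with the java.nio.file provider model that a custom FileSystemProvider would plug into, here is a small sketch. It assumes nothing about the StorageProvider class linked above; the class and method names are hypothetical, and the JDK's built-in zip provider merely stands in for a remote one. The point is only that code written against Path and Files does not care which provider backs the path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public final class ProviderAgnosticIo {
    /** Writes and re-reads a payload using only Path/Files, whatever provider backs the path. */
    public static String roundTrip(Path file, String payload) {
        try {
            Files.createDirectories(file.getParent());
            Files.write(file, payload.getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Same call, backed by the default (local disk) provider. */
    public static String viaLocalProvider(String payload) {
        try {
            Path dir = Files.createTempDirectory("provider-demo");
            return roundTrip(dir.resolve("local/data.db"), payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Same call, backed by the JDK zip provider as a stand-in for a remote provider. */
    public static String viaZipProvider(String payload) {
        try {
            Path dir = Files.createTempDirectory("provider-demo");
            URI uri = URI.create("jar:" + dir.resolve("archive.zip").toUri());
            try (FileSystem zipFs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
                return roundTrip(zipFs.getPath("/remote/data.db"), payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaLocalProvider("local bytes"));
        System.out.println(viaZipProvider("zip bytes"));
    }
}
```

A provider for an object store would be selected the same way, by URI scheme, though whether its exceptions and latency behave acceptably for SSTable access is a separate question.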