In any case, it seems reasonable to me that we would support multiple FileSystemProviders. So perhaps these are really two problems that we may be conflating:

1) a mechanism for dropping jars that can register a FileSystemProvider for Cassandra to utilise
2) a way to mark directories (from any provider) as “remote storage” so that they can be treated in the appropriate manner
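To make the split between (1) and (2) concrete, here is a minimal sketch. It relies only on the standard JDK behaviour that `FileSystemProvider.installedProviders()` discovers providers on the class path through the service-loader mechanism. The class name `ProviderRegistrySketch` and the idea of configuring "remote" URI schemes are hypothetical, for illustration only; nothing like this exists in Cassandra today as far as this thread shows.

```java
import java.nio.file.Path;
import java.nio.file.spi.FileSystemProvider;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: not an existing Cassandra class.
public final class ProviderRegistrySketch {
    // Assumption: operators list the URI schemes (e.g. "s3") that count as remote storage.
    private final Set<String> remoteSchemes;

    public ProviderRegistrySketch(Set<String> remoteSchemes) {
        this.remoteSchemes = Set.copyOf(remoteSchemes);
    }

    // (1) Schemes of every provider the JVM found, including any registered by a
    // jar on the class path via META-INF/services/java.nio.file.spi.FileSystemProvider.
    public static Set<String> installedSchemes() {
        Set<String> schemes = new LinkedHashSet<>();
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            schemes.add(provider.getScheme());
        }
        return schemes;
    }

    // (2) A directory is treated as remote when its provider's scheme is configured as remote.
    public boolean isRemote(Path directory) {
        String scheme = directory.getFileSystem().provider().getScheme();
        return remoteSchemes.contains(scheme);
    }
}
```

Keying the "remote" marker on the scheme is only one option; a per-directory setting would also cover the case of a remote store exposed through a local mount, which the scheme alone cannot detect.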

I think the harder problems by far live in (2). Less because of performance, and more because of how we handle errors correctly. For instance, a failure to read from object storage may mean that the data is lost, or it may mean the service has been interrupted. This might mean a total or partial loss of any intersecting tokens, depending on how the cluster stripes its dependency on the object storage. But either way, we almost certainly don’t want to handle this like we do local disk errors.
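As a rough illustration of the distinction between "data is lost" and "service is interrupted", the sketch below classifies a read failure by exception type. It assumes a remote provider reports a missing object as `java.nio.file.NoSuchFileException` and every other failure as a plain `IOException`; that is an assumption about a hypothetical provider, not something any existing provider is known to guarantee. The names `RemoteReadFailureSketch` and `Kind` are made up for this example.

```java
import java.io.IOException;
import java.nio.file.NoSuchFileException;

// Hypothetical sketch: not an existing Cassandra class.
public final class RemoteReadFailureSketch {
    public enum Kind {
        DATA_MISSING,        // the object is gone; the data may need to be rebuilt from replicas
        SERVICE_INTERRUPTED  // the store is unreachable; the data may come back
    }

    // Assumption: the provider throws NoSuchFileException only when the object
    // truly does not exist, and some other IOException for everything else.
    public static Kind classify(IOException failure) {
        if (failure instanceof NoSuchFileException) {
            return Kind.DATA_MISSING;
        }
        return Kind.SERVICE_INTERRUPTED;
    }
}
```

A caller could then mark the affected token ranges unavailable on `SERVICE_INTERRUPTED`, instead of going through the local disk failure policy. Whether two categories are enough is exactly the kind of question a CEP would need to settle.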

On 6 Mar 2025, at 15:16, Jon Haddad <j...@rustyrazorblade.com> wrote:


Assuming everything else is identical, might not matter for S3. However, not every object store has a filesystem mount. 

Regarding sprawling dependencies, we can always make the provider-specific libraries available as a separate download and put them on their own thread with a separate class path. I think the in-JVM dtest framework does this already. Someone just started asking about IAM for login; it sounds like a similar problem.


On Thu, Mar 6, 2025 at 12:53 AM Benedict <bened...@apache.org> wrote:
I think another way of saying what Stefan may be getting at is what does a library give us that an appropriately configured mount dir doesn’t?

We don’t want to treat S3 the same as local disk, but this can be achieved easily with config. Is there some other benefit of direct integration? Well-defined exceptions, if we need to distinguish cases, is one that springs to mind, but perhaps there are others?


On 6 Mar 2025, at 08:39, Štefan Miklošovič <smikloso...@apache.org> wrote:


That is cool, but this still does not show or explain what it would look like when it comes to the dependencies needed for actually talking to storage like S3.

Maybe I am missing something here, and please correct me if I am mistaken, but if I understand correctly, for talking to S3 we would need to use a library like this, right? (1). So that would be added among Cassandra's dependencies? Hence Cassandra starts to be biased towards S3? Why S3? Every time somebody comes up with support for a new remote storage, would that be added to the classpath as well? How are these dependencies going to play with each other and with Cassandra in general? Will all these storage provider libraries for arbitrary clouds even be compatible with Cassandra licence-wise?

I am sorry I keep repeating these questions, but this is the part I just don't get at all.

We can indeed add an API for this, sure, why not. But for people who do not want to deal with this at all and are OK with a mounted FS, why would we block them from doing that?


On Wed, Mar 5, 2025 at 3:28 PM Mick Semb Wever <m...@apache.org> wrote:

It’s not an area where I can currently dedicate engineering effort. But if others are interested in contributing a feature like this, I’d see it as valuable for the project and would be happy to collaborate on design/architecture/goals.


Jake mentioned 17 months ago a custom FileSystemProvider we could offer.

None of us at DataStax has gotten around to providing that, but to quickly throw something over the wall this is it:
  (with a few friend classes under o.a.c.io.util)

We then have a RemoteStorageProvider, private in another repo, that implements that and also provides the RemoteFileSystemProvider that Jake refers to.

Hopefully that's a start to get people thinking about CEP-level details, while we get a cleaned-up abstraction of RemoteStorageProvider and friends to offer.
