Jon, I like where you are headed with that; I'm just brainstorming what the end interface might look like (we might be getting a bit ahead of things talking about directories if we don't even have files implemented yet). What do folks think about pairing data_file_locations (fka data_file_directories) with three table-level tunables: the replication strategy ("spread", "tier"), the eviction strategy ("none", "lfu", "lru", etc.), and the writeback duration ("0s", "10s", "8d")? So, for your three examples:
data_file_locations:
  disk: {type: "filesystem", "path": "/var/lib/cassandra/data"}
  object: {type: "s3", "path": "s3://..."}

data_file_eviction_strategies:
  none: {type: "NONE"}
  lfu: {type: "LFU", "min_retention": "7d"}
  hot-cold: {type: "LRU", "min_retention": "60d"}

Then on tables, to achieve your three proposals:

WITH storage = {locations: ["disk", "object"], "replication": "tier"}
WITH storage = {locations: ["disk", "object"], "replication": "tier", "eviction": ["lfu"], "writeback": ["10s"]}
WITH storage = {locations: ["disk", "object"], "replication": "tier", "eviction": ["hot-cold"], "writeback": ["8d"]}

We definitely wouldn't want to implement all of the eviction strategies in the first cut - probably just the object file location (CEP 36 fwict) and the eviction strategy "none". Default eviction would be "none" if not specified (throw errors on full storage), and default writeback would be "0s".

I am thinking that the strategies should maybe live in cassandra.yaml and not in table properties, because they're usually a property of how the node is set up; tables can't magically give you more locations to store data in.

The nice thing about this is that our JBOD setup with multiple directories just becomes a replication strategy, the default stays the same (everything local), and we can add more eviction strategies as folks want them. One could even imagine crazily tuned setups like tmpfs -> local-ssd -> remote-ssd -> remote-object configurations >_<

-Joey

On Sat, Mar 8, 2025 at 9:33 AM Jon Haddad <j...@rustyrazorblade.com> wrote:

> Thanks Jordan and Joey for the additional info.
>
> One thing I'd like to clarify - what I'm mostly after is 100% of my data on the object store, with local disk acting as an LRU cache, although there's also a case for the mirror.
>
> What I see so far are three high-level ways of running this:
>
> 1. Mirror Mode
>
> This is 100% on both. When SSTables are written, they'd be written to the object store as well, and we'd block till it's fully written to the object store to ensure durability. We don't exceed the space of the disk. This is essentially a RAID 1.
>
> 2. LRU
>
> This behaves similarly to mirror mode, but we can drop local data to make room. If we need it from the object store we can fetch it, with a bit of extra latency. This is the mode I'm most interested in. I have customers today that would use this because it would cut their operational costs massively. This is like the LVM cache pool or the S3 cache Joey described. You should probably be able to configure how much of your local disk you want to keep free, or specify a minimum period to keep data local. There might be some other fine-grained control here.
>
> 3. Tiered Mode
>
> Data is written locally, but not immediately replicated to S3. This is the TWCS scenario: you're using a window of size X, you copy the data to the object store after Y, and keep it local for Z. This allows you to avoid pushing data up every time it's compacted, you keep all hot data local, and you tier it off. Maybe you use a 7 day TWCS window, copy after 8 days, and retain locally for 60. For TWCS to work well here, it should be updated to use a max sstable size for the final compaction.
>
> To me, all 3 of these should be possible, and each has valid use cases. I think it should also be applicable per table, because some stuff you want to mirror (my user table), some stuff you want to LRU (infrequently accessed nightly jobs), and some stuff you want to tier off (user engagement).
> What it really boils down to are these properties:
>
> * Do we upload to the object store when writing the SSTable?
> * Do we upload after a specific amount of time?
> * Do we free up space locally when we reach a threshold?
> * Do we delete local files after a specific time period?
>
> If this is per table, then it makes sense to me to follow the same pattern as compaction, giving us something like this:
>
> WITH storage = {"class": "Mirror"}
> WITH storage = {"class": "LRU", "min_time_to_retain": "7 days"}
> WITH storage = {"class": "Tiered", "copy_after": "8 days", "min_time_to_retain": "60 days"}
>
> Something like that, at least.
>
> Jon
>
> On Sat, Mar 8, 2025 at 5:39 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
>
>> Great discussion - I agree strongly with Jon's points; giving operators this option will make many operators' lives easier. Even if you still have to have 100% of the disk space to meet performance requirements, that's still much more efficient than running C* with just disks (as you need to leave headroom). Jon's points are spot on regarding replacement, backup, and analytics.
>>
>> Regarding Cheng and Jordan's concern around mount perf, I want to share the "why". In the test Chris and I ran, we used s3-mountpoint's built-in cache [1], which uses the local filesystem to cache remote calls. While all data was in the ephemeral cache and compaction wasn't running, performance was excellent for both writes and reads. The problem was that the cache did not work well with compaction - it caused cache evictions of actively read hot data at well below full disk usage. The architecture was fine - it was the implementation of the cache that was poor (and, talking to AWS S3 engineers about it, good cache eviction is a harder problem for them to solve generally and is relatively low on their priority list, at least right now) - to me this strongly supports the need for this CEP, so the project can decide how to make these tradeoffs.
>>
>> Allowing operators to choose to treat Cassandra as a cache just seems good, because caches are easier to run - you just have to manage cache filling and dumping, and monitor your cache hit ratio. Like Jon says, some people may want a 100% hit ratio, but others may be OK with lower ratios (or even a zero ratio) to save cost. If the proposed ChannelProxy implemented write-through caching with local disk (which it should probably do for availability anyway), would that alleviate some of the concerns? That would let operators choose to provision enough local disk to sit somewhere on the spectrum of:
>>
>> a) hold everything (full HA, lowest latency, still cheaper than the status quo and easier)
>> b) hold commitlogs and hot sstables (hot-cold storage, a middle option that has degraded p99 read latency for cold data)
>> c) hold commitlogs (just HA writes, high and variable latency on reads)
>>
>> I'll also note that when the cluster degrades from a) to b), in contrast to the status quo where you lose data and blow up, this would just get slower on the p99 read - and since replacement is so easy in comparison, recovering would be straightforward (move to machines with more disk and run the local disk equivalent of happycache load [2]).
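To make the write-through caching idea a bit more concrete, here is a minimal sketch in Java. The interface and class names are made up for illustration and are not the actual o.a.c.io.util.ChannelProxy API: an sstable component is pushed to the remote location before the write is acknowledged, and reads are served from local disk, re-hydrating from the remote copy if the local file was evicted.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Hypothetical names, for illustration only; not the real Cassandra API.
    interface RemoteLocation
    {
        void upload(Path localFile, String key) throws IOException;   // e.g. an S3 PUT
        InputStream download(String key) throws IOException;          // e.g. an S3 GET
    }

    final class WriteThroughStore
    {
        private final Path localDir;
        private final RemoteLocation remote;

        WriteThroughStore(Path localDir, RemoteLocation remote)
        {
            this.localDir = localDir;
            this.remote = remote;
        }

        // Called after an sstable component has been written locally: push it to the
        // remote location before acknowledging, so the remote copy is as durable as
        // the local one (the "write-through" part).
        void writeThrough(Path newComponent) throws IOException
        {
            remote.upload(newComponent, newComponent.getFileName().toString());
        }

        // Reads are served from local disk; if the local copy was evicted to free
        // space, re-hydrate it from the remote location first (the "cache" part).
        Path open(String component) throws IOException
        {
            Path local = localDir.resolve(component);
            if (!Files.exists(local))
            {
                try (InputStream in = remote.download(component))
                {
                    Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
                }
            }
            return local;
        }
    }

Under options b) or c) above, only how often the re-hydration path fires would change; the write path stays identical, which is what keeps availability intact.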
>> -Joey
>>
>> [1] https://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md#caching-configuration
>> [2] https://github.com/hashbrowncipher/happycache
>>
>> On Fri, Mar 7, 2025 at 4:01 PM Jordan West <jw...@apache.org> wrote:
>>
>>> I too initially felt we should just use mounts and was excited by e.g. Single Zone Express mounting. As Cheng mentioned, we tried it…and the results were disappointing (except for use cases that could sometimes tolerate seconds of p99 latency). That brought me around to needing an implementation we own that we can optimize properly, as others have discussed.
>>>
>>> Regarding what percent of data should be in the cold store, I would love to see an implementation that allows both what Jon is proposing (the full data set) and what the original proposal included (a partial data set). I think there are different reasons to use both.
>>>
>>> Jordan
>>>
>>> On Fri, Mar 7, 2025 at 11:02 Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>
>>>> Supporting a filesystem mount is perfectly reasonable. If you wanted to use that with the S3 mount, there's nothing that should prevent you from doing so, and the filesystem version is probably the default implementation that we'd want to ship with since, to your point, it doesn't require additional dependencies.
>>>>
>>>> Allowing support for object stores is just that - allowing it. It's just more functionality, more flexibility. I don't claim to know every object store or remote filesystem; there are going to be folks who want to do something custom I can't think of right now, and there's no reason to box ourselves in. The abstraction to allow it is a small one. If folks want to put SSTables on HDFS, they should be able to. Who am I to police them, especially if we can do it in a way that doesn't carry any risk, making it available as a separate plugin?
>>>>
>>>> My comment with regard to not wanting to treat the object store as tiered storage has everything to do with what I want to do with the data. I want 100% of it on the object store (with a copy of hot data on the local node) for multiple reasons, some of which (maybe all) were made by Jeff and Scott:
>>>>
>>>> * Analytics is easier, no transfer off C*
>>>> * Replace is dead simple, just pull the data off the mount when it boots up. Like using EBS, but way cheaper. (thanks Scott for some numbers on this)
>>>> * It makes backups easier, just copy the bucket.
>>>> * If you're using the object store for backups anyway, then I don't see why you wouldn't keep all your data there
>>>> * I hadn't even really thought about scale to zero before, but I love this too
>>>>
>>>> Some folks want to treat the object store as a second tier, implying to me that once the SSTables reach a certain age or aren't touched, they're uploaded to the object store. Supporting this use case shouldn't be that much different. Maybe you don't care about the most recent data, and you're OK with losing everything from the last few days, because you can reload from Kafka. You'd be treating C* as a temporary staging area for whatever purpose, and you only want to do analytics on the cold data. As I've expressed already, it's just a difference in the policy of when to upload. Yes, I'm being a bit hand-wavy about this part, but it's an email discussion. I don't have this use case today, but it's certainly valid.
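Since the mirror and tiered behaviours really differ only in that policy of when to upload, here is a tiny illustrative sketch of the knob, in Java 17 syntax. The names are hypothetical and not part of the CEP:

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical knob, for illustration only: "when does an sstable get copied to
    // the object store?" is the main difference between the mirror and tiered modes.
    enum UploadTrigger { ON_FLUSH, AFTER_AGE, NEVER }

    record UploadPolicy(UploadTrigger trigger, Duration copyAfter)
    {
        // Mirror mode: every sstable goes up as soon as it is written.
        static UploadPolicy mirror() { return new UploadPolicy(UploadTrigger.ON_FLUSH, Duration.ZERO); }

        // Tiered mode: e.g. copy after 8 days, as in the TWCS example above.
        static UploadPolicy tiered(Duration copyAfter) { return new UploadPolicy(UploadTrigger.AFTER_AGE, copyAfter); }

        boolean shouldUpload(Instant sstableCreatedAt, Instant now)
        {
            return switch (trigger)
            {
                case ON_FLUSH  -> true;
                case AFTER_AGE -> Duration.between(sstableCreatedAt, now).compareTo(copyAfter) >= 0;
                case NEVER     -> false;
            };
        }
    }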
>>>> Or maybe it works in conjunction with witness replicas / transient replication, allowing you to offload data from a node until there's a failure, in which case C* can grab it. I'm just throwing out ideas here.
>>>>
>>>> Feel free to treat the object store here as a "secondary location". It can be a NAS, an S3 mount, a custom FUSE filesystem, the S3 API, or whatever else people come up with.
>>>>
>>>> In the past, I've made the case that this functionality can be achieved with LVM cache pools [1][2], and even provided some benchmarks showing it can be used. I've also argued that we can do node replacements with rsync. While these things are both technically true, others have convinced me that having this functionality as first class in the database makes it easier for our users and thus better for the project. Should someone have to *just* understand all of LVM, or *just* understand the nuance of potential data loss due to rsync's default one-second timestamp resolution? I keep relearning that whenever I say "*just* do X", that X isn't as convenient or easy for other people as it is for me, and I need to relax a bit.
>>>>
>>>> A view of the world where someone *just* needs to know dozens of workarounds makes the database harder for non-experts to use. The bar for usability is constantly being raised, and whatever makes it better for a first-time user is better for the community.
>>>>
>>>> Anyways, that's where I'm at now. Easier = better, even if we're reinventing some of the wheel to do so. Sometimes you get a better wheel, too.
>>>>
>>>> Jon
>>>>
>>>> [1] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
>>>> [2] https://issues.apache.org/jira/browse/CASSANDRA-8460
>>>>
>>>> On Fri, Mar 7, 2025 at 10:08 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>
>>>>> Because an earlier reply hinted that mounting a bucket yields "terrible results". That moved the discussion, in my mind, practically to the place of "we are not going to do this", to which I explained that in this particular case I do not find the speed important, because the use cases you want to use it for do not have anything in common with what I want to use it for.
>>>>>
>>>>> Since then, in my eyes this was a binary "either / or". I was repeatedly trying to get an opinion about being able to mount it regardless, and, afaik, only you explicitly expressed an opinion that it is OK but that you are not a fan of it:
>>>>>
>>>>> "I personally can't see myself using something that treats an object store as cold storage where SSTables are moved (implying they weren't there before), and I've expressed my concerns with this, but other folks seem to want it and that's OK."
>>>>>
>>>>> So my assumption was that, you being OK with it aside, mounting is not viable, so it looks like we are forcing it.
>>>>>
>>>>> To be super honest, if we made custom storage providers / proxies possible and it was already in place, then my urge to do "something fast and functioning" (e.g. mounting a bucket) would not exist. I would not use mounted buckets if we had this already in place and configurable in such a way that we could say that everything except (e.g.) snapshots would be treated as it is now.
>>>>>
>>>>> But, I can see how this will take a very, very long time to implement.
>>>>> This is a complex CEP to tackle. I remember this topic being discussed in the past as well. I _think_ there were at least two occasions when this was already discussed, e.g. that it might be ported / retrofitted from what Mick was showing. Nothing happened. Maybe mounting a bucket is not perfect and doing it the other way is a more fitting solution, but as the saying goes, "perfect is the enemy of good".
>>>>>
>>>>> On Fri, Mar 7, 2025 at 6:32 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>>> If that's not your intent, then you should be more careful with your replies. When you write something like this:
>>>>>>
>>>>>> > While this might work, what I find tricky is that we are forcing this on users. Not everybody is interested in putting everything in a bucket and serving traffic from it. They just don't want to do that. Because reasons. They are just happy with what they have, it has worked fine for years, and so on. They just want to upload SSTables upon snapshotting and call it a day.
>>>>>>
>>>>>> > I don't think we should force our worldview on them if they are not interested in it.
>>>>>>
>>>>>> It comes off as *extremely* negative. You use the word "force" here multiple times.
>>>>>>
>>>>>> On Fri, Mar 7, 2025 at 9:18 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>
>>>>>>> I have explained multiple times (1) that I don't have anything against what is discussed here.
>>>>>>>
>>>>>>> Having questions about what that is going to look like does not mean I am dismissive.
>>>>>>>
>>>>>>> (1) https://lists.apache.org/thread/ofh2q52p92cr89wh2l3djsm5n9dmzzsg
>>>>>>>
>>>>>>> On Fri, Mar 7, 2025 at 5:44 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>
>>>>>>>> Nobody is saying you can't work with a mount, and this isn't a conversation about snapshots.
>>>>>>>>
>>>>>>>> Nobody is forcing users to use object storage either.
>>>>>>>>
>>>>>>>> You're making a ton of negative assumptions here about both the discussion and the people you're having it with. Try to be more open-minded.
>>>>>>>>
>>>>>>>> On Fri, Mar 7, 2025 at 2:28 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> The only way I see that working is that, if everything were in a bucket and you took a snapshot, these SSTables would be "copied" from the live data dir (living in a bucket) to the snapshots dir (living in a bucket). Basically, we would need to say "if you go to take a snapshot on this table, instead of hardlinking these SSTables, do a copy". But this "copying" would be internal to the bucket itself; we would not need to "upload" from the node's machine to S3.
>>>>>>>>>
>>>>>>>>> While this might work, what I find tricky is that we are forcing this on users. Not everybody is interested in putting everything in a bucket and serving traffic from it. They just don't want to do that. Because reasons. They are just happy with what they have, it has worked fine for years, and so on. They just want to upload SSTables upon snapshotting and call it a day.
>>>>>>>>>
>>>>>>>>> I don't think we should force our worldview on them if they are not interested in it.
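The hardlink-versus-copy point in the snapshot discussion is easy to see with plain java.nio: a snapshot today is a hard link, which is instant but only valid within a single local filesystem, so a bucket-backed location would have to fall back to a copy (ideally performed inside the bucket rather than through the node). A minimal sketch of that fallback, not Cassandra's actual snapshot code:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    final class SnapshotLink
    {
        // Snapshots today: a hard link, which is free and instant, but only valid
        // within one local filesystem. A bucket-backed location typically cannot
        // do this, so the fallback is copying the component.
        static void snapshot(Path liveComponent, Path snapshotComponent) throws IOException
        {
            Files.createDirectories(snapshotComponent.getParent());
            try
            {
                Files.createLink(snapshotComponent, liveComponent);   // hard link, local FS only
            }
            catch (UnsupportedOperationException | IOException e)
            {
                // Object storage (or a cross-device path): copy instead. Ideally this
                // copy would happen inside the bucket rather than via the node.
                Files.copy(liveComponent, snapshotComponent, StandardCopyOption.COPY_ATTRIBUTES);
            }
        }
    }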
>>>>>>>>>
>>>>>>>>> On Fri, Mar 7, 2025 at 11:02 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> BTW, snapshots are quite special because these are not "files", they are just hard links. They "materialize" as regular files once the underlying SSTables are compacted away. How are you going to hard link from local storage to object storage anyway? We will always need to "upload".
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 7, 2025 at 10:51 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Jon,
>>>>>>>>>>>
>>>>>>>>>>> all of the "big three" support mounting a bucket locally. That being said, I do not think it is reasonable to completely ditch the possibility of Cassandra working with a mount, e.g. just for uploading snapshots there.
>>>>>>>>>>>
>>>>>>>>>>> GCP:
>>>>>>>>>>>
>>>>>>>>>>> https://cloud.google.com/storage/docs/cloud-storage-fuse/quickstart-mount-bucket
>>>>>>>>>>>
>>>>>>>>>>> Azure (this one is quite sophisticated, lots of options):
>>>>>>>>>>>
>>>>>>>>>>> https://learn.microsoft.com/en-us/azure/storage/blobs/blobfuse2-how-to-deploy?tabs=RHEL
>>>>>>>>>>>
>>>>>>>>>>> S3, lots of options for how to mount it:
>>>>>>>>>>>
>>>>>>>>>>> https://bluexp.netapp.com/blog/amazon-s3-as-a-file-system
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 6, 2025 at 4:17 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Assuming everything else is identical, it might not matter for S3. However, not every object store has a filesystem mount.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding sprawling dependencies, we can always make the provider-specific libraries available as a separate download and put them on their own thread with a separate classpath. I think the in-JVM dtest framework does this already. Someone just started asking about IAM for login; it sounds like a similar problem.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 6, 2025 at 12:53 AM Benedict <bened...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think another way of saying what Stefan may be getting at is: what does a library give us that an appropriately configured mount dir doesn't?
>>>>>>>>>>>>>
>>>>>>>>>>>>> We don't want to treat S3 the same as local disk, but this can be achieved easily with config. Is there some other benefit of direct integration? Well-defined exceptions, if we need to distinguish cases, is one that maybe springs to mind, but perhaps there are others?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 6 Mar 2025, at 08:39, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> That is cool, but this still does not show / explain what it would look like when it comes to the dependencies needed for actually talking to storage like S3.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe I am missing something here, and please explain where I am mistaken, but if I understand it correctly, for talking to S3 we would need to use a library like this, right? (1) So that would be added among Cassandra's dependencies?
>>>>>>>>>>>>> Hence Cassandra starts to be biased towards S3? Why S3? Every time somebody comes up with support for a new remote storage, would that be added to the classpath as well? How are these dependencies going to play with each other and with Cassandra in general? Will all these storage provider libraries for arbitrary clouds even be compatible with Cassandra licence-wise?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am sorry I keep repeating these questions, but this is the part that I just don't get at all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We can indeed add an API for this, sure, why not. But for people who do not want to deal with this at all and are just OK with an FS mounted, why would we block them from doing that?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/pom.xml
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 5, 2025 at 3:28 PM Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's not an area where I can currently dedicate engineering effort. But if others are interested in contributing a feature like this, I'd see it as valuable for the project and would be happy to collaborate on design/architecture/goals.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jake mentioned 17 months ago a custom FileSystemProvider we could offer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> None of us at DataStax has gotten around to providing that, but to quickly throw something over the wall, this is it:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/io/storage/StorageProvider.java
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (with a few friend classes under o.a.c.io.util)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We then have a RemoteStorageProvider, private in another repo, that implements that and also provides the RemoteFileSystemProvider that Jake refers to.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hopefully that's a start to get people thinking about CEP-level details, while we get a cleaned-up abstraction of RemoteStorageProvider and friends to offer.
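For readers trying to picture the pluggability being referenced: one way to frame it is the JDK's own java.nio.file.spi.FileSystemProvider SPI, where a configured data location such as "s3://bucket/cassandra" only resolves if a provider for that scheme is on the classpath, while a "file://..." location needs nothing extra. The sketch below is illustrative wiring only; the class name is made up, and it is not the DataStax StorageProvider linked above.

    import java.net.URI;
    import java.nio.file.FileSystem;
    import java.nio.file.FileSystemNotFoundException;
    import java.nio.file.FileSystems;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Map;

    // Illustrative only: resolves a configured location (e.g. "file:///var/lib/cassandra/data"
    // or "s3://bucket/cassandra") to a java.nio.file.Path. The s3 scheme only works if a
    // FileSystemProvider for it is on the classpath (the pluggable dependency being debated
    // above); the local filesystem needs nothing extra.
    final class DataLocationResolver
    {
        static Path resolve(String location)
        {
            URI uri = URI.create(location);
            try
            {
                return Paths.get(uri);                        // provider already registered/open
            }
            catch (FileSystemNotFoundException e)
            {
                try
                {
                    FileSystem fs = FileSystems.newFileSystem(uri, Map.of());
                    return fs.provider().getPath(uri);
                }
                catch (Exception ex)
                {
                    throw new IllegalArgumentException("No FileSystemProvider available for scheme " + uri.getScheme(), ex);
                }
            }
        }
    }

Whether the project ships such providers, leaves them as separate plugins, or sticks with mounted filesystems is exactly the dependency and licensing question raised above.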