If you ask specifically about how TTL snapshots are handled: there is a thread with a task scheduled every n seconds (I am not sure what the default is) which checks the "expired_at" field in the snapshot manifest to see whether the snapshot has expired. If it has, the snapshot is deleted like any other snapshot, and then the logic I have described above would be in place.
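Roughly, the shape of that check (just a sketch, the class names here are made up for illustration and are not the actual code):

    // Illustrative only: a periodic task that clears expired TTL snapshots.
    import java.time.Instant;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SnapshotTtlWatcher
    {
        // Hypothetical view of a snapshot manifest: name plus optional "expired_at".
        public record SnapshotManifest(String name, Instant expiredAt) {}

        public interface SnapshotStore
        {
            List<SnapshotManifest> listSnapshots();
            void clearSnapshot(String name); // same path as removing any other snapshot
        }

        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        public void start(SnapshotStore store, long periodSeconds)
        {
            scheduler.scheduleWithFixedDelay(() -> {
                Instant now = Instant.now();
                for (SnapshotManifest m : store.listSnapshots())
                {
                    // Only TTL snapshots carry an expiration; skip the rest.
                    if (m.expiredAt() != null && m.expiredAt().isBefore(now))
                        store.clearSnapshot(m.name());
                }
            }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        }
    }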
On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>
> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
>
>> I think we should evaluate the benefits of the feature you are proposing independently of how it might be used by Sidecar or other tools. As it is, it already sounds like useful functionality to have in the core of the Cassandra process.
>>
>> Tooling around Cassandra, including Sidecar, can then leverage this functionality to create snapshots, and then add additional capabilities on top to perform backups.
>>
>> I've added some comments inline below:
>>
>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
>> > Hi,
>> >
>> > I would like to run this through the ML to gather feedback as we are contemplating making this happen.
>> >
>> > Currently, snapshots are just hardlinks located in a snapshot directory pointing to the live data directory. That is super handy as it occupies virtually zero disk space etc. (as long as the underlying SSTables are not compacted away, then their size would "materialize").
>> >
>> > On the other hand, because it is a hardlink, it is not possible to make hard links across block devices (the infamous "Invalid cross-device link" error). That means that snapshots can only ever be located on the very same disk Cassandra has its data dirs on.
>> >
>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share) mounted to a Cassandra node and they would like to use that as cheap / cold storage for snapshots. They do not care about the speed of such storage nor about how much space it occupies when it comes to snapshots. On the other hand, they do not want snapshots occupying the disk space where Cassandra has its data because they consider that a waste of space. They would like to utilize the fast disk and its space for production data to the max, and snapshots might eat a lot of that space unnecessarily.
>> >
>> > There might be a configuration property like "snapshot_root_dir: /mnt/nfs/cassandra" and if a snapshot is taken, it would just copy the SSTables there, but we need to be a little bit smart here. (By default, it would all work as it does now - hard links to snapshot directories located under Cassandra's data_file_directories.)
>> >
>> > Because it is a copy, it occupies disk space. But if we took 100 snapshots of the same SSTables, we would not want to copy the same files 100 times. There is a very handy way to prevent this - unique SSTable identifiers (under the already existing uuid_sstable_identifiers_enabled property), so we could have a flat destination hierarchy where all SSTables would be located
>>
>> I have some questions around the flat destination hierarchy. For example, how do you keep track of TTLs for different snapshots? What if one snapshot doesn't have a TTL and the second does? Those details will need to be worked out. Of course, we can discuss these things during implementation of the feature.
>
> There would be a list of files a logical snapshot consists of in a snapshot manifest. We would keep track of which SSTables are in which snapshots.
>
> This is not tied to TTL; any two non-expiring snapshots could share the same SSTables.
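To make that a bit more concrete, the copy side could look roughly like this, assuming a flat snapshot_root_dir and unique (UUID-based) SSTable file names (just a sketch, the names are made up):

    // Illustrative only: deduplicated copy of snapshot SSTable components into a
    // flat destination directory shared by all user snapshots.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class FlatSnapshotCopier
    {
        /**
         * Copies each component into the flat destination unless an identically named
         * file is already there (a previous snapshot already copied it), and returns
         * the list of file names to record in this snapshot's manifest.
         */
        public static List<String> copySnapshot(List<Path> sstableComponents, Path snapshotRootDir) throws IOException
        {
            Files.createDirectories(snapshotRootDir);
            List<String> manifestEntries = new ArrayList<>();
            for (Path component : sstableComponents)
            {
                Path destination = snapshotRootDir.resolve(component.getFileName());
                if (!Files.exists(destination))
                    Files.copy(component, destination); // plain copy, so it may cross block devices
                manifestEntries.add(destination.getFileName().toString());
            }
            return manifestEntries;
        }
    }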
> If you go to remove one snapshot and you go to remove an SSTable, you need to check whether that particular SSTable is part of any other snapshot. If it is, then you cannot remove it while removing that snapshot, because that table is part of another one. If you removed it, you would make the other snapshot corrupt as it would be missing that SSTable.
>
> This logic is already implemented in Instaclustr Esop (1) (Esop as that Greek guy telling the fables (2)), the tooling we offer for backups and restores against various cloud providers. This stuff was already implemented and I feel confident it might be replicated here, but without a ton of baggage which comes from the fact that we need to accommodate specific clouds. I am not saying at all that the code from that tool would end up in Cassandra. No. What I am saying is that we have implemented that logic already, and in Cassandra it would be just way simpler.
>
> (1) https://github.com/instaclustr/esop
> (2) https://en.wikipedia.org/wiki/Aesop
>
>> > in the same directory and we would just check if such an SSTable is already there or not before copying it. Snapshot manifests (currently under manifest.json) would then contain all SSTables a logical snapshot consists of.
>> >
>> > This would be possible only for _user snapshots_. All snapshots taken by Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots against all system tables, ephemeral snapshots) would continue to be hard links and it would not be possible to locate them outside of the live data dirs.
>> >
>> > The advantages / characteristics of this approach for user snapshots:
>> >
>> > 1. Cassandra will be able to create snapshots located on different devices.
>> > 2. From an implementation perspective it would be totally transparent; there will be no specific code about "where" we copy. We would just copy, from a Java perspective, as we copy anywhere else.
>> > 3. All the tooling would work as it does now - nodetool listsnapshots / clearsnapshot / snapshot. Same outputs, same behavior.
>> > 4. No need to use external tools copying SSTables to the desired destination, custom scripts, manual synchronisation ...
>> > 5. Snapshots located outside of Cassandra's live data dirs would behave the same when it comes to snapshot TTL. (TTL on a snapshot means that after such and such a period of time, it is automatically removed.) This logic would be the same. Hence, there is no need to re-invent the wheel when it comes to removing expired snapshots from the operator's perspective.
>> > 6. Such a solution would deduplicate SSTables, so it would be as space-efficient as possible (though not as efficient as hardlinks, for the obvious reasons mentioned above).
>> >
>> > It seems to me that there has recently been a "push" to add more logic to Cassandra where it was previously delegated to external tooling; for example, the CEP around automatic repairs is basically doing what external tooling does, we just move it under Cassandra. We would love to get rid of a lot of tooling and custom-written logic around copying snapshot SSTables. From the implementation perspective it would be just plain Java, without any external dependencies etc. There seems to be a lot to gain for relatively straightforward additions to the snapshotting code.
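The removal logic I described above, where an SSTable in the flat directory is deleted only when no other snapshot's manifest still references it, could look roughly like this (again just a sketch, the names are made up):

    // Illustrative only: remove a snapshot but keep any SSTable file that is still
    // referenced by another snapshot's manifest.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class SharedSnapshotRemover
    {
        /**
         * @param manifests snapshot name -> file names recorded in that snapshot's manifest
         */
        public static void removeSnapshot(String snapshotToRemove,
                                          Map<String, List<String>> manifests,
                                          Path snapshotRootDir) throws IOException
        {
            List<String> candidates = manifests.get(snapshotToRemove);
            if (candidates == null)
                return;

            // Files referenced by any *other* snapshot must survive.
            Set<String> stillReferenced = new HashSet<>();
            manifests.forEach((name, files) -> {
                if (!name.equals(snapshotToRemove))
                    stillReferenced.addAll(files);
            });

            for (String file : candidates)
            {
                if (!stillReferenced.contains(file))
                    Files.deleteIfExists(snapshotRootDir.resolve(file));
            }

            // Finally drop the removed snapshot's own manifest entry.
            manifests.remove(snapshotToRemove);
        }
    }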
>>
>> Agree that there are things that need to move closer to the database process where it makes sense. Repair is an obvious one. This change seems beneficial as well, and for use cases that do not need to rely on this functionality the behavior would remain the same, so I see this as a win.
>>
>> >
>> > We did serious housekeeping in CASSANDRA-18111 where we consolidated and centralized everything related to snapshot management, so we feel comfortable building logic like this on top of that. In fact, CASSANDRA-18111 was a prerequisite for this because we did not want to base this work on the pre-18111 state of things when it comes to snapshots (it was all over the code base, with fragmented and duplicated logic etc).
>> >
>> > WDYT?
>> >
>> > Regards
>> >