On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
> I think we should evaluate the benefits of the feature you are proposing independently of how it might be used by Sidecar or other tools. As it is, it already sounds like useful functionality to have in the core of the Cassandra process.
>
> Tooling around Cassandra, including Sidecar, can then leverage this functionality to create snapshots, and then add additional capabilities on top to perform backups.
>
> I've added some comments inline below:
>
> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
> > Hi,
> >
> > I would like to run this through the ML to gather feedback as we are contemplating making this happen.
> >
> > Currently, snapshots are just hardlinks located in a snapshot directory pointing into the live data directory. That is super handy as it occupies virtually zero disk space (as long as the underlying SSTables are not compacted away; then their size would "materialize").
> >
> > On the other hand, because it is a hardlink, it is not possible to make hard links across block devices (the infamous "Invalid cross-device link" error). That means that snapshots can only ever be located on the very same disk Cassandra has its data dirs on.
> >
> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share) mounted to a Cassandra node and they would like to use that as cheap / cold storage for snapshots. They do not care about the speed of such storage, nor do they care about how much space snapshots occupy. On the other hand, they do not want snapshots occupying disk space where Cassandra has its data, because they consider that a waste of space. They would like to utilize the fast disk and its space for production data to the max, and snapshots might eat a lot of that space unnecessarily.
> >
> > There might be a configuration property like "snapshot_root_dir: /mnt/nfs/cassandra" and if a snapshot is taken, it would just copy SSTables there, but we need to be a little bit smart here. (By default, it would all work as it does now - hard links to snapshot directories located under Cassandra's data_file_directories.)
> >
> > Because it is a copy, it occupies disk space. But if we took 100 snapshots of the same SSTables, we would not want to copy the same files 100 times. There is a very handy way to prevent this - unique SSTable identifiers (under the already existing uuid_sstable_identifiers_enabled property) - so we could have a flat destination hierarchy where all SSTables would be located
>
> I have some questions around the flat destination hierarchy. For example, how do you keep track of TTLs for different snapshots? What if one snapshot doesn't have a TTL and the second does? Those details will need to be worked out. Of course, we can discuss these things during the implementation of the feature.

There would be a list of files a logical snapshot consists of in a snapshot manifest, so we would keep track of which SSTables belong to which snapshots. This is not tied to TTL; any two non-expiring snapshots could share the same SSTables. If you go to remove one snapshot and, as part of that, go to remove an SSTable, you need to check that this particular SSTable is not part of any other snapshot. If it is, you cannot remove it while removing that snapshot, because the file is still part of another one. If you removed it, you would corrupt the other snapshot, as it would be missing that SSTable.
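Purely as an illustration (class and field names here are hypothetical, not what an implementation would necessarily use), the removal check could be sketched in plain Java roughly like this:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    class SnapshotManifest
    {
        final String name;
        final List<Path> sstables; // files this logical snapshot consists of

        SnapshotManifest(String name, List<Path> sstables)
        {
            this.name = name;
            this.sstables = sstables;
        }
    }

    class SnapshotRemover
    {
        static void removeSnapshot(SnapshotManifest toRemove, List<SnapshotManifest> remaining) throws IOException
        {
            // collect every SSTable still referenced by some other snapshot's manifest
            Set<Path> stillReferenced = remaining.stream()
                                                 .flatMap(m -> m.sstables.stream())
                                                 .collect(Collectors.toSet());

            for (Path sstable : toRemove.sstables)
            {
                if (!stillReferenced.contains(sstable))
                    Files.deleteIfExists(sstable); // no other snapshot needs this file any more
                // otherwise keep the file; another snapshot's manifest still lists it
            }
        }
    }

The important bit is simply that an SSTable in the shared destination is only deleted once no remaining manifest references it.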
This logic is already implemented in Instaclustr Esop (1) (Esop, as in that Greek guy who told the fables (2)), the tooling we offer for backups and restores against various cloud providers. I feel confident it could be replicated here, but without the ton of baggage that comes from having to accommodate specific clouds. I am not saying at all that the code from that tool would end up in Cassandra. What I am saying is that we have already implemented that logic, and in Cassandra it would be way simpler.

(1) https://github.com/instaclustr/esop
(2) https://en.wikipedia.org/wiki/Aesop

> > in the same directory, and we would just check whether such an SSTable is already there before copying it. Snapshot manifests (currently under manifest.json) would then contain all SSTables a logical snapshot consists of.
> >
> > This would be possible only for _user snapshots_. All snapshots taken by Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots against all system tables, ephemeral snapshots) would continue to be hard links, and it would not be possible to locate them outside of live data dirs.
> >
> > The advantages / characteristics of this approach for user snapshots:
> >
> > 1. Cassandra will be able to create snapshots located on different devices.
> > 2. From an implementation perspective it would be totally transparent; there would be no specific code about "where" we copy. From the Java perspective, we would just copy as we copy anywhere else.
> > 3. All the tooling would work as it does now - nodetool listsnapshots / clearsnapshot / snapshot. Same outputs, same behavior.
> > 4. No need for external tools copying SSTables to the desired destination, custom scripts, manual synchronisation ...
> > 5. Snapshots located outside of Cassandra's live data dirs would behave the same when it comes to snapshot TTL (a TTL on a snapshot means that after a given period of time it is automatically removed). This logic would be the same. Hence, there is no need to reinvent the wheel when it comes to removing expired snapshots from the operator's perspective.
> > 6. Such a solution would deduplicate SSTables, so it would be as space-efficient as possible (though not as efficient as hardlinks, for the obvious reasons mentioned above).
> >
> > It seems to me that there has recently been a "push" to add more logic to Cassandra that was previously delegated to external tooling; for example, the CEP around automatic repairs is basically doing what external tooling does, we just move it under Cassandra. We would love to get rid of a lot of tooling and custom-written logic around copying snapshot SSTables. From the implementation perspective it would be just plain Java, without any external dependencies. There seems to be a lot to gain from relatively straightforward additions to the snapshotting code.
>
> Agree that there are things that need to move closer to the database process where it makes sense. Repair is an obvious one. This change seems beneficial as well, and for use cases that do not need to rely on this functionality the behavior would remain the same, so I see this as a win.
>
> > We did some serious housekeeping in CASSANDRA-18111, where we consolidated and centralized everything related to snapshot management, so we feel comfortable building logic like this on top of that.
> > In fact, CASSANDRA-18111 was a prerequisite for this, because we did not want to base this work on the pre-18111 state of things when it comes to snapshots (it was all over the code base, with fragmented and duplicated logic, etc.).
> >
> > WDYT?
> >
> > Regards
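P.S. In the same spirit, here is a rough sketch of the copy-time deduplication into a flat destination directory. Again, the class, method and property handling below are purely illustrative and assume uuid_sstable_identifiers_enabled, so that a file name uniquely identifies its content:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    class SnapshotCopier
    {
        // Copies each SSTable component of a new user snapshot into the snapshot root
        // (e.g. the hypothetical "snapshot_root_dir: /mnt/nfs/cassandra"), skipping
        // files that are already present, and returns the list to record in the manifest.
        static List<Path> copySnapshot(List<Path> sstableComponents, Path snapshotRootDir) throws IOException
        {
            Files.createDirectories(snapshotRootDir);
            List<Path> manifestEntries = new ArrayList<>();

            for (Path component : sstableComponents)
            {
                Path destination = snapshotRootDir.resolve(component.getFileName());
                if (!Files.exists(destination))
                    Files.copy(component, destination); // first snapshot referencing this SSTable pays the copy
                manifestEntries.add(destination);       // every snapshot still lists it in its manifest
            }
            return manifestEntries;
        }
    }

The first snapshot referencing an SSTable pays the cost of the copy; subsequent snapshots only record it in their manifests.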