On Wed, Jan 22, 2025 at 2:21 AM James Berragan <jberra...@gmail.com> wrote:
> I think this is an idea worth exploring, my guess is that even if the scope is confined to just "copy if not exists" it would still largely be used as a cloud-agnostic backup/restore solution, and so will be shaped accordingly.
>
> Some thoughts:
>
> - I think it would be worth exploring more what the directory structure looks like. You mention a flat directory hierarchy, but it seems to me it would need to be delimited by node (or token range) in some way as the SSTable identifier will not be unique across the cluster. If we do need to delimit by node, is the configuration burden then on the user to mount individual drives to S3/Azure/wherever to unique per node paths? What do they do in the event of a host replacement, backup to a new empty directory?

Identifiers will be unique with "uuid_sstable_identifiers_enabled: true", even across the cluster. If we also worked with the old identifiers, those are indeed not unique (not even across different tables on the same node). I am not completely sure how far we want to go with this; I don't have a problem saying that we support this feature only with "uuid_sstable_identifiers_enabled: true". Supporting the older SSTable identifier naming as well would complicate it further.

Esop's directory structure of a remote destination is here:
https://github.com/instaclustr/esop?tab=readme-ov-file#directory-structure-of-a-remote-destination
and the content of the snapshot's manifest is shown just below it.

We may go with a hierarchical structure as well if that is evaluated to be a better approach; I just find a flat hierarchy simpler. We cannot have a flat hierarchy with old / non-unique identifiers, so we would need a way to differentiate one SSTable from another, which naturally leads to placing them in a keyspace/table/sstable hierarchy. However, I do not want to complicate it further by supporting flat and non-flat hierarchies simultaneously (where a user could pick whichever they want). We should go with just one solution.

When it comes to node replacement, I think it would simply be up to the operator to rename the whole directory to reflect the new path for that particular node. Imagine an operator has a bucket in Azure which is empty (/) and it is mounted to /mnt/nfs/cassandra on every node. Then on node 1, Cassandra would automatically start to put SSTables into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-1 and node 2 would put them into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-2. The "cluster-name/dc-name/node-id" part would be appended automatically by Cassandra itself to /mnt/nfs/cassandra, under which the bucket is mounted. If you replaced the node, the data would stay; only the node's ID would change. In that case, all that would be necessary is to rename the "node-id-1" directory to "node-id-3" (node-id-3 being the host ID of the replacement node). The snapshot manifest does not know anything about the host ID, so the content of the manifest would not need to be changed. If you do not rename the node ID directory, then snapshots would indeed be made under a new host ID directory, which would be empty at first.

> - The challenge often with restore is restoring from snapshots created before a cluster topology change (node replacements, token moves, cluster expansions/shrinks etc). This could be solved by storing the snapshot token information in the manifest somewhere. Ideally the user shouldn't have to scan token information across all SSTables snapshot-wide to determine which ones to restore.
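(Just to make the per-node layout I described above a bit more concrete before answering the next point, here is a rough sketch in Java; the method and parameter names are made up for illustration and are not an existing Cassandra API.)

    import java.nio.file.Path;
    import java.util.UUID;

    // Sketch only: compose the per-node snapshot destination under the
    // configured root, e.g. /mnt/nfs/cassandra/<cluster>/<dc>/<host id>.
    // On a host replacement the operator would rename the <host id> directory.
    static Path perNodeSnapshotDir(Path snapshotRootDir, String clusterName,
                                   String dcName, UUID hostId)
    {
        return snapshotRootDir.resolve(clusterName)
                              .resolve(dcName)
                              .resolve(hostId.toString());
    }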
Yes, see the content of the snapshot manifest as I mentioned already (a couple of lines below the example of the directory hierarchy). We are storing "tokens" and "schemaVersion". Each snapshot manifest also contains "schemaContent" with the CQL representation of the schema all SSTables in a logical snapshot belong to, so an operator knows what the schema was at the time that snapshot was taken, plus what the tokens were, plus what the schema version was.

> - I didn't understand the TTL mechanism. If we only copy SSTables that haven't been seen before, some SSTables will exist indefinitely across snapshots (i.e. L4), while others (in L0) will quickly disappear. There needs to be a mechanism to determine if the SSTable is expirable (i.e. no longer exists in active snapshots) by comparing the manifests at the time of snapshot TTL.

I am not completely sure I get this. What I meant by TTL is the existing functionality in "nodetool snapshot" where you can specify a TTL flag which says that in e.g. 1 day this snapshot will be automatically deleted. I was talking about the scenario where this snapshot is backed up and then, after 1 day, it is due to be removed. That is done by periodically checking, across the manifests of all snapshots, whether a snapshot is expired or not. If it is, we just remove that snapshot. Removal of a snapshot means that we go over every SSTable it logically consists of and check against all other manifests we have whether that SSTable is also part of those snapshots or not. If it is not - if that SSTable exists only in the snapshot we are removing and nowhere else - we can proceed to physically remove that SSTable. If it does exist in other snapshots, we will not remove it, because we would make those other snapshots corrupt - they would point to an SSTable which is no longer there. If I have a snapshot consisting of 5 SSTables, then all these SSTables are compacted into 1 and I take a snapshot again, the second snapshot will consist of 1 SSTable only. When I remove the first snapshot, I can just remove all 5 SSTables, because none of them is part of any other snapshot; the second snapshot consists of 1 SSTable only, which is different from all SSTables found in the first snapshot.

> Broadly it sounds like we are saving the operator the burden of performing snapshot uploads to some cloud service, but there are benefits (at least from a backup perspective) of performing it independently - i.e. managing bandwidth usage or additional security layers.

Managing bandwidth is an interesting topic. What Esop does is make bandwidth configurable. You can specify how many bytes per second it uploads with, or you can specify the time within which you expect the snapshot to be uploaded. E.g. if there are 10 GiB to upload and you say you have 5 hours for that, it will compute how many bytes per second it should upload with. If a cluster is under a lot of stress / talks a lot, we do not want to put even more load on it in terms of network traffic because of snapshots; snapshots can be uploaded as something of lower significance / importance. This might all be done in this work as well, maybe as some follow-up.

>
> James.
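To make the removal check I described above a bit more concrete, here is a rough sketch; the Manifest type and the method name are placeholders for illustration, not an existing Cassandra API:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Placeholder model: a manifest is just the set of SSTable identifiers
    // the logical snapshot consists of.
    record Manifest(String snapshotName, Set<String> sstables) {}

    // An SSTable may be physically deleted only if no remaining snapshot
    // manifest still references it.
    static Set<String> sstablesSafeToDelete(Manifest snapshotToRemove,
                                            List<Manifest> remainingManifests)
    {
        Set<String> candidates = new HashSet<>(snapshotToRemove.sstables());
        for (Manifest other : remainingManifests)
            candidates.removeAll(other.sstables()); // still referenced elsewhere -> keep it
        return candidates; // referenced only by the snapshot being removed
    }

With the 5-compacted-into-1 example above, removing the first snapshot would return all 5 original SSTables, since the second snapshot only references the newly compacted one.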
> On Tue, 21 Jan 2025 at 01:05, Štefan Miklošovič <smikloso...@apache.org> wrote:
>
>> If you ask specifically about how TTL snapshots are handled, there is a thread with a task scheduled every n seconds (not sure what the default is) which just checks the "expired_at" field in the manifest to see whether the snapshot is expired or not. If it is, it will proceed to delete it like any other snapshot. Then the logic I have described above would be in place.
>>
>> On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>>> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
>>>
>>>> I think we should evaluate the benefits of the feature you are proposing independently of how it might be used by Sidecar or other tools. As it is, it already sounds like useful functionality to have in the core of the Cassandra process.
>>>>
>>>> Tooling around Cassandra, including Sidecar, can then leverage this functionality to create snapshots, and then add additional capabilities on top to perform backups.
>>>>
>>>> I've added some comments inline below:
>>>>
>>>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
>>>> > Hi,
>>>> >
>>>> > I would like to run this through the ML to gather feedback as we are contemplating making this happen.
>>>> >
>>>> > Currently, snapshots are just hardlinks located in a snapshot directory pointing into the live data directory. That is super handy as it occupies virtually zero disk space etc. (as long as the underlying SSTables are not compacted away; then their size would "materialize").
>>>> >
>>>> > On the other hand, because it is a hardlink, it is not possible to make hard links across block devices (the infamous "Invalid cross-device link" error). That means that snapshots can only ever be located on the very same disk Cassandra has its data dirs on.
>>>> >
>>>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share) mounted to a Cassandra node and they would like to use that as cheap / cold storage for snapshots. They do not care about the speed of such storage, nor do they care about how much space it occupies etc. when it comes to snapshots. On the other hand, they do not want to have snapshots occupying disk space where Cassandra has its data, because they consider it to be a waste of space. They would like to utilize the fast disk and its space for production data to the max, and snapshots might eat a lot of that space unnecessarily.
>>>> >
>>>> > There might be a configuration property like "snapshot_root_dir: /mnt/nfs/cassandra" and if a snapshot is taken, it would just copy SSTables there, but we need to be a little bit smart here. (By default, it would all work as it does now - hard links to snapshot directories located under Cassandra's data_file_directories.)
>>>> >
>>>> > Because it is a copy, it occupies disk space. But if we took 100 snapshots of the same SSTables, we would not want to copy the same files 100 times. There is a very handy way to prevent this - unique SSTable identifiers (under the already existing uuid_sstable_identifiers_enabled property) - so we could have a flat destination hierarchy where all SSTables would be located
>>>>
>>>> I have some questions around the flat destination hierarchy.
>>>> For example, how do you keep track of TTLs for different snapshots? What if one snapshot doesn't have a TTL and the second does? Those details will need to be worked out. Of course, we can discuss these things during implementation of the feature.

>>> There would be a list of files a logical snapshot consists of in a snapshot manifest. We would keep track of which SSTables are in which snapshots.
>>>
>>> This is not tied to TTL; any two non-expiring snapshots could share the same SSTables. If you go to remove one snapshot and you go to remove an SSTable, you need to check that that particular SSTable is not part of any other snapshot. If it is, then you cannot remove it while removing that snapshot, because that SSTable is part of another one. If you removed it, you would make the other snapshot corrupt, as it would be missing that SSTable.
>>>
>>> This logic is already implemented in Instaclustr Esop (1) (Esop as that Greek guy telling the fables (2)), the tooling we offer for backups and restores against various cloud providers. This was already implemented there and I feel confident it can be replicated here, without the ton of baggage which comes from having to accommodate specific clouds. I am not saying at all that the code from that tool would end up in Cassandra. No. What I am saying is that we have implemented that logic already, and in Cassandra it would be way simpler.
>>>
>>> (1) https://github.com/instaclustr/esop
>>> (2) https://en.wikipedia.org/wiki/Aesop

>>>> > in the same directory, and we would just check whether such an SSTable is already there or not before copying it. Snapshot manifests (currently under manifest.json) would then contain all SSTables a logical snapshot consists of.
>>>> >
>>>> > This would be possible only for _user snapshots_. All snapshots taken by Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots against all system tables, ephemeral snapshots) would continue to be hard links, and it would not be possible to locate them outside of the live data dirs.
>>>> >
>>>> > The advantages / characteristics of this approach for user snapshots:
>>>> >
>>>> > 1. Cassandra will be able to create snapshots located on different devices.
>>>> > 2. From an implementation perspective it would be totally transparent; there will be no specific code about "where" we copy. We would just copy, from a Java perspective, as we copy anywhere else.
>>>> > 3. All the tooling would work as it does now - nodetool listsnapshots / clearsnapshot / snapshot. Same outputs, same behavior.
>>>> > 4. No need to use external tools copying SSTables to the desired destination, custom scripts, manual synchronisation ...
>>>> > 5. Snapshots located outside of Cassandra's live data dirs would behave the same when it comes to snapshot TTL. (TTL on a snapshot means that after such and such a period of time it is automatically removed.) This logic would be the same, hence there is no need to reinvent the wheel when it comes to removing expired snapshots from the operator's perspective.
>>>> > 6. Such a solution would deduplicate SSTables, so it would be as space-efficient as possible (though not as efficient as hardlinks, for the obvious reasons mentioned above).
>>>> > It seems to me that there has recently been a "push" to add more logic to Cassandra where it was previously delegated to external tooling; for example, the CEP around automatic repairs basically does what external tooling does, we just move it under Cassandra. We would love to get rid of a lot of tooling and custom-written logic around copying snapshot SSTables. From the implementation perspective it would be just plain Java, without any external dependencies etc. There seems to be a lot to gain for relatively straightforward additions to the snapshotting code.

>>>> Agree that there are things that need to move closer to the database process where it makes sense. Repair is an obvious one. This change seems beneficial as well, and for use cases that do not need to rely on this functionality the behavior would remain the same, so I see this as a win.

>>>> > We did serious housekeeping in CASSANDRA-18111 where we consolidated and centralized everything related to snapshot management, so we feel comfortable building logic like this on top of that. In fact, CASSANDRA-18111 was a prerequisite for this, because we did not want to base this work on the pre-18111 state of things when it comes to snapshots (it was all over the code base, with fragmented and duplicated logic etc.).
>>>> >
>>>> > WDYT?
>>>> >
>>>> > Regards
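For illustration only, the "copy if not exists" behaviour from the proposal quoted above could look roughly like this; a sketch under the flat-hierarchy assumption, not the actual implementation:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Sketch: copy an SSTable component into the shared (flat) snapshot
    // destination only if a file with the same name is not already there,
    // so the same SSTable is never copied twice across snapshots.
    static void copyIfNotExists(Path sstableComponent, Path destinationDir) throws IOException
    {
        Path target = destinationDir.resolve(sstableComponent.getFileName());
        if (!Files.exists(target))
            Files.copy(sstableComponent, target);
    }

This deduplication only works because, with uuid_sstable_identifiers_enabled, the file name alone identifies the SSTable across the whole node (and cluster), as discussed earlier in the thread.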