Interesting, I will need to think about it more. Thanks for chiming in.

On Wed, Jan 22, 2025 at 8:10 PM Blake Eggleston <beggles...@apple.com> wrote:
> Somewhat tangential, but I’d like to see Cassandra provide a backup story
> that doesn’t involve making copies of sstables. They’re constantly
> rewritten by compaction, and intelligent backup systems often need to be
> able to read sstable metadata to optimize storage usage.
>
> An interface purpose-built to support incremental backup and restore would
> almost certainly be more efficient, since it could account for compaction
> and would separate operational requirements from storage-layer
> implementation details.
>
> On Jan 22, 2025, at 2:33 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>
> On Wed, Jan 22, 2025 at 2:21 AM James Berragan <jberra...@gmail.com> wrote:
>
>> I think this is an idea worth exploring. My guess is that even if the
>> scope is confined to just "copy if not exists", it would still largely be
>> used as a cloud-agnostic backup/restore solution, and so will be shaped
>> accordingly.
>>
>> Some thoughts:
>>
>> - I think it would be worth exploring more what the directory structure
>> looks like. You mention a flat directory hierarchy, but it seems to me it
>> would need to be delimited by node (or token range) in some way, as the
>> SSTable identifier will not be unique across the cluster. If we do need to
>> delimit by node, is the configuration burden then on the user to mount
>> individual drives to S3/Azure/wherever with unique per-node paths? What do
>> they do in the event of a host replacement, back up to a new empty
>> directory?
>
> It will be unique when "uuid_sstable_identifiers_enabled: true", even
> across the cluster. If we worked with "old identifiers" too, those are
> indeed not unique (even across different tables on the same node). I am not
> completely sure how far we want to go with this; I don't have a problem
> saying that we support this feature only with
> "uuid_sstable_identifiers_enabled: true". If we were to support the older
> SSTable identifier naming as well, that would complicate it more. Esop's
> directory structure of a remote destination is here:
>
> https://github.com/instaclustr/esop?tab=readme-ov-file#directory-structure-of-a-remote-destination
>
> and what the content of the snapshot's manifest looks like is just below it.
>
> We may go with a hierarchical structure as well if that is evaluated to be
> a better approach; I just find a flat hierarchy simpler. We cannot have a
> flat hierarchy with old / non-unique identifiers, so we would need to find
> a way to differentiate one SSTable from another, which naturally leads to
> them being placed in a keyspace/table/sstable hierarchy. But I do not want
> to complicate it further by supporting flat and non-flat hierarchies
> simultaneously (where a user could pick which one they want). We should go
> with just one solution.
>
> When it comes to node replacement, I think it would just be up to an
> operator to rename the whole directory to reflect the new path for that
> particular node. Imagine an operator has a bucket in Azure which is empty
> (/) and it is mounted to /mnt/nfs/cassandra on every node. Then node 1
> would automatically start to put SSTables into
> /mnt/nfs/cassandra/cluster-name/dc-name/node-id-1 and node 2 would put them
> into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-2.
>
> The "cluster-name/dc-name/node-id" part would be appended automatically by
> Cassandra itself to /mnt/nfs/cassandra, under which the bucket would be
> mounted.
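> Roughly, resolving that per-node destination could look something like the
> sketch below (the class and method names are made up purely for
> illustration; nothing here is actual Cassandra code):
>
>     import java.nio.file.Path;
>     import java.nio.file.Paths;
>     import java.util.UUID;
>
>     // Sketch only: composes the per-node snapshot destination under the
>     // mounted bucket, i.e. <snapshot_root_dir>/<cluster-name>/<dc-name>/<node-id>.
>     public final class SnapshotDestination
>     {
>         private final Path snapshotRootDir; // e.g. /mnt/nfs/cassandra (mounted bucket)
>         private final String clusterName;
>         private final String datacenter;
>         private final UUID hostId;
>
>         public SnapshotDestination(Path snapshotRootDir, String clusterName, String datacenter, UUID hostId)
>         {
>             this.snapshotRootDir = snapshotRootDir;
>             this.clusterName = clusterName;
>             this.datacenter = datacenter;
>             this.hostId = hostId;
>         }
>
>         public Path nodeRoot()
>         {
>             return snapshotRootDir.resolve(clusterName)
>                                   .resolve(datacenter)
>                                   .resolve(hostId.toString());
>         }
>
>         public static void main(String[] args)
>         {
>             UUID hostId = UUID.randomUUID();
>             Path root = new SnapshotDestination(Paths.get("/mnt/nfs/cassandra"),
>                                                 "cluster-name", "dc-name", hostId).nodeRoot();
>             // prints /mnt/nfs/cassandra/cluster-name/dc-name/<host id>
>             System.out.println(root);
>         }
>     }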
> If you replaced the node, the data would stay; only the node's ID would
> change. In that case, all that would be necessary is to rename the
> "node-id-1" directory to "node-id-3" (id-3 being the host id of the
> replacement node). The snapshot manifest does not know anything about the
> host id, so the content of the manifest would not need to be changed. If
> you don't rename the node id directory, then snapshots would indeed be made
> under a new host id directory, which would be empty at first.
>
>> - The challenge often with restore is restoring from snapshots created
>> before a cluster topology change (node replacements, token moves,
>> cluster expansions/shrinks etc.). This could be solved by storing the
>> snapshot token information in the manifest somewhere. Ideally the user
>> shouldn't have to scan token information across all SSTables snapshot-wide
>> to determine which ones to restore.
>
> Yes, see the content of the snapshot manifest I mentioned already (a couple
> of lines below the example of the directory hierarchy). We are storing
> "tokens" and "schemaVersion". Each snapshot manifest also contains
> "schemaContent" with the CQL representation of the schema that all SSTables
> in a logical snapshot belong to, so an operator knows what the schema was
> at the time the snapshot was taken, plus what the tokens were, plus what
> the schema version was.
>
>> - I didn't understand the TTL mechanism. If we only copy SSTables that
>> haven't been seen before, some SSTables will exist indefinitely across
>> snapshots (i.e. L4), while others (in L0) will quickly disappear. There
>> needs to be a mechanism to determine if the SSTable is expirable (i.e. no
>> longer exists in active snapshots) by comparing the manifests at the
>> time of snapshot TTL.
>
> I am not completely sure I get this. What I meant by TTL is the
> functionality currently in "nodetool snapshot" where you can specify a TTL
> flag which says that in, e.g., 1 day, this snapshot will be automatically
> deleted. I was talking about the scenario where this snapshot is backed up
> and then, after 1 day, it is due to be removed. That is done by
> periodically checking the manifest of every snapshot to see whether that
> snapshot is evaluated as expired or not. If it is, then we just remove that
> snapshot.
>
> Removal of a snapshot means that we go over every SSTable it logically
> consists of and check against all other manifests we have whether that
> SSTable is also part of those snapshots or not. If it is not, i.e. that
> SSTable exists only in the snapshot we are removing and nowhere else, we
> can proceed to physically remove that SSTable. If it does exist in other
> snapshots, then we will not remove it, because we would make the other
> snapshots corrupt, pointing to an SSTable which would no longer be there.
>
> If I have a snapshot consisting of 5 SSTables, then all these SSTables are
> compacted into 1 and I take a snapshot again, the second snapshot will
> consist of 1 SSTable only. When I remove the first snapshot, I can just
> remove all 5 SSTables, because none of them is part of any other snapshot.
> The second snapshot consists of 1 SSTable only, which is different from all
> SSTables found in the first snapshot.
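> Expressed as a sketch (types simplified and names made up; in reality this
> would work against the parsed manifest.json files of each snapshot):
>
>     import java.util.HashSet;
>     import java.util.Map;
>     import java.util.Set;
>
>     public final class SnapshotRemoval
>     {
>         /**
>          * Returns the SSTables of the snapshot being removed which are not
>          * referenced by any other snapshot's manifest and can therefore be
>          * physically deleted from the snapshot destination.
>          */
>         public static Set<String> removableSSTables(String snapshotToRemove,
>                                                     Map<String, Set<String>> manifests)
>         {
>             Set<String> candidates = new HashSet<>(manifests.get(snapshotToRemove));
>
>             for (Map.Entry<String, Set<String>> entry : manifests.entrySet())
>             {
>                 if (entry.getKey().equals(snapshotToRemove))
>                     continue;
>
>                 // any SSTable shared with another snapshot has to stay,
>                 // otherwise that other snapshot would become corrupt
>                 candidates.removeAll(entry.getValue());
>             }
>
>             return candidates;
>         }
>
>         public static void main(String[] args)
>         {
>             Map<String, Set<String>> manifests = Map.of(
>                 "snapshot-1", Set.of("a", "b", "c", "d", "e"),
>                 "snapshot-2", Set.of("f")); // "f" is the result of compacting a..e
>
>             // prints all five SSTables of snapshot-1 (in no particular order),
>             // since none of them is shared with snapshot-2
>             System.out.println(removableSSTables("snapshot-1", manifests));
>         }
>     }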
>> Broadly it sounds like we are saving the operator the burden of
>> performing snapshot uploads to some cloud service, but there are benefits
>> (at least from a backup perspective) to performing it independently - i.e.
>> managing bandwidth usage or additional security layers.
>
> Managing bandwidth is an interesting topic. What Esop does is make the
> bandwidth configurable. You can say how many bytes per second it should
> upload at, or you can say in what time you expect the snapshot to be
> uploaded. E.g. if we have 10 GiB to upload and you say that you have 5
> hours for that, then it will compute how many bytes per second it should
> upload at. If a cluster is under a lot of stress / talks a lot, we do not
> want to put even more load on it in terms of network traffic because of
> snapshots. Snapshots can just be uploaded as something of lower
> significance / importance. This might all be done in this work as well,
> maybe as a follow-up.
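> The computation itself is trivial, something along these lines (a sketch
> only, just to illustrate the numbers, not how any of this is actually
> implemented):
>
>     public final class ThrottleExample
>     {
>         public static void main(String[] args)
>         {
>             long totalBytes = 10L * 1024 * 1024 * 1024;  // 10 GiB to upload
>             long windowSeconds = 5L * 60 * 60;           // 5 hours to do it in
>
>             // ~582 KiB/s is enough to finish within the window
>             long bytesPerSecond = totalBytes / windowSeconds;
>             System.out.println(bytesPerSecond + " B/s");
>         }
>     }
>
> A copy loop could then simply sleep whenever it gets ahead of that rate;
> the exact throttling mechanism is an implementation detail.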
>> James.
>>
>> On Tue, 21 Jan 2025 at 01:05, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>>> If you ask specifically about how TTL snapshots are handled, there is a
>>> thread running with a task scheduled every n seconds (not sure what the
>>> default is) which just checks the "expired_at" field in the manifest to
>>> see whether the snapshot is expired or not. If it is, it will proceed to
>>> delete it like any other snapshot. Then the logic I have described above
>>> would be in place.
>>>
>>> On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>
>>>> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
>>>>
>>>>> I think we should evaluate the benefits of the feature you are
>>>>> proposing independently of how it might be used by Sidecar or other
>>>>> tools. As it is, it already sounds like useful functionality to have in
>>>>> the core of the Cassandra process.
>>>>>
>>>>> Tooling around Cassandra, including Sidecar, can then leverage this
>>>>> functionality to create snapshots, and then add additional capabilities
>>>>> on top to perform backups.
>>>>>
>>>>> I've added some comments inline below:
>>>>>
>>>>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I would like to run this through the ML to gather feedback, as we are
>>>>> > contemplating making this happen.
>>>>> >
>>>>> > Currently, snapshots are just hard links in a snapshot directory
>>>>> > pointing to files in the live data directory. That is super handy, as
>>>>> > a snapshot occupies virtually zero disk space (as long as the
>>>>> > underlying SSTables are not compacted away; then their size would
>>>>> > "materialize").
>>>>> >
>>>>> > On the other hand, because it is a hard link, it is not possible to
>>>>> > make hard links across block devices (the infamous "Invalid
>>>>> > cross-device link" error). That means that snapshots can only ever be
>>>>> > located on the very same disk Cassandra has its data dirs on.
>>>>> >
>>>>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share)
>>>>> > mounted to a Cassandra node and they would like to use that as cheap /
>>>>> > cold storage for snapshots. They do not care about the speed of such
>>>>> > storage, nor do they care about how much space it occupies when it
>>>>> > comes to snapshots. On the other hand, they do not want snapshots
>>>>> > occupying disk space where Cassandra has its data, because they
>>>>> > consider that a waste of space. They would like to utilize the fast
>>>>> > disk and its space for production data to the max, and snapshots might
>>>>> > eat a lot of that space unnecessarily.
>>>>> >
>>>>> > There might be a configuration property like "snapshot_root_dir:
>>>>> > /mnt/nfs/cassandra", and if a snapshot is taken, it would just copy
>>>>> > SSTables there, but we need to be a little bit smart here. (By default,
>>>>> > it would all work as it does now - hard links to snapshot directories
>>>>> > located under Cassandra's data_file_directories.)
>>>>> >
>>>>> > Because it is a copy, it occupies disk space. But if we took 100
>>>>> > snapshots of the same SSTables, we would not want to copy the same
>>>>> > files 100 times. There is a very handy way to prevent this - unique
>>>>> > SSTable identifiers (under the already existing
>>>>> > uuid_sstable_identifiers_enabled property) - so we could have a flat
>>>>> > destination hierarchy where all SSTables would be located
>>>>>
>>>>> I have some questions around the flat destination hierarchy. For
>>>>> example, how do you keep track of TTLs for different snapshots? What if
>>>>> one snapshot doesn't have a TTL and the second does? Those details will
>>>>> need to be worked out. Of course, we can discuss these things during
>>>>> implementation of the feature.
>>>>
>>>> There would be a list of files a logical snapshot consists of in a
>>>> snapshot manifest. We would keep track of which SSTables are in which
>>>> snapshots.
>>>>
>>>> This is not tied to TTL; any two non-expiring snapshots could share the
>>>> same SSTables. When you remove one snapshot and are about to remove an
>>>> SSTable, you need to check whether that particular SSTable is part of any
>>>> other snapshot. If it is, then you cannot remove it while removing that
>>>> snapshot, because it is part of another one. If you removed it, you would
>>>> make the other snapshot corrupt, as it would be missing that SSTable.
>>>>
>>>> This logic is already implemented in Instaclustr Esop (1) (Esop as in
>>>> that Greek guy telling the fables (2)), the tooling we offer for backups
>>>> and restores against various cloud providers. This stuff is already
>>>> implemented and I feel confident it can be replicated here, but without a
>>>> ton of the baggage which comes from the fact that we need to accommodate
>>>> specific clouds. I am not saying at all that the code from that tool
>>>> would end up in Cassandra. No. What I am saying is that we have
>>>> implemented that logic already, and in Cassandra it would be just way
>>>> simpler.
>>>>
>>>> (1) https://github.com/instaclustr/esop
>>>> (2) https://en.wikipedia.org/wiki/Aesop
>>>>
>>>>> > in the same directory, and we would just check whether such an SSTable
>>>>> > is already there or not before copying it. Snapshot manifests
>>>>> > (currently under manifest.json) would then contain all SSTables a
>>>>> > logical snapshot consists of.
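>>>>> > In code, the idea is roughly the following (purely illustrative, no
>>>>> > actual API is implied here):
>>>>> >
>>>>> >     import java.io.IOException;
>>>>> >     import java.nio.file.Files;
>>>>> >     import java.nio.file.Path;
>>>>> >     import java.nio.file.StandardCopyOption;
>>>>> >
>>>>> >     public final class CopyIfAbsent
>>>>> >     {
>>>>> >         /**
>>>>> >          * Copies an SSTable component into the flat snapshot destination
>>>>> >          * only if a file with the same (unique) name is not there yet,
>>>>> >          * so 100 snapshots of the same SSTable copy it just once.
>>>>> >          */
>>>>> >         public static boolean copyIfAbsent(Path component, Path destinationDir) throws IOException
>>>>> >         {
>>>>> >             Path target = destinationDir.resolve(component.getFileName());
>>>>> >             if (Files.exists(target))
>>>>> >                 return false; // already backed up by a previous snapshot
>>>>> >
>>>>> >             Files.copy(component, target, StandardCopyOption.COPY_ATTRIBUTES);
>>>>> >             return true;
>>>>> >         }
>>>>> >     }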
>>>>> > This would be possible only for _user snapshots_. All snapshots taken
>>>>> > by Cassandra itself (diagnostic snapshots, snapshots upon repairs,
>>>>> > snapshots against all system tables, ephemeral snapshots) would
>>>>> > continue to be hard links, and it would not be possible to locate them
>>>>> > outside of the live data dirs.
>>>>> >
>>>>> > The advantages / characteristics of this approach for user snapshots:
>>>>> >
>>>>> > 1. Cassandra will be able to create snapshots located on different
>>>>> > devices.
>>>>> > 2. From an implementation perspective it would be totally transparent;
>>>>> > there would be no specific code about "where" we copy. We would just
>>>>> > copy, from the Java perspective, as we copy anywhere else.
>>>>> > 3. All the tooling would work as it does now - nodetool listsnapshots /
>>>>> > clearsnapshot / snapshot. Same outputs, same behavior.
>>>>> > 4. No need to use external tools copying SSTables to the desired
>>>>> > destination, custom scripts, manual synchronisation ...
>>>>> > 5. Snapshots located outside of Cassandra's live data dirs would behave
>>>>> > the same when it comes to snapshot TTL. (TTL on a snapshot means that
>>>>> > after such and such a period of time, it is automatically removed.)
>>>>> > This logic would be the same. Hence, there is no need to reinvent the
>>>>> > wheel when it comes to removing expired snapshots from the operator's
>>>>> > perspective.
>>>>> > 6. Such a solution would deduplicate SSTables, so it would be as
>>>>> > space-efficient as possible (though not as efficient as hard links, for
>>>>> > the obvious reasons mentioned above).
>>>>> >
>>>>> > It seems to me that there has recently been a "push" to add more logic
>>>>> > to Cassandra where it was previously delegated to external tooling; for
>>>>> > example, the CEP around automatic repairs is basically doing what
>>>>> > external tooling does, we just move it under Cassandra. We would love
>>>>> > to get rid of a lot of tooling and custom-written logic around copying
>>>>> > snapshot SSTables. From the implementation perspective it would be just
>>>>> > plain Java, without any external dependencies etc. There seems to be a
>>>>> > lot to gain from relatively straightforward additions to the
>>>>> > snapshotting code.
>>>>>
>>>>> Agree that there are things that need to move closer to the database
>>>>> process where it makes sense. Repair is an obvious one. This change seems
>>>>> beneficial as well, and for use cases that do not need to rely on this
>>>>> functionality the behavior would remain the same, so I see this as a win.
>>>>>
>>>>> > We did serious housekeeping in CASSANDRA-18111, where we consolidated
>>>>> > and centralized everything related to snapshot management, so we feel
>>>>> > comfortable building logic like this on top of that. In fact,
>>>>> > CASSANDRA-18111 was a prerequisite for this, because we did not want to
>>>>> > base this work on the pre-18111 state of things when it comes to
>>>>> > snapshots (it was all over the code base, with fragmented and
>>>>> > duplicated logic etc.).
>>>>> >
>>>>> > WDYT?
>>>>> >
>>>>> > Regards