I feel uneasy about executing scripts from Cassandra. Jon was talking about this here (1) as well. I would not base this on any shell script / command execution. I think nothing beats pure Java copying files to a directory ...

(1) https://lists.apache.org/thread/jcr3mln2tohbckvr8fjrr0sq0syof080
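For illustration, a minimal sketch of that "pure Java" copying, using nothing but java.nio (the SnapshotCopier class and its method are invented for this sketch; they are not existing Cassandra code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class SnapshotCopier
{
    /**
     * Copies a single SSTable component into the snapshot destination,
     * creating the destination directory as needed. No shelling out involved.
     */
    public static Path copy(Path sstableComponent, Path snapshotDir) throws IOException
    {
        Files.createDirectories(snapshotDir);
        Path target = snapshotDir.resolve(sstableComponent.getFileName());
        // deliberately no REPLACE_EXISTING: a component that is already
        // present is simply kept (the "copy if not exists" idea discussed below)
        if (Files.notExists(target))
            Files.copy(sstableComponent, target, StandardCopyOption.COPY_ATTRIBUTES);
        return target;
    }
}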
On Thu, Jan 23, 2025 at 5:16 PM Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:

> For commit log archiving we already have the concept of “commands” to be executed. Maybe a similar concept would be useful for snapshots? Maybe a new “user snapshot with command” nodetool action could be added. The server would make its usual hard links inside a snapshot folder and then it could shell off a new process running the “snapshot archiving command”, passing it the directory just made. Then whatever logic is wanted could be implemented in the command script. Be that copying to S3, or copying to a folder on another mount point, or whatever the operator wants to happen.
>
> -Jeremiah
>
> On Jan 23, 2025 at 7:54:20 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>
>> Interesting, I will need to think about it more. Thanks for chiming in.
>>
>> On Wed, Jan 22, 2025 at 8:10 PM Blake Eggleston <beggles...@apple.com> wrote:
>>
>>> Somewhat tangential, but I’d like to see Cassandra provide a backup story that doesn’t involve making copies of sstables. They’re constantly rewritten by compaction, and intelligent backup systems often need to be able to read sstable metadata to optimize storage usage.
>>>
>>> An interface purpose-built to support incremental backup and restore would almost definitely be more efficient since it could account for compaction, and would separate operational requirements from storage layer implementation details.
>>>
>>> On Jan 22, 2025, at 2:33 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>
>>> On Wed, Jan 22, 2025 at 2:21 AM James Berragan <jberra...@gmail.com> wrote:
>>>
>>>> I think this is an idea worth exploring; my guess is that even if the scope is confined to just "copy if not exists" it would still largely be used as a cloud-agnostic backup/restore solution, and so will be shaped accordingly.
>>>>
>>>> Some thoughts:
>>>>
>>>> - I think it would be worth exploring more what the directory structure looks like. You mention a flat directory hierarchy, but it seems to me it would need to be delimited by node (or token range) in some way, as the SSTable identifier will not be unique across the cluster. If we do need to delimit by node, is the configuration burden then on the user to mount individual drives to S3/Azure/wherever to unique per-node paths? What do they do in the event of a host replacement, back up to a new empty directory?
>>>
>>> It will be unique when "uuid_sstable_identifiers_enabled: true", even across the cluster. If we worked with "old identifiers" too, those are indeed not unique (even across different tables in the same node). I am not completely sure how far we want to go with this; I don't have a problem saying that we support this feature only with "uuid_sstable_identifiers_enabled: true". If we were to support the older SSTable identifier naming as well, that would complicate it more. Esop's directory structure of a remote destination is here:
>>>
>>> https://github.com/instaclustr/esop?tab=readme-ov-file#directory-structure-of-a-remote-destination
>>>
>>> and how the content of the snapshot's manifest looks is shown just below it.
>>>
>>> We may go with a hierarchical structure as well if that is evaluated to be a better approach. I just find a flat hierarchy simpler.
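To make the "copy if not exists" idea with a flat destination concrete, here is a rough sketch assuming "uuid_sstable_identifiers_enabled: true", so a component file name never repeats across the cluster; the class and method names are made up for illustration and are not actual Cassandra or Esop APIs:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public final class FlatSnapshotDestination
{
    private final Path root; // e.g. /mnt/nfs/cassandra/<cluster-name>/<dc-name>/<host-id>

    public FlatSnapshotDestination(Path root)
    {
        this.root = root;
    }

    /**
     * Copies only the components that are not present yet. With UUID-based
     * SSTable identifiers an existing file means the very same SSTable was
     * already uploaded by a previous snapshot, so it can be skipped.
     */
    public void copyIfNotExists(List<Path> sstableComponents) throws IOException
    {
        Files.createDirectories(root);
        for (Path component : sstableComponents)
        {
            Path target = root.resolve(component.getFileName());
            if (Files.notExists(target))
                Files.copy(component, target, StandardCopyOption.COPY_ATTRIBUTES);
        }
    }
}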
>>> We cannot have a flat hierarchy with old / non-unique identifiers, so we would need to find a way to differentiate one SSTable from another, which naturally leads to them being placed in a keyspace/table/sstable hierarchy. But I do not want to complicate it further by supporting flat and non-flat hierarchies simultaneously (where a user could pick which one they want). We should go with just one solution.
>>>
>>> When it comes to node replacement, I think it would be just up to an operator to rename the whole directory to reflect a new path for that particular node. Imagine an operator has a bucket in Azure which is empty (/) and it is mounted to /mnt/nfs/cassandra on every node. Then on node 1, Cassandra would automatically start to put SSTables into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-1 and node 2 would put them into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-2.
>>>
>>> The "cluster-name/dc-name/node-id" part would be appended automatically by Cassandra itself to /mnt/nfs/cassandra, under which the bucket is mounted.
>>>
>>> If you replaced the node, the data would stay; only the node's ID would change. In that case, all that would be necessary would be to rename the "node-id-1" directory to "node-id-3" (id-3 being the host id of the replacement node). The snapshot manifest does not know anything about the host id, so the content of the manifest would not need to be changed. If you don't rename the node id directory, then snapshots would indeed be made under a new host id directory which would be empty at first.
>>>
>>>> - The challenge often with restore is restoring from snapshots created before a cluster topology change (node replacements, token moves, cluster expansions/shrinks etc.). This could be solved by storing the snapshot token information in the manifest somewhere. Ideally the user shouldn't have to scan token information across all SSTables in a snapshot to determine which ones to restore.
>>>
>>> Yes, see the content of the snapshot manifest as I mentioned already (a couple of lines below the example of the directory hierarchy). We are storing "tokens" and "schemaVersion". Each snapshot manifest also contains "schemaContent" with the CQL representation of the schema all SSTables in a logical snapshot belong to, so an operator knows what the schema was at the time that snapshot was taken, plus what the tokens were, plus what the schema version was.
>>>
>>>> - I didn't understand the TTL mechanism. If we only copy SSTables that haven't been seen before, some SSTables will exist indefinitely across snapshots (i.e. L4), while others (in L0) will quickly disappear. There needs to be a mechanism to determine if the SSTable is expirable (i.e. no longer exists in active snapshots) by comparing the manifests at the time of snapshot TTL.
>>>
>>> I am not completely sure I get this. What I meant by TTL is that there is functionality currently in "nodetool snapshot" where you can specify a TTL flag which says that in e.g. 1 day this snapshot will be automatically deleted. I was talking about the scenario where this snapshot is backed up and then, after 1 day, we realize that we are going to remove it. That is done by periodically checking, in the manifest of every snapshot, whether that snapshot is evaluated as expired or not. If it is, then we just remove that snapshot.
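As a sketch of that periodic expiry check, assuming the manifest exposes something like an "expired_at" timestamp (the SnapshotManifest interface and field names here are assumptions for illustration, not the real manifest format):

import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public final class SnapshotTtlChecker
{
    /** Minimal view of a snapshot manifest, just enough for this sketch. */
    public interface SnapshotManifest
    {
        String snapshotName();
        Instant expiredAt(); // null when the snapshot has no TTL
    }

    /** Returns the snapshots whose TTL has elapsed and which should be removed. */
    public static List<SnapshotManifest> findExpired(List<SnapshotManifest> manifests, Instant now)
    {
        return manifests.stream()
                        .filter(m -> m.expiredAt() != null && m.expiredAt().isBefore(now))
                        .collect(Collectors.toList());
    }
}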
>>> Removal of a snapshot means that we just go over every SSTable it logically consists of and check, against all other manifests we have, whether that SSTable is also part of those snapshots or not. If it is not, that is, if that SSTable exists only in the snapshot we are about to remove and nowhere else, we can proceed to physically remove that SSTable. If it does exist in other snapshots, then we will not remove it, because we would make those other snapshots corrupt, pointing to an SSTable which would no longer be there.
>>>
>>> If I have a snapshot consisting of 5 SSTables, then all these SSTables are compacted into 1 and I make a snapshot again, the second snapshot will consist of 1 SSTable only. When I remove the first snapshot, I can just remove all 5 SSTables, because none of them is part of any other snapshot. The second snapshot consists of 1 SSTable only, which is different from all SSTables found in the first snapshot.
>>>
>>>> Broadly it sounds like we are saving the operator the burden of performing snapshot uploads to some cloud service, but there are benefits (at least from a backup perspective) of performing that independently - i.e. managing bandwidth usage or additional security layers.
>>>
>>> Managing bandwidth is an interesting topic. What Esop does is make the bandwidth configurable. You can say how many bytes per second it should upload with, or you can say in what time you expect the snapshot to be uploaded. E.g. if we have 10 GiB to upload and you say that you have 5 hours for that, then it will compute how many bytes per second it should upload with. If a cluster is under a lot of stress / talks a lot, we do not want to put even more load on it in terms of network traffic because of snapshots. Snapshots can just be uploaded as something with lower significance / importance. This might all be done in this work as well, maybe as some follow-up.
>>>
>>>> James.
>>>>
>>>> On Tue, 21 Jan 2025 at 01:05, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>
>>>>> If you ask specifically about how TTL snapshots are handled, there is a thread running with a task scheduled every n seconds (not sure what the default is) and it just checks against the "expired_at" field in the manifest whether it is expired or not. If it is, then it will proceed to delete it like any other snapshot. Then the logic I have described above would be in place.
>>>>>
>>>>> On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>
>>>>>> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
>>>>>>
>>>>>>> I think we should evaluate the benefits of the feature you are proposing independently of how it might be used by Sidecar or other tools. As it is, it already sounds like useful functionality to have in the core of the Cassandra process.
>>>>>>>
>>>>>>> Tooling around Cassandra, including Sidecar, can then leverage this functionality to create snapshots, and then add additional capabilities on top to perform backups.
>>>>>>>
>>>>>>> I've added some comments inline below:
>>>>>>>
>>>>>>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > I would like to run this through the ML to gather feedback as we are contemplating making this happen.
>>>>>>> > Currently, snapshots are just hardlinks located in a snapshot directory pointing to the live data directory. That is super handy as it occupies virtually zero disk space etc. (as long as the underlying SSTables are not compacted away; then their size would "materialize").
>>>>>>> >
>>>>>>> > On the other hand, because it is a hardlink, it is not possible to make hard links across block devices (the infamous "Invalid cross-device link" error). That means that snapshots can only ever be located on the very same disk Cassandra has its data dirs on.
>>>>>>> >
>>>>>>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share) mounted to a Cassandra node and they would like to use that as cheap / cold storage of snapshots. They do not care about the speed of such storage, nor do they care about how much space it occupies etc. when it comes to snapshots. On the other hand, they do not want to have snapshots occupying disk space where Cassandra has its data, because they consider that to be a waste of space. They would like to utilize the fast disk and its space for production data to the max, and snapshots might eat a lot of that space unnecessarily.
>>>>>>> >
>>>>>>> > There might be a configuration property like "snapshot_root_dir: /mnt/nfs/cassandra" and if a snapshot is taken, it would just copy SSTables there, but we need to be a little bit smart here. (By default, it would all work as it does now - hard links to snapshot directories located under Cassandra's data_file_directories.)
>>>>>>> >
>>>>>>> > Because it is a copy, it occupies disk space. But if we took 100 snapshots of the same SSTables, we would not want to copy the same files 100 times. There is a very handy way to prevent this - unique SSTable identifiers (under the already existing uuid_sstable_identifiers_enabled property) - so we could have a flat destination hierarchy where all SSTables would be located
>>>>>>>
>>>>>>> I have some questions around the flat destination hierarchy. For example, how do you keep track of TTLs for different snapshots? What if one snapshot doesn't have a TTL and the second does? Those details will need to be worked out. Of course, we can discuss these things during implementation of the feature.
>>>>>>
>>>>>> There would be a list of files a logical snapshot consists of in a snapshot manifest. We would keep track of what SSTables are in what snapshots.
>>>>>>
>>>>>> This is not tied to TTL; any two non-expiring snapshots could share the same SSTables. If you go to remove one snapshot and you go to remove an SSTable, you need to check whether that particular SSTable is part of any other snapshot. If it is, then you cannot remove it while removing that snapshot, because that table is part of another one. If you removed it, then you would make the other snapshot corrupt as it would be missing that SSTable.
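A sketch of that reference check (the manifest representation and names are hypothetical; the point is only that an SSTable may be physically deleted when no other snapshot's manifest still references it):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Set;

public final class SnapshotRemover
{
    /** Minimal manifest view for this sketch: a snapshot is a named set of SSTable files. */
    public record Manifest(String snapshotName, Set<Path> sstables) {}

    /**
     * Removes the given snapshot's SSTables from the shared destination,
     * skipping every file that is still referenced by any other manifest.
     */
    public static void remove(Manifest toRemove, Collection<Manifest> allManifests) throws IOException
    {
        for (Path sstable : toRemove.sstables())
        {
            boolean referencedElsewhere =
                allManifests.stream()
                            .filter(m -> !m.snapshotName().equals(toRemove.snapshotName()))
                            .anyMatch(m -> m.sstables().contains(sstable));

            if (!referencedElsewhere)
                Files.deleteIfExists(sstable); // no other snapshot points at it
        }
    }
}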
>>>>>> This logic is already implemented in Instaclustr Esop (1) (Esop as in that Greek guy telling the fables (2)), the tooling we offer for backups and restores against various cloud providers. This stuff is already implemented and I feel confident it could be replicated here, but without the ton of baggage which comes from the fact that we need to accommodate specific clouds. I am not saying at all that the code from that tool would end up in Cassandra. No. What I am saying is that we have implemented that logic already, and in Cassandra it would be just way simpler.
>>>>>>
>>>>>> (1) https://github.com/instaclustr/esop
>>>>>> (2) https://en.wikipedia.org/wiki/Aesop
>>>>>>
>>>>>>> > in the same directory and we would just check if such an SSTable is already there or not before copying it. Snapshot manifests (currently under manifest.json) would then contain all SSTables a logical snapshot consists of.
>>>>>>> >
>>>>>>> > This would be possible only for _user snapshots_. All snapshots taken by Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots against all system tables, ephemeral snapshots) would continue to be hard links and it would not be possible to locate them outside of the live data dirs.
>>>>>>> >
>>>>>>> > The advantages / characteristics of this approach for user snapshots:
>>>>>>> >
>>>>>>> > 1. Cassandra will be able to create snapshots located on different devices.
>>>>>>> > 2. From an implementation perspective it would be totally transparent; there would be no specific code about "where" we copy. From a Java perspective, we would just copy as we copy anywhere else.
>>>>>>> > 3. All the tooling would work as it does now - nodetool listsnapshots / clearsnapshot / snapshot. Same outputs, same behavior.
>>>>>>> > 4. No need to use external tools copying SSTables to the desired destination, custom scripts, manual synchronisation ...
>>>>>>> > 5. Snapshots located outside of Cassandra's live data dirs would behave the same when it comes to snapshot TTL. (TTL on a snapshot means that after such and such a period of time, it is automatically removed.) This logic would be the same. Hence, there is no need to re-invent the wheel when it comes to removing expired snapshots from the operator's perspective.
>>>>>>> > 6. Such a solution would deduplicate SSTables, so it would be as space-efficient as possible (but not as efficient as hardlinks, for the obvious reasons mentioned above).
>>>>>>> >
>>>>>>> > It seems to me that there has recently been a "push" to add more logic to Cassandra where it was previously delegated to external tooling; for example, the CEP around automatic repairs is basically doing what external tooling does, we just move it under Cassandra. We would love to get rid of a lot of tooling and custom-written logic around copying snapshot SSTables. From the implementation perspective it would be just plain Java, without any external dependencies etc. There seems to be a lot to gain for relatively straightforward additions to the snapshotting code.
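Putting the proposal above together, a sketch of how the decision between today's hardlink behaviour and the proposed copy into "snapshot_root_dir" could look; the class, the isUserSnapshot flag and the way the target is resolved are placeholders for illustration, not the eventual design:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class SnapshotTargetResolver
{
    private final Path snapshotRootDir; // null => option not set, current behaviour everywhere

    public SnapshotTargetResolver(Path snapshotRootDir)
    {
        this.snapshotRootDir = snapshotRootDir;
    }

    /**
     * User snapshots go to the configured root as plain copies (possibly on
     * another block device); everything else keeps today's hardlink-into-datadir
     * behaviour, including all snapshots taken by Cassandra itself.
     */
    public void snapshotComponent(Path component, Path defaultSnapshotDir, boolean isUserSnapshot) throws IOException
    {
        if (isUserSnapshot && snapshotRootDir != null)
        {
            Files.createDirectories(snapshotRootDir);
            Path target = snapshotRootDir.resolve(component.getFileName());
            if (Files.notExists(target))
                Files.copy(component, target, StandardCopyOption.COPY_ATTRIBUTES);
        }
        else
        {
            Files.createDirectories(defaultSnapshotDir);
            // same-device hard link, exactly as snapshots work today
            Files.createLink(defaultSnapshotDir.resolve(component.getFileName()), component);
        }
    }
}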
>>>>>>> Agree that there are things that need to move closer to the database process where it makes sense. Repair is an obvious one. This change seems beneficial as well, and for use cases that do not need to rely on this functionality the behavior would remain the same, so I see this as a win.
>>>>>>>
>>>>>>> > We did serious housekeeping in CASSANDRA-18111, where we consolidated and centralized everything related to snapshot management, so we feel comfortable building logic like this on top of that. In fact, CASSANDRA-18111 was a prerequisite for this because we did not want to base this work on the pre-18111 state of things when it comes to snapshots (it was all over the code base, with fragmented and duplicated logic etc.).
>>>>>>> >
>>>>>>> > WDYT?
>>>>>>> >
>>>>>>> > Regards