Sorry for going silent on this. I have been thinking about it more, and what Blake suggested, having incremental backups somehow integrated, resonated with me. I was trying to figure out how this would all work, though.
For the discussion of scripts vs. no scripts: I just do not see how scripts would be helpful when we are mounting a remote bucket locally and it looks like any other directory. What is the "win" here? If it is about "flexibility", I am still not persuaded, because all major clouds have some way to mount their storage locally; what (meaningful) storage would one use that is not mountable locally? Let's keep it realistic: in 99% of cases this will be used for S3 / Azure. I also consider scripts which would contain the uploading logic to be, actually, harder to use. Where do we draw the line between what is the responsibility of a script and what is not? What if a command executed by the script fails? How are we going to detect that, and how are we going to retry? Are we going to do this in shell scripts? How would we, even theoretically, apply rate limiting to what a script uploads? We have very good control over all of that in Java, not in scripts. Anyway, if we make it robust enough, we may add scripting capabilities later on; I just do not consider it a priority.

For the actual solution: this should support a normal snapshot being taken and its SSTables uploaded / copied to a remote destination. Sure, compaction is going to create new SSTables and so on, but that is how this works, so there is no way around it.

Incremental backups might be supported too, but it is not yet clear to me what that would look like. SSTables which appear in the "backups" dir are just hardlinked there; there is no resemblance to any logical "backup" under some name we might reference. An SSTable which appears in "backups" would basically be copied over to S3 or wherever, to some bucket ... then what?

I think the rule-of-thumb strategy for the upload is to take a snapshot, e.g. each day (granularity is really a variable here), and then to upload incremental backups until the next snapshot is taken. Upon a restoration, we would restore a snapshot and then add all SSTables from "backups" up to the desired restoration time. With this approach we do not over-upload on compaction, because a snapshot is taken rather infrequently and we just upload SSTables from "backups" along the way.

From the implementation perspective, the snapshotting logic would work as already described, and for backups we would copy to the "backups" directory (from Cassandra's point of view) instead of hardlinking.

However, it is not only about uploading. We are listing these snapshots too, e.g. via "nodetool listsnapshots". There is nothing like that for backups. What is really there to list? It will be just a bunch of SSTables in a directory in a bucket, mounted locally.

For restoration purposes, I can imagine there might be a "nodetool backups" command to which you would add an argument being a snapshot name, like this:

nodetool backups ks tb my-snapshot <some timestamp>

and it would give you the list of SSTables, when my-snapshot is restored, which you are supposed to copy over to your node to get to some point in time.

Example:

nodetool snapshot my-snapshot -> that will upload SSTables 01, 02, 03 at time t_0

Then new data is added by SSTables 04, 05, 06, uploaded incrementally, created at times t_1, t_2 and t_3 respectively. Then, when you do:

nodetool backups ks tb my-snapshot t_2

it will output to the console:

SSTable 04
SSTable 05

04 and 05 are the ones which bring you to t_2 when my-snapshot is restored.

When it comes to deletion, for a snapshot we have "nodetool clearsnapshot". What would deletion look like for backups? Do you just say up to what timestamp you want SSTables to be removed? Would

nodetool clearbackups ks tb t_2

remove all SSTables in backups which are older than or equal to t_2? This is a rather dangerous command, because we might create "holes" between the time a normal snapshot was taken and an SSTable in backups which was not deleted. The my-snapshot SSTables plus the t_3 SSTables would not give meaningful results on restoration, because the t_1 and t_2 SSTables would be missing. So we might remove only those SSTables from "backups" which are older than any snapshot we have.
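To make the listing idea above a bit more concrete, here is a minimal Java sketch of what such a "nodetool backups ks tb <snapshot> <timestamp>" command could do on the server side. Every type and name below is made up purely for illustration; nothing like this exists yet:

    import java.time.Instant;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Illustrative sketch only - none of these types exist in Cassandra today.
    public class BackupListingSketch
    {
        // An SSTable sitting in the remote "backups" directory plus the time it was created.
        record BackedUpSSTable(String id, Instant createdAt) {}

        // SSTables from "backups" which, applied on top of the snapshot, bring the node
        // to the requested point in time.
        static List<BackedUpSSTable> sstablesToApply(Instant snapshotTakenAt,
                                                     Instant restoreTo,
                                                     List<BackedUpSSTable> backups)
        {
            return backups.stream()
                          .filter(s -> s.createdAt().isAfter(snapshotTakenAt)) // newer than the snapshot
                          .filter(s -> !s.createdAt().isAfter(restoreTo))      // but not newer than the target time
                          .sorted(Comparator.comparing(BackedUpSSTable::createdAt))
                          .collect(Collectors.toList());
        }
    }

For the example above, asking for t_2 would return SSTables 04 and 05 and leave 06 out; a hypothetical "nodetool clearbackups" could reuse the same kind of filtering.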
On Wed, Feb 5, 2025 at 12:14 AM Jon Haddad <rustyrazorbl...@apache.org> wrote:

> Fwiw, I don't have a problem with using a shell script. In the email I sent, I was trying to illustrate how getting to exploiting a shell vulnerability essentially requires a system that's been completely compromised already, either through JMX or through CQL (assuming we can update configs via CQL).
>
> If someone wanted to do a Java version of the archiving command, I think that's fine, but there's going to be a lot of valid use cases that aren't covered by it. I think a lot of operators will just want to be able to pop in some shell and be done with it. If I'm going to either write a whole bunch of Java or take 3 minutes to call `rclone`, I'm definitely calling rclone.
>
> Overall, I like the idea of having a post-snapshot callback. I think the Java version lets people do it in Java, and also leaves the possibility for people to do it in shell, so it's probably the better fit.
>
> Jon
>
> On 2025/01/23 16:25:01 Štefan Miklošovič wrote:
>> I feel uneasy about executing scripts from Cassandra. Jon was talking about this here (1) as well. I would not base this on any shell scripts / command executions. I think nothing beats pure Java copying files to a directory ...
>>
>> (1) https://lists.apache.org/thread/jcr3mln2tohbckvr8fjrr0sq0syof080
>>
>> On Thu, Jan 23, 2025 at 5:16 PM Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>>> For commit log archiving we already have the concept of "commands" to be executed. Maybe a similar concept would be useful for snapshots? Maybe a new "user snapshot with command" nodetool action could be added. The server would make its usual hard links inside a snapshot folder and then it could shell off a new process running the "snapshot archiving command", passing it the directory just made. Then whatever logic is wanted could be implemented in the command script, be that copying to S3, copying to a folder on another mount point, or whatever the operator wants to happen.
>>>
>>> -Jeremiah
>>>
>>> On Jan 23, 2025 at 7:54:20 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>> Interesting, I will need to think about it more. Thanks for chiming in.
>>>>
>>>> On Wed, Jan 22, 2025 at 8:10 PM Blake Eggleston <beggles...@apple.com> wrote:
>>>>> Somewhat tangential, but I'd like to see Cassandra provide a backup story that doesn't involve making copies of sstables. They're constantly rewritten by compaction, and intelligent backup systems often need to be able to read sstable metadata to optimize storage usage.
>>>>> An interface purpose built to support incremental backup and restore would almost definitely be more efficient since it could account for compaction, and would separate operational requirements from storage layer implementation details.
>>>>>
>>>>> On Jan 22, 2025, at 2:33 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>
>>>>> On Wed, Jan 22, 2025 at 2:21 AM James Berragan <jberra...@gmail.com> wrote:
>>>>>> I think this is an idea worth exploring, my guess is that even if the scope is confined to just "copy if not exists" it would still largely be used as a cloud-agnostic backup/restore solution, and so will be shaped accordingly.
>>>>>>
>>>>>> Some thoughts:
>>>>>>
>>>>>> - I think it would be worth exploring more what the directory structure looks like. You mention a flat directory hierarchy, but it seems to me it would need to be delimited by node (or token range) in some way, as the SSTable identifier will not be unique across the cluster. If we do need to delimit by node, is the configuration burden then on the user to mount individual drives to S3/Azure/wherever with unique per-node paths? What do they do in the event of a host replacement, back up to a new empty directory?
>>>>>
>>>>> It will be unique with "uuid_sstable_identifiers_enabled: true", even across the cluster. If we worked with "old identifiers" too, those are indeed not unique (even across different tables in the same node). I am not completely sure how far we want to go with this; I don't have a problem saying that we support this feature only with "uuid_sstable_identifiers_enabled: true". If we were to support the older SSTable identifier naming as well, that would complicate it more. Esop's directory structure of a remote destination is here:
>>>>>
>>>>> https://github.com/instaclustr/esop?tab=readme-ov-file#directory-structure-of-a-remote-destination
>>>>>
>>>>> and how the content of the snapshot's manifest looks is shown just below it.
>>>>>
>>>>> We may go with a hierarchical structure as well if that is evaluated to be a better approach; I just find a flat hierarchy simpler. We cannot have a flat hierarchy with old / non-unique identifiers, so we would need to find a way to differentiate one SSTable from another, which naturally leads to them being placed in a keyspace/table/sstable hierarchy. But I do not want to complicate it further by supporting flat and non-flat hierarchies simultaneously (where a user could pick which one he wants). We should go with just one solution.
>>>>>
>>>>> When it comes to node replacement, I think it would just be up to an operator to rename the whole directory to reflect a new path for that particular node. Imagine an operator has a bucket in Azure which is empty (/) and it is mounted to /mnt/nfs/cassandra on every node. Then on node 1, Cassandra would automatically start to put SSTables into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-1, and node 2 would put them into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-2.
>>>>>
>>>>> The "cluster-name/dc-name/node-id" part would be appended automatically by Cassandra itself, under /mnt/nfs/cassandra where the bucket is mounted.
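[To illustrate the layout described above: composing that per-node destination under the mounted bucket could be as simple as the following sketch. The class and method names are made up; this is not an existing API.]

    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Illustrative only: append "cluster-name/dc-name/node-id" to the directory
    // under which the remote bucket is mounted.
    public class RemoteSnapshotPathSketch
    {
        static Path perNodeRoot(Path mountedBucket, String clusterName, String dcName, String hostId)
        {
            return mountedBucket.resolve(clusterName).resolve(dcName).resolve(hostId);
        }

        public static void main(String[] args)
        {
            // prints /mnt/nfs/cassandra/my-cluster/dc1/node-id-1
            System.out.println(perNodeRoot(Paths.get("/mnt/nfs/cassandra"), "my-cluster", "dc1", "node-id-1"));
        }
    }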
>>>>> If you replaced the node, the data would stay; only the node's ID would change. In that case, all that would be necessary is to rename the "node-id-1" directory to "node-id-3" (id-3 being the host id of the replacement node). The snapshot manifest does not know anything about the host id, so the content of the manifest would not need to be changed. If you don't rename the node id directory, then snapshots would indeed be made under a new host id directory, which would be empty at first.
>>>>>
>>>>>> - The challenge often with restore is restoring from snapshots created before a cluster topology change (node replacements, token moves, cluster expansions/shrinks etc). This could be solved by storing the snapshot token information in the manifest somewhere. Ideally the user shouldn't have to scan token information across all SSTables snapshot-wide to determine which ones to restore.
>>>>>
>>>>> Yes, see the content of the snapshot manifest as I mentioned already (a couple of lines below the example of the directory hierarchy). We are storing "tokens" and "schemaVersion". Each snapshot manifest also contains "schemaContent" with the CQL representation of the schema all SSTables in a logical snapshot belong to, so an operator knows what the schema was at the time that snapshot was taken, plus what the tokens were, plus what the schema version was.
>>>>>
>>>>>> - I didn't understand the TTL mechanism. If we only copy SSTables that haven't been seen before, some SSTables will exist indefinitely across snapshots (i.e. L4), while others (in L0) will quickly disappear. There needs to be a mechanism to determine if the SSTable is expirable (i.e. no longer exists in active snapshots) by comparing the manifests at the time of snapshot TTL.
>>>>>
>>>>> I am not completely sure I get this. What I meant by TTL is that there is functionality currently in "nodetool snapshot" where you can specify a TTL flag which says that in e.g. 1 day, this snapshot will be automatically deleted. I was talking about the scenario where this snapshot is backed up and then, after 1 day, we realize we are going to remove it. That is done by periodically checking, in the manifests of every snapshot, whether that snapshot is evaluated as expired or not. If it is, then we just remove that snapshot.
>>>>>
>>>>> Removal of a snapshot means that we go over every SSTable it logically consists of and check against all other manifests we have whether that SSTable is also part of those snapshots or not. If it is not, if that SSTable exists only in the snapshot we are about to remove and nowhere else, we can proceed to physically remove that SSTable. If it does exist in other snapshots, then we will not remove it, because we would make the other snapshots corrupt - pointing to an SSTable which would no longer be there.
>>>>>
>>>>> If I have a snapshot consisting of 5 SSTables, then all these SSTables are compacted into 1 and I take a snapshot again, the second snapshot will consist of 1 SSTable only. When I remove the first snapshot, I can just remove all 5 SSTables, because none of them is part of any other snapshot. The second snapshot consists of 1 SSTable only, which is different from all SSTables found in the first snapshot.
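[A minimal Java sketch of the removal rule described above, with made-up types: an SSTable is physically deleted only when no other snapshot manifest still references it.]

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative only - manifests are modelled as "snapshot name -> SSTable ids it consists of".
    public class SnapshotRemovalSketch
    {
        // Returns the SSTable ids that are safe to physically delete when removing the given snapshot.
        static Set<String> removableSSTables(String snapshotToRemove, Map<String, Set<String>> manifests)
        {
            Set<String> candidates = new HashSet<>(manifests.get(snapshotToRemove));
            manifests.forEach((snapshot, sstables) -> {
                if (!snapshot.equals(snapshotToRemove))
                    candidates.removeAll(sstables); // still referenced by another snapshot -> keep it
            });
            return candidates;
        }
    }

[In the 5-SSTables example, removing the first snapshot returns all 5 ids, since the second snapshot references a different, compacted SSTable.]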
>>>>>> Broadly it sounds like we are saving the operator the burden of performing snapshot uploads to some cloud service, but there are benefits (at least from a backup perspective) to performing that independently - i.e. managing bandwidth usage or additional security layers.
>>>>>
>>>>> Managing bandwidth is an interesting topic. What Esop does is make the bandwidth configurable. You can say how many bytes per second it should upload with, or you can say in what time you expect the snapshot to be uploaded. E.g. if we have 10 GiB to upload and you say that you have 5 hours for that, then it will compute how many bytes per second it should upload with. If a cluster is under a lot of stress / talks a lot, we do not want to put even more load on it in terms of network traffic because of snapshots; snapshots can just be uploaded as something of lower significance / importance. This might all be done in this work as well, maybe as a follow-up.
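[For illustration, the "10 GiB in 5 hours" rate computation mentioned above boils down to the following; the class is made up and is not an existing Esop or Cassandra API.]

    import java.time.Duration;

    // Illustrative only: derive an upload rate from the amount of data to ship
    // and the time window the operator allows.
    public class UploadThrottleSketch
    {
        static long bytesPerSecond(long bytesToUpload, Duration window)
        {
            return Math.max(1, bytesToUpload / Math.max(1, window.toSeconds()));
        }

        public static void main(String[] args)
        {
            long tenGiB = 10L * 1024 * 1024 * 1024;
            // 10 GiB over 5 hours -> roughly 580 KiB/s
            System.out.println(bytesPerSecond(tenGiB, Duration.ofHours(5)));
        }
    }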
>>>>>> James.
>>>>>>
>>>>>> On Tue, 21 Jan 2025 at 01:05, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>> If you ask specifically about how TTL snapshots are handled, there is a thread running a task scheduled every n seconds (not sure what the default is) which just checks the "expired_at" field in the manifest to see whether the snapshot is expired or not. If it is, it will proceed to delete it like any other snapshot. Then the logic I have described above would be in place.
>>>>>>>
>>>>>>> On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>>>>
>>>>>>>> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
>>>>>>>>> I think we should evaluate the benefits of the feature you are proposing independently of how it might be used by Sidecar or other tools. As it is, it already sounds like useful functionality to have in the core of the Cassandra process.
>>>>>>>>>
>>>>>>>>> Tooling around Cassandra, including Sidecar, can then leverage this functionality to create snapshots, and then add additional capabilities on top to perform backups.
>>>>>>>>>
>>>>>>>>> I've added some comments inline below:
>>>>>>>>>
>>>>>>>>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I would like to run this through the ML to gather feedback, as we are contemplating making this happen.
>>>>>>>>>>
>>>>>>>>>> Currently, snapshots are just hardlinks located in a snapshot directory pointing into the live data directory. That is super handy, as it occupies virtually zero disk space etc. (as long as the underlying SSTables are not compacted away; then their size would "materialize").
>>>>>>>>>>
>>>>>>>>>> On the other hand, because it is a hardlink, it is not possible to make hard links across block devices (the infamous "Invalid cross-device link" error). That means that snapshots can only ever be located on the very same disk Cassandra has its data dirs on.
>>>>>>>>>>
>>>>>>>>>> Imagine there is a company ABC which has a 10 TiB disk (or NFS share) mounted to a Cassandra node and they would like to use that as cheap / cold storage for snapshots. They do not care about the speed of such storage, nor do they care how much space it occupies when it comes to snapshots. On the other hand, they do not want snapshots occupying disk space where Cassandra has its data, because they consider that a waste of space. They would like to utilize the fast disk and its space for production data to the max, and snapshots might eat a lot of that space unnecessarily.
>>>>>>>>>>
>>>>>>>>>> There might be a configuration property like "snapshot_root_dir: /mnt/nfs/cassandra", and if a snapshot is taken, it would just copy SSTables there, but we need to be a little bit smart here. (By default, it would all work as it does now - hard links to snapshot directories located under Cassandra's data_file_directories.)
>>>>>>>>>>
>>>>>>>>>> Because it is a copy, it occupies disk space. But if we took 100 snapshots of the same SSTables, we would not want to copy the same files 100 times. There is a very handy way to prevent this - unique SSTable identifiers (under the already existing uuid_sstable_identifiers_enabled property) - so we could have a flat destination hierarchy where all SSTables would be located in the same directory, and we would just check whether such an SSTable is already there or not before copying it. Snapshot manifests (currently under manifest.json) would then contain all SSTables a logical snapshot consists of.
>>>>>>>>>
>>>>>>>>> I have some questions around the flat destination hierarchy. For example, how do you keep track of TTLs for different snapshots? What if one snapshot doesn't have a TTL and the second does? Those details will need to be worked out. Of course, we can discuss these things during the implementation of the feature.
>>>>>>>>
>>>>>>>> There would be a list of files a logical snapshot consists of in the snapshot manifest. We would keep track of which SSTables are in which snapshots.
>>>>>>>>
>>>>>>>> This is not tied to TTL; any two non-expiring snapshots could share the same SSTables. If you go to remove one snapshot and you go to remove an SSTable, you need to check that this particular SSTable is not part of any other snapshot. If it is, then you cannot remove it while removing that snapshot, because it is part of another one. If you removed it, you would make the other snapshot corrupt, as it would be missing that SSTable.
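[To illustrate the "copy only if not already present" deduplication from the quoted proposal above: with UUID-based SSTable identifiers the file name in the flat destination identifies the file, so the copy can simply be skipped if it already exists. A made-up sketch, not an existing API.]

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustrative "copy if not exists" into the flat destination directory.
    public class DeduplicatingCopySketch
    {
        static void copyIfAbsent(Path sstableComponent, Path flatDestinationDir) throws IOException
        {
            Path target = flatDestinationDir.resolve(sstableComponent.getFileName());
            if (!Files.exists(target))
                Files.copy(sstableComponent, target);
        }
    }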
>>>>>>>> This logic is already implemented in Instaclustr Esop (1) (Esop as in that Greek guy telling the fables (2)), the tooling we offer for backups and restores against various cloud providers. This stuff is already implemented and I feel confident it could be replicated here, but without the ton of baggage which comes from having to accommodate specific clouds. I am not saying at all that the code from that tool would end up in Cassandra. No. What I am saying is that we have implemented that logic already, and in Cassandra it would be just way simpler.
>>>>>>>>
>>>>>>>> (1) https://github.com/instaclustr/esop
>>>>>>>> (2) https://en.wikipedia.org/wiki/Aesop
>>>>>>>>
>>>>>>>>>> This would be possible only for _user snapshots_. All snapshots taken by Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots against all system tables, ephemeral snapshots) would continue to be hard links, and it would not be possible to locate them outside of the live data dirs.
>>>>>>>>>>
>>>>>>>>>> The advantages / characteristics of this approach for user snapshots:
>>>>>>>>>>
>>>>>>>>>> 1. Cassandra will be able to create snapshots located on different devices.
>>>>>>>>>> 2. From an implementation perspective it would be totally transparent; there will be no specific code about "where" we copy. We would just copy, from a Java perspective, as we copy anywhere else.
>>>>>>>>>> 3. All the tooling would work as it does now - nodetool listsnapshots / clearsnapshot / snapshot. Same outputs, same behavior.
>>>>>>>>>> 4. No need to use external tools copying SSTables to the desired destination, custom scripts, manual synchronisation ...
>>>>>>>>>> 5. Snapshots located outside of Cassandra's live data dirs would behave the same when it comes to snapshot TTL. (TTL on a snapshot means that after such and such a period of time, it is automatically removed.) This logic would be the same; hence there is no need to reinvent the wheel when it comes to removing expired snapshots from the operator's perspective.
>>>>>>>>>> 6. Such a solution would deduplicate SSTables, so it would be as space-efficient as possible (but not as efficient as hardlinks, for the obvious reasons mentioned above).
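[Sketching the "hard link by default, copy only for user snapshots" behaviour from the quoted proposal above; snapshot_root_dir is the proposed setting, everything else is made up for illustration.]

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustrative only: user snapshots are copied under the proposed snapshot_root_dir,
    // everything else keeps today's hardlink behaviour inside the live data dirs.
    public class SnapshotDestinationSketch
    {
        enum SnapshotKind { USER, DIAGNOSTIC, REPAIR, SYSTEM, EPHEMERAL }

        static void snapshotFile(Path sstable, Path hardlinkTarget, Path copyTarget,
                                 SnapshotKind kind, Path snapshotRootDir) throws IOException
        {
            if (kind == SnapshotKind.USER && snapshotRootDir != null)
                Files.copy(sstable, copyTarget);           // real copy, may cross block devices
            else
                Files.createLink(hardlinkTarget, sstable); // hardlink, same device only
        }
    }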
>>>>>>>>>> It seems to me that there is recently a "push" to add more logic to Cassandra where it was previously delegated to external tooling; for example, the CEP around automatic repairs is basically doing what external tooling does, we just move it under Cassandra. We would love to get rid of a lot of tooling and custom-written logic around copying snapshot SSTables. From the implementation perspective it would be just plain Java, without any external dependencies etc. There seems to be a lot to gain from relatively straightforward additions to the snapshotting code.
>>>>>>>>>
>>>>>>>>> Agree that there are things that need to move closer to the database process where it makes sense. Repair is an obvious one. This change seems beneficial as well, and for use cases that do not need to rely on this functionality the behavior would remain the same, so I see this as a win.
>>>>>>>>>
>>>>>>>>>> We did some serious housekeeping in CASSANDRA-18111 where we consolidated and centralized everything related to snapshot management, so we feel comfortable building logic like this on top of that. In fact, CASSANDRA-18111 was a prerequisite for this, because we did not want to base this work on the pre-18111 state of things when it comes to snapshots (it was all over the code base, with fragmented and duplicated logic etc).
>>>>>>>>>>
>>>>>>>>>> WDYT?
>>>>>>>>>>
>>>>>>>>>> Regards