Fwiw, I don't have a problem with using a shell script. In the email I sent, I was trying to illustrate that exploiting a shell vulnerability essentially requires a system that has already been completely compromised, either through JMX or through CQL (assuming we can update configs via CQL).
If someone wanted to do a Java version of the archiving command, I think that's fine, but there's going to be a lot of valid use cases that aren't covered by it. I think a lot of operators will just want to be able to pop in some shell and be done with it. If I'm going to either write a whole bunch of Java or take 3 minutes to call `rclone`, I'm definitely calling rclone.

Overall, I like the idea of having a post-snapshot callback. I think the Java version lets people do it in Java, and also leaves the possibility for people to do it in shell, so it's probably the better fit.

Jon

On 2025/01/23 16:25:01 Štefan Miklošovič wrote:
> I feel uneasy about executing scripts from Cassandra. Jon was talking about this here (1) as well. I would not base this on any shell script / command executions. I think nothing beats pure Java copying files to a directory ...
>
> (1) https://lists.apache.org/thread/jcr3mln2tohbckvr8fjrr0sq0syof080
>
> On Thu, Jan 23, 2025 at 5:16 PM Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>
> > For commit log archiving we already have the concept of "commands" to be executed. Maybe a similar concept would be useful for snapshots? Maybe a new "user snapshot with command" nodetool action could be added. The server would make its usual hard links inside a snapshot folder and then it could shell out to a new process running the "snapshot archiving command", passing it the directory just made. Then whatever logic is wanted could be implemented in the command script, be that copying to S3, copying to a folder on another mount point, or whatever the operator wants to happen.
> >
> > -Jeremiah
> >
> > On Jan 23, 2025 at 7:54:20 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
> >
> >> Interesting, I will need to think about it more. Thanks for chiming in.
> >>
> >> On Wed, Jan 22, 2025 at 8:10 PM Blake Eggleston <beggles...@apple.com> wrote:
> >>
> >>> Somewhat tangential, but I'd like to see Cassandra provide a backup story that doesn't involve making copies of sstables. They're constantly rewritten by compaction, and intelligent backup systems often need to be able to read sstable metadata to optimize storage usage.
> >>>
> >>> An interface purpose-built to support incremental backup and restore would almost definitely be more efficient since it could account for compaction, and would separate operational requirements from storage layer implementation details.
> >>>
> >>> On Jan 22, 2025, at 2:33 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
> >>>
> >>> On Wed, Jan 22, 2025 at 2:21 AM James Berragan <jberra...@gmail.com> wrote:
> >>>
> >>>> I think this is an idea worth exploring, my guess is that even if the scope is confined to just "copy if not exists" it would still largely be used as a cloud-agnostic backup/restore solution, and so will be shaped accordingly.
> >>>>
> >>>> Some thoughts:
> >>>>
> >>>> - I think it would be worth exploring more what the directory structure looks like. You mention a flat directory hierarchy, but it seems to me it would need to be delimited by node (or token range) in some way, as the SSTable identifier will not be unique across the cluster. If we do need to delimit by node, is the configuration burden then on the user to mount individual drives to S3/Azure/wherever with unique per-node paths? What do they do in the event of a host replacement, back up to a new empty directory?
> >>>
> >>> It will be unique when "uuid_sstable_identifiers_enabled: true", even across the cluster. If we worked with "old identifiers" too, those are indeed not unique (not even across different tables in the same node). I am not completely sure how far we want to go with this; I don't have a problem saying that we support this feature only with "uuid_sstable_identifiers_enabled: true". If we were to support the older SSTable identifier naming as well, that would complicate it more. Esop's directory structure of a remote destination is here:
> >>>
> >>> https://github.com/instaclustr/esop?tab=readme-ov-file#directory-structure-of-a-remote-destination
> >>>
> >>> and what the content of the snapshot's manifest looks like is just below it.
> >>>
> >>> We may go with a hierarchical structure as well if that is evaluated to be a better approach; I just find a flat hierarchy simpler. We cannot have a flat hierarchy with old / non-unique identifiers, so we would need to find a way to differentiate one SSTable from another, which naturally leads to them being placed in a keyspace/table/sstable hierarchy. But I do not want to complicate it further by supporting flat and non-flat hierarchies simultaneously (where a user could pick which one they want). We should go with just one solution.
> >>>
> >>> When it comes to node replacement, I think it would just be up to an operator to rename the whole directory to reflect a new path for that particular node. Imagine an operator has a bucket in Azure which is empty (/) and it is mounted to /mnt/nfs/cassandra on every node. Then on node 1, Cassandra would automatically start to put SSTables into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-1 and node 2 would put them into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-2.
> >>>
> >>> The "cluster-name/dc-name/node-id" part would be added automatically by Cassandra itself. It would just append it to /mnt/nfs/cassandra, under which the bucket is mounted.
> >>>
> >>> If you replaced the node, the data would stay; only the node's ID would change. In that case, all that would be necessary is to rename the "node-id-1" directory to "node-id-3" (id-3 being the host id of the new node). The snapshot manifest does not know anything about host ids, so the content of the manifest would not need to be changed. If you don't rename the node id directory, then snapshots would indeed be made under a new host id directory, which would be empty at first.
> >>>
> >>>> - The challenge with restore is often restoring from snapshots created before a cluster topology change (node replacements, token moves, cluster expansions/shrinks, etc.). This could be solved by storing the snapshot token information in the manifest somewhere. Ideally the user shouldn't have to scan token information across all SSTables snapshot-wide to determine which ones to restore.
> >>>
> >>> Yes, see the content of the snapshot manifest as I mentioned already (a couple of lines below the example of the directory hierarchy). We are storing "tokens" and "schemaVersion". Each snapshot manifest also contains "schemaContent" with a CQL representation of the schema that all SSTables in a logical snapshot belong to, so an operator knows what the schema was at the time the snapshot was taken, plus what the tokens and the schema version were.
> >>>
> >>>> - I didn't understand the TTL mechanism. If we only copy SSTables that haven't been seen before, some SSTables will exist indefinitely across snapshots (e.g. L4), while others (in L0) will quickly disappear. There needs to be a mechanism to determine if the SSTable is expirable (i.e. no longer exists in active snapshots) by comparing the manifests at the time of snapshot TTL.
> >>>
> >>> I am not completely sure I get this. What I meant by TTL is that there is currently functionality in "nodetool snapshot" where you can specify a TTL flag which says that after e.g. 1 day, this snapshot will be automatically deleted. I was talking about the scenario where this snapshot is backed up and then, after 1 day, we realize that we are going to remove it. That is done by periodically checking the manifest of every snapshot to see whether that snapshot has expired or not. If it has, then we just remove that snapshot.
> >>>
> >>> Removal of a snapshot means that we go over every SSTable it logically consists of and check against all other manifests we have whether that SSTable is also part of those snapshots or not. If it is not, i.e. that SSTable exists only in the snapshot we are removing and nowhere else, we can proceed to physically remove that SSTable. If it does exist in other snapshots, then we will not remove it, because we would make the other snapshots corrupt, pointing to an SSTable which would no longer be there.
> >>>
> >>> If I have a snapshot consisting of 5 SSTables, then all of these SSTables are compacted into 1 and I take a snapshot again, the second snapshot will consist of that 1 SSTable only. When I remove the first snapshot, I can just remove all 5 SSTables, because none of them is part of any other snapshot. The second snapshot consists of 1 SSTable only, which is different from all the SSTables found in the first snapshot.
> >>>
> >>>> Broadly it sounds like we are saving the operator the burden of performing snapshot uploads to some cloud service, but there are benefits (at least from a backup perspective) to performing this independently, e.g. managing bandwidth usage or additional security layers.
> >>>
> >>> Managing bandwidth is an interesting topic. In Esop, bandwidth is configurable: you can say how many bytes per second it should upload at, or you can say in what time you expect the snapshot to be uploaded. E.g. if we have 10 GiB to upload and you say that you have 5 hours for that, then it will compute how many bytes per second it should upload at. If a cluster is under a lot of stress / talks a lot, we do not want to put even more network load on it because of snapshots. Snapshots can just be uploaded as something of lower significance / importance. This might all be done in this work as well, maybe as a follow-up.
> >>>
> >>>> James.
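To make Jeremiah's "snapshot archiving command" idea more concrete, here is a minimal, hypothetical Java sketch of the server-side hook: after the usual hard-linked snapshot has been created, an operator-supplied command is run with the snapshot directory substituted in. The class name, the %s placeholder and the rclone example are illustrative assumptions, not existing Cassandra configuration or APIs.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of a post-snapshot "archiving command" hook.
public final class SnapshotArchiveCommand
{
    public static void run(String commandTemplate, Path snapshotDir) throws IOException, InterruptedException
    {
        // e.g. commandTemplate = "rclone copy %s remote:cassandra-backups/" (entirely up to the operator)
        String command = String.format(commandTemplate, snapshotDir.toAbsolutePath());

        Process process = new ProcessBuilder(List.of("/bin/sh", "-c", command))
                          .inheritIO() // in this sketch, surface the command's output in the server's stdout/stderr
                          .start();

        int exitCode = process.waitFor();
        if (exitCode != 0)
            throw new IOException("snapshot archiving command failed with exit code " + exitCode);
    }
}
```

The operator-facing contract would simply be "a command that receives the snapshot directory", whether it ends up calling rclone, a copy to another mount point, or anything else.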
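The removal rule Štefan describes above (an SSTable is physically deleted only when no remaining snapshot manifest references it) can be summarized in a short sketch. This is illustrative only, assuming manifests have already been parsed into sets of file names; it is not Cassandra's or Esop's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;

// Minimal sketch of reference-checked snapshot removal over a flat snapshot directory.
public final class SnapshotRemover
{
    /**
     * Delete a snapshot: an SSTable file is physically removed only if no other
     * snapshot's manifest still references it.
     */
    public static void removeSnapshot(String snapshotToRemove,
                                      Map<String, Set<Path>> manifests, // snapshot name -> SSTable files it consists of
                                      Path flatSnapshotDir) throws IOException
    {
        Set<Path> candidates = manifests.remove(snapshotToRemove);
        if (candidates == null)
            return;

        for (Path sstable : candidates)
        {
            boolean referencedElsewhere = manifests.values().stream()
                                                   .anyMatch(tables -> tables.contains(sstable));
            if (!referencedElsewhere)
                Files.deleteIfExists(flatSnapshotDir.resolve(sstable));
        }
    }
}
```

In the compaction example above, removing the first snapshot deletes all 5 of its SSTables because the second snapshot's manifest references only the single post-compaction SSTable.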
> >>>> On Tue, 21 Jan 2025 at 01:05, Štefan Miklošovič <smikloso...@apache.org> wrote:
> >>>>
> >>>>> If you ask specifically about how snapshots with a TTL are handled, there is a thread running a task scheduled every n seconds (not sure what the default is) which just checks the "expired_at" field in the manifest to see whether the snapshot has expired or not. If it has, then it will proceed to delete it like any other snapshot. Then the logic I have described above would apply.
> >>>>>
> >>>>> On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
> >>>>>
> >>>>>> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
> >>>>>>
> >>>>>>> I think we should evaluate the benefits of the feature you are proposing independently of how it might be used by Sidecar or other tools. As it is, it already sounds like useful functionality to have in the core of the Cassandra process.
> >>>>>>>
> >>>>>>> Tooling around Cassandra, including Sidecar, can then leverage this functionality to create snapshots, and then add additional capabilities on top to perform backups.
> >>>>>>>
> >>>>>>> I've added some comments inline below:
> >>>>>>>
> >>>>>>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
> >>>>>>> > Hi,
> >>>>>>> >
> >>>>>>> > I would like to run this through the ML to gather feedback as we are contemplating making this happen.
> >>>>>>> >
> >>>>>>> > Currently, snapshots are just hardlinks, located in a snapshot directory, pointing into the live data directory. That is super handy as it occupies virtually zero disk space etc. (as long as the underlying SSTables are not compacted away; then their size would "materialize").
> >>>>>>> >
> >>>>>>> > On the other hand, because it is a hardlink, it is not possible to make hard links across block devices (the infamous "Invalid cross-device link" error). That means that snapshots can only ever be located on the very same disk Cassandra has its datadirs on.
> >>>>>>> >
> >>>>>>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share) mounted to a Cassandra node and they would like to use that as cheap / cold storage for snapshots. They do not care about the speed of such storage, nor do they care about how much space it occupies etc. when it comes to snapshots. On the other hand, they do not want to have snapshots occupying disk space where Cassandra has its data, because they consider it a waste of space. They would like to utilize the fast disk and its space for production data to the max, and snapshots might eat a lot of that space unnecessarily.
> >>>>>>> >
> >>>>>>> > There might be a configuration property like "snapshot_root_dir: /mnt/nfs/cassandra" and if a snapshot is taken, it would just copy SSTables there, but we need to be a little bit smart here. (By default, it would all work as it does now: hard links to snapshot directories located under Cassandra's data_file_directories.)
> >>>>>>> >
> >>>>>>> > Because it is a copy, it occupies disk space. But if we took 100 snapshots of the same SSTables, we would not want to copy the same files 100 times. There is a very handy way to prevent this: unique SSTable identifiers (under the already existing uuid_sstable_identifiers_enabled property), so we could have a flat destination hierarchy where all SSTables would be located
> >>>>>>>
> >>>>>>> I have some questions around the flat destination hierarchy. For example, how do you keep track of TTLs for different snapshots? What if one snapshot doesn't have a TTL and the second does? Those details will need to be worked out. Of course, we can discuss these things during the implementation of the feature.
> >>>>>>
> >>>>>> There would be a list of files a logical snapshot consists of in a snapshot manifest. We would keep track of which SSTables are in which snapshots.
> >>>>>>
> >>>>>> This is not tied to TTL; any two non-expiring snapshots could share the same SSTables. If you go to remove one snapshot and you are about to remove an SSTable, you need to check whether that particular SSTable is part of any other snapshot. If it is, then you cannot remove it while removing that snapshot, because that table is part of another one. If you removed it, then you would make the other snapshot corrupt, as it would be missing that SSTable.
> >>>>>>
> >>>>>> This logic is already implemented in Instaclustr Esop (1) (Esop as in that Greek guy telling the fables (2)), the tooling we offer for backups and restores against various cloud providers. This has already been implemented and I feel confident it could be replicated here, but without the ton of baggage that comes from having to accommodate specific clouds. I am not saying at all that the code from that tool would end up in Cassandra. No. What I am saying is that we have implemented that logic already, and in Cassandra it would be just way simpler.
> >>>>>>
> >>>>>> (1) https://github.com/instaclustr/esop
> >>>>>> (2) https://en.wikipedia.org/wiki/Aesop
> >>>>>>
> >>>>>>> > in the same directory, and we would just check whether such an SSTable is already there or not before copying it. Snapshot manifests (currently under manifest.json) would then contain all SSTables a logical snapshot consists of.
> >>>>>>> >
> >>>>>>> > This would be possible only for _user snapshots_. All snapshots taken by Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots against all system tables, ephemeral snapshots) would continue to be hard links and it would not be possible to locate them outside of the live data dirs.
> >>>>>>> >
> >>>>>>> > The advantages / characteristics of this approach for user snapshots:
> >>>>>>> >
> >>>>>>> > 1. Cassandra will be able to create snapshots located on different devices.
> >>>>>>> > 2. From an implementation perspective it would be totally transparent; there would be no specific code about "where" we copy. We would just copy, from a Java perspective, as we copy anywhere else.
> >>>>>>> > 3. All the tooling would work as it does now - nodetool listsnapshots / clearsnapshot / snapshot. Same outputs, same behavior.
> >>>>>>> > 4. No need to use external tools to copy SSTables to the desired destination, custom scripts, manual synchronisation ...
> >>>>>>> > 5. Snapshots located outside of Cassandra's live data dirs would behave the same when it comes to snapshot TTL. (A TTL on a snapshot means that after a given period of time it is automatically removed.) This logic would be the same. Hence, there is no need to reinvent the wheel when it comes to removing expired snapshots from the operator's perspective.
> >>>>>>> > 6. Such a solution would deduplicate SSTables, so it would be as space-efficient as possible (though not as efficient as hardlinks, for the obvious reasons mentioned above).
> >>>>>>> >
> >>>>>>> > It seems to me that there has recently been a "push" to add more logic to Cassandra that was previously delegated to external tooling; for example, the CEP around automatic repairs is basically doing what external tooling does, we just move it under Cassandra. We would love to get rid of a lot of tooling and custom-written logic around copying snapshot SSTables. From the implementation perspective it would be just plain Java, without any external dependencies etc. There seems to be a lot to gain from relatively straightforward additions to the snapshotting code.
> >>>>>>>
> >>>>>>> Agree that there are things that need to move closer to the database process where it makes sense. Repair is an obvious one. This change seems beneficial as well, and for use cases that do not need to rely on this functionality the behavior would remain the same, so I see this as a win.
> >>>>>>>
> >>>>>>> > We did some serious housekeeping in CASSANDRA-18111, where we consolidated and centralized everything related to snapshot management, so we feel comfortable building logic like this on top of that. In fact, CASSANDRA-18111 was a prerequisite for this because we did not want to base this work on the pre-18111 state of things when it comes to snapshots (it was all over the code base, with fragmented and duplicated logic etc).
> >>>>>>> >
> >>>>>>> > WDYT?
> >>>>>>> >
> >>>>>>> > Regards
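For reference, here is a rough sketch of the "copy if not exists" behaviour proposed in this thread: user snapshots copied into a flat per-node directory under a snapshot_root_dir-style location, deduplicated by relying on uuid_sstable_identifiers_enabled making file names unique cluster-wide, with the copied files recorded for the snapshot's manifest. The directory layout and all names are assumptions based on the discussion, not an actual implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a deduplicating, copying user snapshot into a flat per-node directory.
public final class CopyingSnapshotWriter
{
    /**
     * Copy the snapshot's SSTable components into a flat per-node directory under the
     * configured snapshot root (e.g. /mnt/nfs/cassandra/<cluster>/<dc>/<host-id>/),
     * skipping any file that is already there, and return the list of files the
     * logical snapshot consists of (to be recorded in its manifest).
     */
    public static List<Path> copySnapshot(List<Path> sstableComponents, Path perNodeRoot) throws IOException
    {
        Files.createDirectories(perNodeRoot);
        List<Path> manifestEntries = new ArrayList<>();

        for (Path component : sstableComponents)
        {
            Path destination = perNodeRoot.resolve(component.getFileName());
            // Deduplication: a component already copied for an earlier snapshot is not copied again.
            if (!Files.exists(destination))
                Files.copy(component, destination, StandardCopyOption.COPY_ATTRIBUTES);
            manifestEntries.add(destination.getFileName());
        }
        return manifestEntries;
    }
}
```

The snapshot's manifest.json would then list these entries, which is what makes the reference-checked removal sketched earlier in the thread possible.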