Fwiw, I don't have a problem with using a shell script. In the email I sent, I was trying to illustrate that exploiting a shell vulnerability essentially requires a system that has already been completely compromised, either through JMX or through CQL (assuming we can update configs via CQL).
If someone wanted to do a Java version of the archiving command, I think that's fine, but there's going to be a lot of valid use cases that aren't covered by it. I think a lot of operators will just want to be able to pop in some shell and be done with it. If I'm going to either write a whole bunch of Java or take 3 minutes to call `rclone`, I'm definitely calling rclone.

Overall, I like the idea of having a post-snapshot callback. I think the Java version lets people do it in Java, and also leaves the possibility for people to do it in shell, so it's probably the better fit.

Jon

On 2025/01/23 16:25:01 Štefan Miklošovič wrote:
> I feel uneasy about executing scripts from Cassandra. Jon was talking about this here (1) as well. I would not base this on any shell script / command executions. I think nothing beats pure Java copying files to a directory ...
>
> (1) https://lists.apache.org/thread/jcr3mln2tohbckvr8fjrr0sq0syof080
>
> On Thu, Jan 23, 2025 at 5:16 PM Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>
> > For commit log archiving we already have the concept of "commands" to be executed. Maybe a similar concept would be useful for snapshots? Maybe a new "user snapshot with command" nodetool action could be added. The server would make its usual hard links inside a snapshot folder and then it could shell out to a new process running the "snapshot archiving command", passing it the directory just made. Then whatever logic is wanted could be implemented in the command script, be that copying to S3, copying to a folder on another mount point, or whatever the operator wants to happen.
> >
> > -Jeremiah
> >
> > On Jan 23, 2025 at 7:54:20 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
> >
> >> Interesting, I will need to think about it more. Thanks for chiming in.
> >>
> >> On Wed, Jan 22, 2025 at 8:10 PM Blake Eggleston <beggles...@apple.com> wrote:
> >>
> >>> Somewhat tangential, but I'd like to see Cassandra provide a backup story that doesn't involve making copies of sstables. They're constantly rewritten by compaction, and intelligent backup systems often need to be able to read sstable metadata to optimize storage usage.
> >>>
> >>> An interface purpose-built to support incremental backup and restore would almost definitely be more efficient since it could account for compaction, and would separate operational requirements from storage layer implementation details.
> >>>
> >>> On Jan 22, 2025, at 2:33 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
> >>>
> >>> On Wed, Jan 22, 2025 at 2:21 AM James Berragan <jberra...@gmail.com> wrote:
> >>>
> >>>> I think this is an idea worth exploring, my guess is that even if the scope is confined to just "copy if not exists" it would still largely be used as a cloud-agnostic backup/restore solution, and so will be shaped accordingly.
> >>>>
> >>>> Some thoughts:
> >>>>
> >>>> - I think it would be worth exploring more what the directory structure looks like. You mention a flat directory hierarchy, but it seems to me it would need to be delimited by node (or token range) in some way, as the SSTable identifier will not be unique across the cluster. If we do need to delimit by node, is the configuration burden then on the user to mount individual drives to S3/Azure/wherever with unique per-node paths? What do they do in the event of a host replacement, back up to a new empty directory?
> >>>
> >>> It will be unique when "uuid_sstable_identifiers_enabled: true", even across the cluster. If we worked with "old identifiers" too, those are indeed not unique (not even across different tables in the same node). I am not completely sure how far we want to go with this; I don't have a problem saying that we support this feature only with "uuid_sstable_identifiers_enabled: true". If we were to support the older SSTable identifier naming as well, that would complicate it more. Esop's directory structure of a remote destination is here:
> >>>
> >>> https://github.com/instaclustr/esop?tab=readme-ov-file#directory-structure-of-a-remote-destination
> >>>
> >>> and what the content of the snapshot's manifest looks like is just below it.
> >>>
> >>> We may go with a hierarchical structure as well if that is evaluated to be a better approach; I just find a flat hierarchy simpler. We cannot have a flat hierarchy with old / non-unique identifiers, so we would need to find a way to differentiate one SSTable from another, which naturally leads to them being placed in a keyspace/table/sstable hierarchy. But I do not want to complicate it further by supporting flat and non-flat hierarchies simultaneously (where a user could pick which one they want). We should go with just one solution.
> >>>
> >>> When it comes to node replacement, I think it would just be up to an operator to rename the whole directory to reflect a new path for that particular node. Imagine an operator has a bucket in Azure which is empty (/) and it is mounted to /mnt/nfs/cassandra on every node. Then on node 1, Cassandra would automatically start to put SSTables into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-1 and node 2 would put them into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-2.
> >>>
> >>> The "cluster-name/dc-name/node-id" part would be added automatically by Cassandra itself. It would just append it to /mnt/nfs/cassandra, under which the bucket is mounted.
> >>>
> >>> If you replaced the node, the data would stay; only the node's ID would change. In that case, all that would be necessary is to rename the "node-id-1" directory to "node-id-3" (id-3 being the host id of the new node). The snapshot manifest does not know anything about host ids, so the content of the manifest would not need to be changed. If you don't rename the node id directory, then snapshots would indeed be made under a new host id directory, which would be empty at first.
> >>>
> >>>> - The challenge with restore is often restoring from snapshots created before a cluster topology change (node replacements, token moves, cluster expansions/shrinks, etc.). This could be solved by storing the snapshot token information in the manifest somewhere. Ideally the user shouldn't have to scan token information across all SSTables snapshot-wide to determine which ones to restore.
> >>>
> >>> Yes, see the content of the snapshot manifest as I mentioned already (a couple of lines below the example of the directory hierarchy). We are storing "tokens" and "schemaVersion". Each snapshot manifest also contains "schemaContent" with a CQL representation of the schema that all SSTables in a logical snapshot belong to, so an operator knows what the schema was at the time the snapshot was taken, plus what the tokens and the schema version were.
> >>>
> >>>> - I didn't understand the TTL mechanism. If we only copy SSTables that haven't been seen before, some SSTables will exist indefinitely across snapshots (e.g. L4), while others (in L0) will quickly disappear. There needs to be a mechanism to determine if the SSTable is expirable (i.e. no longer exists in active snapshots) by comparing the manifests at the time of snapshot TTL.
> >>>
> >>> I am not completely sure I get this. What I meant by TTL is that there is currently functionality in "nodetool snapshot" where you can specify a TTL flag which says that after e.g. 1 day, this snapshot will be automatically deleted. I was talking about the scenario where this snapshot is backed up and then, after 1 day, we realize that we are going to remove it. That is done by periodically checking the manifest of every snapshot to see whether that snapshot has expired or not. If it has, then we just remove that snapshot.
> >>>
> >>> Removal of a snapshot means that we go over every SSTable it logically consists of and check against all other manifests we have whether that SSTable is also part of those snapshots or not. If it is not, i.e. that SSTable exists only in the snapshot we are removing and nowhere else, we can proceed to physically remove that SSTable. If it does exist in other snapshots, then we will not remove it, because we would make the other snapshots corrupt, pointing to an SSTable which would no longer be there.
> >>>
> >>> If I have a snapshot consisting of 5 SSTables, then all of these SSTables are compacted into 1 and I take a snapshot again, the second snapshot will consist of that 1 SSTable only. When I remove the first snapshot, I can just remove all 5 SSTables, because none of them is part of any other snapshot. The second snapshot consists of 1 SSTable only, which is different from all the SSTables found in the first snapshot.
> >>>
> >>>> Broadly it sounds like we are saving the operator the burden of performing snapshot uploads to some cloud service, but there are benefits (at least from a backup perspective) to performing this independently, e.g. managing bandwidth usage or additional security layers.
> >>>
> >>> Managing bandwidth is an interesting topic. In Esop, bandwidth is configurable: you can say how many bytes per second it should upload at, or you can say in what time you expect the snapshot to be uploaded. E.g. if we have 10 GiB to upload and you say that you have 5 hours for that, then it will compute how many bytes per second it should upload at. If a cluster is under a lot of stress / talks a lot, we do not want to put even more network load on it because of snapshots. Snapshots can just be uploaded as something of lower significance / importance. This might all be done in this work as well, maybe as a follow-up.
> >>>
> >>>> James.
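To make Jeremiah's "snapshot archiving command" idea more concrete, here is a minimal, hypothetical Java sketch of the server-side hook: after the usual hard-linked snapshot has been created, an operator-supplied command is run with the snapshot directory substituted in. The class name, the %s placeholder and the rclone example are illustrative assumptions, not existing Cassandra configuration or APIs.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of a post-snapshot "archiving command" hook.
public final class SnapshotArchiveCommand
{
    public static void run(String commandTemplate, Path snapshotDir) throws IOException, InterruptedException
    {
        // e.g. commandTemplate = "rclone copy %s remote:cassandra-backups/" (entirely up to the operator)
        String command = String.format(commandTemplate, snapshotDir.toAbsolutePath());

        Process process = new ProcessBuilder(List.of("/bin/sh", "-c", command))
                          .inheritIO() // in this sketch, surface the command's output in the server's stdout/stderr
                          .start();

        int exitCode = process.waitFor();
        if (exitCode != 0)
            throw new IOException("snapshot archiving command failed with exit code " + exitCode);
    }
}
```

The operator-facing contract would simply be "a command that receives the snapshot directory", whether it ends up calling rclone, a copy to another mount point, or anything else.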
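The removal rule Štefan describes above (an SSTable is physically deleted only when no remaining snapshot manifest references it) can be summarized in a short sketch. This is illustrative only, assuming manifests have already been parsed into sets of file names; it is not Cassandra's or Esop's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;

// Minimal sketch of reference-checked snapshot removal over a flat snapshot directory.
public final class SnapshotRemover
{
    /**
     * Delete a snapshot: an SSTable file is physically removed only if no other
     * snapshot's manifest still references it.
     */
    public static void removeSnapshot(String snapshotToRemove,
                                      Map<String, Set<Path>> manifests, // snapshot name -> SSTable files it consists of
                                      Path flatSnapshotDir) throws IOException
    {
        Set<Path> candidates = manifests.remove(snapshotToRemove);
        if (candidates == null)
            return;

        for (Path sstable : candidates)
        {
            boolean referencedElsewhere = manifests.values().stream()
                                                   .anyMatch(tables -> tables.contains(sstable));
            if (!referencedElsewhere)
                Files.deleteIfExists(flatSnapshotDir.resolve(sstable));
        }
    }
}
```

In the compaction example above, removing the first snapshot deletes all 5 of its SSTables because the second snapshot's manifest references only the single post-compaction SSTable.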
> >>>> On Tue, 21 Jan 2025 at 01:05, Štefan Miklošovič <smikloso...@apache.org> wrote:
> >>>>
> >>>>> If you ask specifically about how snapshots with a TTL are handled, there is a thread running a task scheduled every n seconds (not sure what the default is) which just checks the "expired_at" field in the manifest to see whether the snapshot has expired or not. If it has, then it will proceed to delete it like any other snapshot. Then the logic I have described above would apply.
> >>>>>
> >>>>> On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
> >>>>>
> >>>>>> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
> >>>>>>
> >>>>>>> I think we should evaluate the benefits of the feature you are proposing independently of how it might be used by Sidecar or other tools. As it is, it already sounds like useful functionality to have in the core of the Cassandra process.
> >>>>>>>
> >>>>>>> Tooling around Cassandra, including Sidecar, can then leverage this functionality to create snapshots, and then add additional capabilities on top to perform backups.
> >>>>>>>
> >>>>>>> I've added some comments inline below:
> >>>>>>>
> >>>>>>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
> >>>>>>> > Hi,
> >>>>>>> >
> >>>>>>> > I would like to run this through the ML to gather feedback as we are contemplating making this happen.
> >>>>>>> >
> >>>>>>> > Currently, snapshots are just hardlinks, located in a snapshot directory, pointing into the live data directory. That is super handy as it occupies virtually zero disk space etc. (as long as the underlying SSTables are not compacted away; then their size would "materialize").
> >>>>>>> >
> >>>>>>> > On the other hand, because it is a hardlink, it is not possible to make hard links across block devices (the infamous "Invalid cross-device link" error). That means that snapshots can only ever be located on the very same disk Cassandra has its datadirs on.
> >>>>>>> >
> >>>>>>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share) mounted to a Cassandra node and they would like to use that as cheap / cold storage for snapshots. They do not care about the speed of such storage, nor do they care about how much space it occupies etc. when it comes to snapshots. On the other hand, they do not want to have snapshots occupying disk space where Cassandra has its data, because they consider it a waste of space. They would like to utilize the fast disk and its space for production data to the max, and snapshots might eat a lot of that space unnecessarily.
> >>>>>>> >
> >>>>>>> > There might be a configuration property like "snapshot_root_dir: /mnt/nfs/cassandra" and if a snapshot is taken, it would just copy SSTables there, but we need to be a little bit smart here. (By default, it would all work as it does now: hard links to snapshot directories located under Cassandra's data_file_directories.)
> >>>>>>> >
> >>>>>>> > Because it is a copy, it occupies disk space. But if we took 100 snapshots of the same SSTables, we would not want to copy the same files 100 times. There is a very handy way to prevent this: unique SSTable identifiers (under the already existing uuid_sstable_identifiers_enabled property), so we could have a flat destination hierarchy where all SSTables would be located
> >>>>>>>
> >>>>>>> I have some questions around the flat destination hierarchy. For example, how do you keep track of TTLs for different snapshots? What if one snapshot doesn't have a TTL and the second does? Those details will need to be worked out. Of course, we can discuss these things during the implementation of the feature.
> >>>>>>
> >>>>>> There would be a list of files a logical snapshot consists of in a snapshot manifest. We would keep track of which SSTables are in which snapshots.
> >>>>>>
> >>>>>> This is not tied to TTL; any two non-expiring snapshots could share the same SSTables. If you go to remove one snapshot and you are about to remove an SSTable, you need to check whether that particular SSTable is part of any other snapshot. If it is, then you cannot remove it while removing that snapshot, because that table is part of another one. If you removed it, then you would make the other snapshot corrupt, as it would be missing that SSTable.
> >>>>>>
> >>>>>> This logic is already implemented in Instaclustr Esop (1) (Esop as in that Greek guy telling the fables (2)), the tooling we offer for backups and restores against various cloud providers. This has already been implemented and I feel confident it could be replicated here, but without the ton of baggage that comes from having to accommodate specific clouds. I am not saying at all that the code from that tool would end up in Cassandra. No. What I am saying is that we have implemented that logic already, and in Cassandra it would be just way simpler.
> >>>>>>
> >>>>>> (1) https://github.com/instaclustr/esop
> >>>>>> (2) https://en.wikipedia.org/wiki/Aesop
> >>>>>>
> >>>>>>> > in the same directory, and we would just check whether such an SSTable is already there or not before copying it. Snapshot manifests (currently under manifest.json) would then contain all SSTables a logical snapshot consists of.
> >>>>>>> >
> >>>>>>> > This would be possible only for _user snapshots_. All snapshots taken by Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots against all system tables, ephemeral snapshots) would continue to be hard links and it would not be possible to locate them outside of the live data dirs.
> >>>>>>> >
> >>>>>>> > The advantages / characteristics of this approach for user snapshots:
> >>>>>>> >
> >>>>>>> > 1. Cassandra will be able to create snapshots located on different devices.
> >>>>>>> > 2. From an implementation perspective it would be totally transparent; there would be no specific code about "where" we copy. We would just copy, from a Java perspective, as we copy anywhere else.
> >>>>>>> > 3. All the tooling would work as it does now - nodetool listsnapshots / clearsnapshot / snapshot. Same outputs, same behavior.
> >>>>>>> > 4. No need to use external tools to copy SSTables to the desired destination, custom scripts, manual synchronisation ...
> >>>>>>> > 5. Snapshots located outside of Cassandra's live data dirs would behave the same when it comes to snapshot TTL. (A TTL on a snapshot means that after a given period of time it is automatically removed.) This logic would be the same. Hence, there is no need to reinvent the wheel when it comes to removing expired snapshots from the operator's perspective.
> >>>>>>> > 6. Such a solution would deduplicate SSTables, so it would be as space-efficient as possible (though not as efficient as hardlinks, for the obvious reasons mentioned above).
> >>>>>>> >
> >>>>>>> > It seems to me that there has recently been a "push" to add more logic to Cassandra that was previously delegated to external tooling; for example, the CEP around automatic repairs is basically doing what external tooling does, we just move it under Cassandra. We would love to get rid of a lot of tooling and custom-written logic around copying snapshot SSTables. From the implementation perspective it would be just plain Java, without any external dependencies etc. There seems to be a lot to gain from relatively straightforward additions to the snapshotting code.
> >>>>>>>
> >>>>>>> Agree that there are things that need to move closer to the database process where it makes sense. Repair is an obvious one. This change seems beneficial as well, and for use cases that do not need to rely on this functionality the behavior would remain the same, so I see this as a win.
> >>>>>>>
> >>>>>>> > We did some serious housekeeping in CASSANDRA-18111, where we consolidated and centralized everything related to snapshot management, so we feel comfortable building logic like this on top of that. In fact, CASSANDRA-18111 was a prerequisite for this because we did not want to base this work on the pre-18111 state of things when it comes to snapshots (it was all over the code base, with fragmented and duplicated logic etc).
> >>>>>>> >
> >>>>>>> > WDYT?
> >>>>>>> >
> >>>>>>> > Regards
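For reference, here is a rough sketch of the "copy if not exists" behaviour proposed in this thread: user snapshots copied into a flat per-node directory under a snapshot_root_dir-style location, deduplicated by relying on uuid_sstable_identifiers_enabled making file names unique cluster-wide, with the copied files recorded for the snapshot's manifest. The directory layout and all names are assumptions based on the discussion, not an actual implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a deduplicating, copying user snapshot into a flat per-node directory.
public final class CopyingSnapshotWriter
{
    /**
     * Copy the snapshot's SSTable components into a flat per-node directory under the
     * configured snapshot root (e.g. /mnt/nfs/cassandra/<cluster>/<dc>/<host-id>/),
     * skipping any file that is already there, and return the list of files the
     * logical snapshot consists of (to be recorded in its manifest).
     */
    public static List<Path> copySnapshot(List<Path> sstableComponents, Path perNodeRoot) throws IOException
    {
        Files.createDirectories(perNodeRoot);
        List<Path> manifestEntries = new ArrayList<>();

        for (Path component : sstableComponents)
        {
            Path destination = perNodeRoot.resolve(component.getFileName());
            // Deduplication: a component already copied for an earlier snapshot is not copied again.
            if (!Files.exists(destination))
                Files.copy(component, destination, StandardCopyOption.COPY_ATTRIBUTES);
            manifestEntries.add(destination.getFileName());
        }
        return manifestEntries;
    }
}
```

The snapshot's manifest.json would then list these entries, which is what makes the reference-checked removal sketched earlier in the thread possible.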