I feel uneasy about executing scripts from Cassandra. Jon was talking about this here (1) as well. I would not base this on any shell script / command execution. I think nothing beats pure Java copying files to a directory ...

(1) https://lists.apache.org/thread/jcr3mln2tohbckvr8fjrr0sq0syof080
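For illustration, a minimal sketch of that "pure Java" copying, using nothing but java.nio (the SnapshotCopier class and its method are invented for this sketch; they are not existing Cassandra code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class SnapshotCopier
{
    /**
     * Copies a single SSTable component into the snapshot destination,
     * creating the destination directory as needed. No shelling out involved.
     */
    public static Path copy(Path sstableComponent, Path snapshotDir) throws IOException
    {
        Files.createDirectories(snapshotDir);
        Path target = snapshotDir.resolve(sstableComponent.getFileName());
        // deliberately no REPLACE_EXISTING: a component that is already
        // present is simply kept (the "copy if not exists" idea discussed below)
        if (Files.notExists(target))
            Files.copy(sstableComponent, target, StandardCopyOption.COPY_ATTRIBUTES);
        return target;
    }
}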
On Thu, Jan 23, 2025 at 5:16 PM Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:

> For commit log archiving we already have the concept of “commands” to be executed. Maybe a similar concept would be useful for snapshots? Maybe a new “user snapshot with command” nodetool action could be added. The server would make its usual hard links inside a snapshot folder and then it could shell off a new process running the “snapshot archiving command”, passing it the directory just made. Then whatever logic is wanted could be implemented in the command script. Be that copying to S3, or copying to a folder on another mount point, or whatever the operator wants to happen.
>
> -Jeremiah
>
> On Jan 23, 2025 at 7:54:20 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>
>> Interesting, I will need to think about it more. Thanks for chiming in.
>>
>> On Wed, Jan 22, 2025 at 8:10 PM Blake Eggleston <beggles...@apple.com> wrote:
>>
>>> Somewhat tangential, but I’d like to see Cassandra provide a backup story that doesn’t involve making copies of sstables. They’re constantly rewritten by compaction, and intelligent backup systems often need to be able to read sstable metadata to optimize storage usage.
>>>
>>> An interface purpose-built to support incremental backup and restore would almost definitely be more efficient since it could account for compaction, and would separate operational requirements from storage layer implementation details.
>>>
>>> On Jan 22, 2025, at 2:33 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>
>>> On Wed, Jan 22, 2025 at 2:21 AM James Berragan <jberra...@gmail.com> wrote:
>>>
>>>> I think this is an idea worth exploring; my guess is that even if the scope is confined to just "copy if not exists" it would still largely be used as a cloud-agnostic backup/restore solution, and so will be shaped accordingly.
>>>>
>>>> Some thoughts:
>>>>
>>>> - I think it would be worth exploring more what the directory structure looks like. You mention a flat directory hierarchy, but it seems to me it would need to be delimited by node (or token range) in some way, as the SSTable identifier will not be unique across the cluster. If we do need to delimit by node, is the configuration burden then on the user to mount individual drives to S3/Azure/wherever to unique per-node paths? What do they do in the event of a host replacement, back up to a new empty directory?
>>>
>>> It will be unique when "uuid_sstable_identifiers_enabled: true", even across the cluster. If we worked with "old identifiers" too, those are indeed not unique (even across different tables in the same node). I am not completely sure how far we want to go with this; I don't have a problem saying that we support this feature only with "uuid_sstable_identifiers_enabled: true". If we were to support the older SSTable identifier naming as well, that would complicate it more. Esop's directory structure of a remote destination is here:
>>>
>>> https://github.com/instaclustr/esop?tab=readme-ov-file#directory-structure-of-a-remote-destination
>>>
>>> and how the content of the snapshot's manifest looks is shown just below it.
>>>
>>> We may go with a hierarchical structure as well if that is evaluated to be a better approach. I just find a flat hierarchy simpler.
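To make the "copy if not exists" idea with a flat destination concrete, here is a rough sketch assuming "uuid_sstable_identifiers_enabled: true", so a component file name never repeats across the cluster; the class and method names are made up for illustration and are not actual Cassandra or Esop APIs:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public final class FlatSnapshotDestination
{
    private final Path root; // e.g. /mnt/nfs/cassandra/<cluster-name>/<dc-name>/<host-id>

    public FlatSnapshotDestination(Path root)
    {
        this.root = root;
    }

    /**
     * Copies only the components that are not present yet. With UUID-based
     * SSTable identifiers an existing file means the very same SSTable was
     * already uploaded by a previous snapshot, so it can be skipped.
     */
    public void copyIfNotExists(List<Path> sstableComponents) throws IOException
    {
        Files.createDirectories(root);
        for (Path component : sstableComponents)
        {
            Path target = root.resolve(component.getFileName());
            if (Files.notExists(target))
                Files.copy(component, target, StandardCopyOption.COPY_ATTRIBUTES);
        }
    }
}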
>>> We cannot have a flat hierarchy with old / non-unique identifiers, so we would need to find a way to differentiate one SSTable from another, which naturally leads to them being placed in a keyspace/table/sstable hierarchy. But I do not want to complicate it further by supporting flat and non-flat hierarchies simultaneously (where a user could pick which one they want). We should go with just one solution.
>>>
>>> When it comes to node replacement, I think it would be just up to an operator to rename the whole directory to reflect a new path for that particular node. Imagine an operator has a bucket in Azure which is empty (/) and it is mounted to /mnt/nfs/cassandra on every node. Then on node 1, Cassandra would automatically start to put SSTables into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-1 and node 2 would put them into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-2.
>>>
>>> The "cluster-name/dc-name/node-id" part would be appended automatically by Cassandra itself to /mnt/nfs/cassandra, under which the bucket is mounted.
>>>
>>> If you replaced the node, the data would stay; only the node's ID would change. In that case, all that would be necessary would be to rename the "node-id-1" directory to "node-id-3" (id-3 being the host id of the replacement node). The snapshot manifest does not know anything about the host id, so the content of the manifest would not need to be changed. If you don't rename the node id directory, then snapshots would indeed be made under a new host id directory which would be empty at first.
>>>
>>>> - The challenge often with restore is restoring from snapshots created before a cluster topology change (node replacements, token moves, cluster expansions/shrinks etc.). This could be solved by storing the snapshot token information in the manifest somewhere. Ideally the user shouldn't have to scan token information across all SSTables in a snapshot to determine which ones to restore.
>>>
>>> Yes, see the content of the snapshot manifest as I mentioned already (a couple of lines below the example of the directory hierarchy). We are storing "tokens" and "schemaVersion". Each snapshot manifest also contains "schemaContent" with the CQL representation of the schema all SSTables in a logical snapshot belong to, so an operator knows what the schema was at the time that snapshot was taken, plus what the tokens were, plus what the schema version was.
>>>
>>>> - I didn't understand the TTL mechanism. If we only copy SSTables that haven't been seen before, some SSTables will exist indefinitely across snapshots (i.e. L4), while others (in L0) will quickly disappear. There needs to be a mechanism to determine if the SSTable is expirable (i.e. no longer exists in active snapshots) by comparing the manifests at the time of snapshot TTL.
>>>
>>> I am not completely sure I get this. What I meant by TTL is that there is functionality currently in "nodetool snapshot" where you can specify a TTL flag which says that in e.g. 1 day this snapshot will be automatically deleted. I was talking about the scenario where this snapshot is backed up and then, after 1 day, we realize that we are going to remove it. That is done by periodically checking, in the manifest of every snapshot, whether that snapshot is evaluated as expired or not. If it is, then we just remove that snapshot.
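As a sketch of that periodic expiry check, assuming the manifest exposes something like an "expired_at" timestamp (the SnapshotManifest interface and field names here are assumptions for illustration, not the real manifest format):

import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public final class SnapshotTtlChecker
{
    /** Minimal view of a snapshot manifest, just enough for this sketch. */
    public interface SnapshotManifest
    {
        String snapshotName();
        Instant expiredAt(); // null when the snapshot has no TTL
    }

    /** Returns the snapshots whose TTL has elapsed and which should be removed. */
    public static List<SnapshotManifest> findExpired(List<SnapshotManifest> manifests, Instant now)
    {
        return manifests.stream()
                        .filter(m -> m.expiredAt() != null && m.expiredAt().isBefore(now))
                        .collect(Collectors.toList());
    }
}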
>>> Removal of a snapshot means that we just go over every SSTable it logically consists of and check, against all other manifests we have, whether that SSTable is also part of those snapshots or not. If it is not, that is, if that SSTable exists only in the snapshot we are about to remove and nowhere else, we can proceed to physically remove that SSTable. If it does exist in other snapshots, then we will not remove it, because we would make those other snapshots corrupt, pointing to an SSTable which would no longer be there.
>>>
>>> If I have a snapshot consisting of 5 SSTables, then all these SSTables are compacted into 1 and I make a snapshot again, the second snapshot will consist of 1 SSTable only. When I remove the first snapshot, I can just remove all 5 SSTables, because none of them is part of any other snapshot. The second snapshot consists of 1 SSTable only, which is different from all SSTables found in the first snapshot.
>>>
>>>> Broadly it sounds like we are saving the operator the burden of performing snapshot uploads to some cloud service, but there are benefits (at least from a backup perspective) of performing that independently - i.e. managing bandwidth usage or additional security layers.
>>>
>>> Managing bandwidth is an interesting topic. What Esop does is make the bandwidth configurable. You can say how many bytes per second it should upload with, or you can say in what time you expect the snapshot to be uploaded. E.g. if we have 10 GiB to upload and you say that you have 5 hours for that, then it will compute how many bytes per second it should upload with. If a cluster is under a lot of stress / talks a lot, we do not want to put even more load on it in terms of network traffic because of snapshots. Snapshots can just be uploaded as something with lower significance / importance. This might all be done in this work as well, maybe as some follow-up.
>>>
>>>> James.
>>>>
>>>> On Tue, 21 Jan 2025 at 01:05, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>
>>>>> If you ask specifically about how TTL snapshots are handled, there is a thread running with a task scheduled every n seconds (not sure what the default is) and it just checks against the "expired_at" field in the manifest whether it is expired or not. If it is, then it will proceed to delete it like any other snapshot. Then the logic I have described above would be in place.
>>>>>
>>>>> On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>>>
>>>>>> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
>>>>>>
>>>>>>> I think we should evaluate the benefits of the feature you are proposing independently of how it might be used by Sidecar or other tools. As it is, it already sounds like useful functionality to have in the core of the Cassandra process.
>>>>>>>
>>>>>>> Tooling around Cassandra, including Sidecar, can then leverage this functionality to create snapshots, and then add additional capabilities on top to perform backups.
>>>>>>>
>>>>>>> I've added some comments inline below:
>>>>>>>
>>>>>>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > I would like to run this through the ML to gather feedback as we are contemplating making this happen.
>>>>>>> > Currently, snapshots are just hardlinks located in a snapshot directory pointing to the live data directory. That is super handy as it occupies virtually zero disk space etc. (as long as the underlying SSTables are not compacted away; then their size would "materialize").
>>>>>>> >
>>>>>>> > On the other hand, because it is a hardlink, it is not possible to make hard links across block devices (the infamous "Invalid cross-device link" error). That means that snapshots can only ever be located on the very same disk Cassandra has its data dirs on.
>>>>>>> >
>>>>>>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share) mounted to a Cassandra node and they would like to use that as cheap / cold storage of snapshots. They do not care about the speed of such storage, nor do they care about how much space it occupies etc. when it comes to snapshots. On the other hand, they do not want to have snapshots occupying disk space where Cassandra has its data, because they consider that to be a waste of space. They would like to utilize the fast disk and its space for production data to the max, and snapshots might eat a lot of that space unnecessarily.
>>>>>>> >
>>>>>>> > There might be a configuration property like "snapshot_root_dir: /mnt/nfs/cassandra" and if a snapshot is taken, it would just copy SSTables there, but we need to be a little bit smart here. (By default, it would all work as it does now - hard links to snapshot directories located under Cassandra's data_file_directories.)
>>>>>>> >
>>>>>>> > Because it is a copy, it occupies disk space. But if we took 100 snapshots of the same SSTables, we would not want to copy the same files 100 times. There is a very handy way to prevent this - unique SSTable identifiers (under the already existing uuid_sstable_identifiers_enabled property) - so we could have a flat destination hierarchy where all SSTables would be located
>>>>>>>
>>>>>>> I have some questions around the flat destination hierarchy. For example, how do you keep track of TTLs for different snapshots? What if one snapshot doesn't have a TTL and the second does? Those details will need to be worked out. Of course, we can discuss these things during implementation of the feature.
>>>>>>
>>>>>> There would be a list of files a logical snapshot consists of in a snapshot manifest. We would keep track of what SSTables are in what snapshots.
>>>>>>
>>>>>> This is not tied to TTL; any two non-expiring snapshots could share the same SSTables. If you go to remove one snapshot and you go to remove an SSTable, you need to check whether that particular SSTable is part of any other snapshot. If it is, then you cannot remove it while removing that snapshot, because that table is part of another one. If you removed it, then you would make the other snapshot corrupt as it would be missing that SSTable.
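A sketch of that reference check (the manifest representation and names are hypothetical; the point is only that an SSTable may be physically deleted when no other snapshot's manifest still references it):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Set;

public final class SnapshotRemover
{
    /** Minimal manifest view for this sketch: a snapshot is a named set of SSTable files. */
    public record Manifest(String snapshotName, Set<Path> sstables) {}

    /**
     * Removes the given snapshot's SSTables from the shared destination,
     * skipping every file that is still referenced by any other manifest.
     */
    public static void remove(Manifest toRemove, Collection<Manifest> allManifests) throws IOException
    {
        for (Path sstable : toRemove.sstables())
        {
            boolean referencedElsewhere =
                allManifests.stream()
                            .filter(m -> !m.snapshotName().equals(toRemove.snapshotName()))
                            .anyMatch(m -> m.sstables().contains(sstable));

            if (!referencedElsewhere)
                Files.deleteIfExists(sstable); // no other snapshot points at it
        }
    }
}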
>>>>>> This logic is already implemented in Instaclustr Esop (1) (Esop as in that Greek guy telling the fables (2)), the tooling we offer for backups and restores against various cloud providers. This stuff is already implemented and I feel confident it could be replicated here, but without the ton of baggage which comes from the fact that we need to accommodate specific clouds. I am not saying at all that the code from that tool would end up in Cassandra. No. What I am saying is that we have implemented that logic already, and in Cassandra it would be just way simpler.
>>>>>>
>>>>>> (1) https://github.com/instaclustr/esop
>>>>>> (2) https://en.wikipedia.org/wiki/Aesop
>>>>>>
>>>>>>> > in the same directory and we would just check if such an SSTable is already there or not before copying it. Snapshot manifests (currently under manifest.json) would then contain all SSTables a logical snapshot consists of.
>>>>>>> >
>>>>>>> > This would be possible only for _user snapshots_. All snapshots taken by Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots against all system tables, ephemeral snapshots) would continue to be hard links and it would not be possible to locate them outside of the live data dirs.
>>>>>>> >
>>>>>>> > The advantages / characteristics of this approach for user snapshots:
>>>>>>> >
>>>>>>> > 1. Cassandra will be able to create snapshots located on different devices.
>>>>>>> > 2. From an implementation perspective it would be totally transparent; there would be no specific code about "where" we copy. From a Java perspective, we would just copy as we copy anywhere else.
>>>>>>> > 3. All the tooling would work as it does now - nodetool listsnapshots / clearsnapshot / snapshot. Same outputs, same behavior.
>>>>>>> > 4. No need to use external tools copying SSTables to the desired destination, custom scripts, manual synchronisation ...
>>>>>>> > 5. Snapshots located outside of Cassandra's live data dirs would behave the same when it comes to snapshot TTL. (TTL on a snapshot means that after such and such a period of time, it is automatically removed.) This logic would be the same. Hence, there is no need to re-invent the wheel when it comes to removing expired snapshots from the operator's perspective.
>>>>>>> > 6. Such a solution would deduplicate SSTables, so it would be as space-efficient as possible (but not as efficient as hardlinks, for the obvious reasons mentioned above).
>>>>>>> >
>>>>>>> > It seems to me that there has recently been a "push" to add more logic to Cassandra where it was previously delegated to external tooling; for example, the CEP around automatic repairs is basically doing what external tooling does, we just move it under Cassandra. We would love to get rid of a lot of tooling and custom-written logic around copying snapshot SSTables. From the implementation perspective it would be just plain Java, without any external dependencies etc. There seems to be a lot to gain for relatively straightforward additions to the snapshotting code.
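Putting the proposal above together, a sketch of how the decision between today's hardlink behaviour and the proposed copy into "snapshot_root_dir" could look; the class, the isUserSnapshot flag and the way the target is resolved are placeholders for illustration, not the eventual design:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class SnapshotTargetResolver
{
    private final Path snapshotRootDir; // null => option not set, current behaviour everywhere

    public SnapshotTargetResolver(Path snapshotRootDir)
    {
        this.snapshotRootDir = snapshotRootDir;
    }

    /**
     * User snapshots go to the configured root as plain copies (possibly on
     * another block device); everything else keeps today's hardlink-into-datadir
     * behaviour, including all snapshots taken by Cassandra itself.
     */
    public void snapshotComponent(Path component, Path defaultSnapshotDir, boolean isUserSnapshot) throws IOException
    {
        if (isUserSnapshot && snapshotRootDir != null)
        {
            Files.createDirectories(snapshotRootDir);
            Path target = snapshotRootDir.resolve(component.getFileName());
            if (Files.notExists(target))
                Files.copy(component, target, StandardCopyOption.COPY_ATTRIBUTES);
        }
        else
        {
            Files.createDirectories(defaultSnapshotDir);
            // same-device hard link, exactly as snapshots work today
            Files.createLink(defaultSnapshotDir.resolve(component.getFileName()), component);
        }
    }
}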
>>>>>>> Agree that there are things that need to move closer to the database process where it makes sense. Repair is an obvious one. This change seems beneficial as well, and for use cases that do not need to rely on this functionality the behavior would remain the same, so I see this as a win.
>>>>>>>
>>>>>>> > We did serious housekeeping in CASSANDRA-18111, where we consolidated and centralized everything related to snapshot management, so we feel comfortable building logic like this on top of that. In fact, CASSANDRA-18111 was a prerequisite for this because we did not want to base this work on the pre-18111 state of things when it comes to snapshots (it was all over the code base, with fragmented and duplicated logic etc.).
>>>>>>> >
>>>>>>> > WDYT?
>>>>>>> >
>>>>>>> > Regards