Interesting, I will need to think about it more. Thanks for chiming in.

On Wed, Jan 22, 2025 at 8:10 PM Blake Eggleston <beggles...@apple.com> wrote:
> Somewhat tangential, but I’d like to see Cassandra provide a backup story
> that doesn’t involve making copies of sstables. They’re constantly
> rewritten by compaction, and intelligent backup systems often need to be
> able to read sstable metadata to optimize storage usage.
>
> An interface purpose-built to support incremental backup and restore would
> almost certainly be more efficient, since it could account for compaction
> and would separate operational requirements from storage-layer
> implementation details.
>
> On Jan 22, 2025, at 2:33 AM, Štefan Miklošovič <smikloso...@apache.org> wrote:
>
> On Wed, Jan 22, 2025 at 2:21 AM James Berragan <jberra...@gmail.com> wrote:
>
>> I think this is an idea worth exploring. My guess is that even if the
>> scope is confined to just "copy if not exists", it would still largely be
>> used as a cloud-agnostic backup/restore solution, and so will be shaped
>> accordingly.
>>
>> Some thoughts:
>>
>> - I think it would be worth exploring more what the directory structure
>> looks like. You mention a flat directory hierarchy, but it seems to me it
>> would need to be delimited by node (or token range) in some way, as the
>> SSTable identifier will not be unique across the cluster. If we do need to
>> delimit by node, is the configuration burden then on the user to mount
>> individual drives to S3/Azure/wherever with unique per-node paths? What do
>> they do in the event of a host replacement, back up to a new empty
>> directory?
>
> It will be unique when "uuid_sstable_identifiers_enabled: true", even
> across the cluster. If we worked with "old identifiers" too, those are
> indeed not unique (even across different tables on the same node). I am not
> completely sure how far we want to go with this; I don't have a problem
> saying that we support this feature only with
> "uuid_sstable_identifiers_enabled: true". If we were to support the older
> SSTable identifier naming as well, that would complicate it more. Esop's
> directory structure of a remote destination is here:
>
> https://github.com/instaclustr/esop?tab=readme-ov-file#directory-structure-of-a-remote-destination
>
> and what the content of the snapshot's manifest looks like is just below it.
>
> We may go with a hierarchical structure as well if that is evaluated to be
> a better approach; I just find a flat hierarchy simpler. We cannot have a
> flat hierarchy with old / non-unique identifiers, so we would need to find
> a way to differentiate one SSTable from another, which naturally leads to
> them being placed in a keyspace/table/sstable hierarchy. But I do not want
> to complicate it further by supporting flat and non-flat hierarchies
> simultaneously (where a user could pick which one they want). We should go
> with just one solution.
>
> When it comes to node replacement, I think it would just be up to an
> operator to rename the whole directory to reflect the new path for that
> particular node. Imagine an operator has a bucket in Azure which is empty
> (/) and it is mounted to /mnt/nfs/cassandra on every node. Then node 1
> would automatically start to put SSTables into
> /mnt/nfs/cassandra/cluster-name/dc-name/node-id-1 and node 2 would put them
> into /mnt/nfs/cassandra/cluster-name/dc-name/node-id-2.
>
> The "cluster-name/dc-name/node-id" part would be appended automatically by
> Cassandra itself to /mnt/nfs/cassandra, under which the bucket would be
> mounted.
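> Roughly, resolving that per-node destination could look something like the
> sketch below (the class and method names are made up purely for
> illustration; nothing here is actual Cassandra code):
>
>     import java.nio.file.Path;
>     import java.nio.file.Paths;
>     import java.util.UUID;
>
>     // Sketch only: composes the per-node snapshot destination under the
>     // mounted bucket, i.e. <snapshot_root_dir>/<cluster-name>/<dc-name>/<node-id>.
>     public final class SnapshotDestination
>     {
>         private final Path snapshotRootDir; // e.g. /mnt/nfs/cassandra (mounted bucket)
>         private final String clusterName;
>         private final String datacenter;
>         private final UUID hostId;
>
>         public SnapshotDestination(Path snapshotRootDir, String clusterName, String datacenter, UUID hostId)
>         {
>             this.snapshotRootDir = snapshotRootDir;
>             this.clusterName = clusterName;
>             this.datacenter = datacenter;
>             this.hostId = hostId;
>         }
>
>         public Path nodeRoot()
>         {
>             return snapshotRootDir.resolve(clusterName)
>                                   .resolve(datacenter)
>                                   .resolve(hostId.toString());
>         }
>
>         public static void main(String[] args)
>         {
>             UUID hostId = UUID.randomUUID();
>             Path root = new SnapshotDestination(Paths.get("/mnt/nfs/cassandra"),
>                                                 "cluster-name", "dc-name", hostId).nodeRoot();
>             // prints /mnt/nfs/cassandra/cluster-name/dc-name/<host id>
>             System.out.println(root);
>         }
>     }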
> If you replaced the node, the data would stay; only the node's ID would
> change. In that case, all that would be necessary is to rename the
> "node-id-1" directory to "node-id-3" (id-3 being the host id of the
> replacement node). The snapshot manifest does not know anything about the
> host id, so the content of the manifest would not need to be changed. If
> you don't rename the node id directory, then snapshots would indeed be made
> under a new host id directory, which would be empty at first.
>
>> - The challenge often with restore is restoring from snapshots created
>> before a cluster topology change (node replacements, token moves,
>> cluster expansions/shrinks etc.). This could be solved by storing the
>> snapshot token information in the manifest somewhere. Ideally the user
>> shouldn't have to scan token information across all SSTables snapshot-wide
>> to determine which ones to restore.
>
> Yes, see the content of the snapshot manifest I mentioned already (a couple
> of lines below the example of the directory hierarchy). We are storing
> "tokens" and "schemaVersion". Each snapshot manifest also contains
> "schemaContent" with the CQL representation of the schema that all SSTables
> in a logical snapshot belong to, so an operator knows what the schema was
> at the time the snapshot was taken, plus what the tokens were, plus what
> the schema version was.
>
>> - I didn't understand the TTL mechanism. If we only copy SSTables that
>> haven't been seen before, some SSTables will exist indefinitely across
>> snapshots (i.e. L4), while others (in L0) will quickly disappear. There
>> needs to be a mechanism to determine if the SSTable is expirable (i.e. no
>> longer exists in active snapshots) by comparing the manifests at the
>> time of snapshot TTL.
>
> I am not completely sure I get this. What I meant by TTL is the
> functionality currently in "nodetool snapshot" where you can specify a TTL
> flag which says that in, e.g., 1 day, this snapshot will be automatically
> deleted. I was talking about the scenario where this snapshot is backed up
> and then, after 1 day, it is due to be removed. That is done by
> periodically checking the manifest of every snapshot to see whether that
> snapshot is evaluated as expired or not. If it is, then we just remove that
> snapshot.
>
> Removal of a snapshot means that we go over every SSTable it logically
> consists of and check against all other manifests we have whether that
> SSTable is also part of those snapshots or not. If it is not, i.e. that
> SSTable exists only in the snapshot we are removing and nowhere else, we
> can proceed to physically remove that SSTable. If it does exist in other
> snapshots, then we will not remove it, because we would make the other
> snapshots corrupt, pointing to an SSTable which would no longer be there.
>
> If I have a snapshot consisting of 5 SSTables, then all these SSTables are
> compacted into 1 and I take a snapshot again, the second snapshot will
> consist of 1 SSTable only. When I remove the first snapshot, I can just
> remove all 5 SSTables, because none of them is part of any other snapshot.
> The second snapshot consists of 1 SSTable only, which is different from all
> SSTables found in the first snapshot.
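> Expressed as a sketch (types simplified and names made up; in reality this
> would work against the parsed manifest.json files of each snapshot):
>
>     import java.util.HashSet;
>     import java.util.Map;
>     import java.util.Set;
>
>     public final class SnapshotRemoval
>     {
>         /**
>          * Returns the SSTables of the snapshot being removed which are not
>          * referenced by any other snapshot's manifest and can therefore be
>          * physically deleted from the snapshot destination.
>          */
>         public static Set<String> removableSSTables(String snapshotToRemove,
>                                                     Map<String, Set<String>> manifests)
>         {
>             Set<String> candidates = new HashSet<>(manifests.get(snapshotToRemove));
>
>             for (Map.Entry<String, Set<String>> entry : manifests.entrySet())
>             {
>                 if (entry.getKey().equals(snapshotToRemove))
>                     continue;
>
>                 // any SSTable shared with another snapshot has to stay,
>                 // otherwise that other snapshot would become corrupt
>                 candidates.removeAll(entry.getValue());
>             }
>
>             return candidates;
>         }
>
>         public static void main(String[] args)
>         {
>             Map<String, Set<String>> manifests = Map.of(
>                 "snapshot-1", Set.of("a", "b", "c", "d", "e"),
>                 "snapshot-2", Set.of("f")); // "f" is the result of compacting a..e
>
>             // prints all five SSTables of snapshot-1 (in no particular order),
>             // since none of them is shared with snapshot-2
>             System.out.println(removableSSTables("snapshot-1", manifests));
>         }
>     }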
>> Broadly it sounds like we are saving the operator the burden of
>> performing snapshot uploads to some cloud service, but there are benefits
>> (at least from a backup perspective) to performing it independently - i.e.
>> managing bandwidth usage or additional security layers.
>
> Managing bandwidth is an interesting topic. What Esop does is make the
> bandwidth configurable. You can say how many bytes per second it should
> upload at, or you can say in what time you expect the snapshot to be
> uploaded. E.g. if we have 10 GiB to upload and you say that you have 5
> hours for that, then it will compute how many bytes per second it should
> upload at. If a cluster is under a lot of stress / talks a lot, we do not
> want to put even more load on it in terms of network traffic because of
> snapshots. Snapshots can just be uploaded as something of lower
> significance / importance. This might all be done in this work as well,
> maybe as a follow-up.
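> The computation itself is trivial, something along these lines (a sketch
> only, just to illustrate the numbers, not how any of this is actually
> implemented):
>
>     public final class ThrottleExample
>     {
>         public static void main(String[] args)
>         {
>             long totalBytes = 10L * 1024 * 1024 * 1024;  // 10 GiB to upload
>             long windowSeconds = 5L * 60 * 60;           // 5 hours to do it in
>
>             // ~582 KiB/s is enough to finish within the window
>             long bytesPerSecond = totalBytes / windowSeconds;
>             System.out.println(bytesPerSecond + " B/s");
>         }
>     }
>
> A copy loop could then simply sleep whenever it gets ahead of that rate;
> the exact throttling mechanism is an implementation detail.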
>> James.
>>
>> On Tue, 21 Jan 2025 at 01:05, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>>> If you ask specifically about how TTL snapshots are handled, there is a
>>> thread running with a task scheduled every n seconds (not sure what the
>>> default is) which just checks the "expired_at" field in the manifest to
>>> see whether the snapshot is expired or not. If it is, it will proceed to
>>> delete it like any other snapshot. Then the logic I have described above
>>> would be in place.
>>>
>>> On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>>
>>>> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
>>>>
>>>>> I think we should evaluate the benefits of the feature you are
>>>>> proposing independently of how it might be used by Sidecar or other
>>>>> tools. As it is, it already sounds like useful functionality to have in
>>>>> the core of the Cassandra process.
>>>>>
>>>>> Tooling around Cassandra, including Sidecar, can then leverage this
>>>>> functionality to create snapshots, and then add additional capabilities
>>>>> on top to perform backups.
>>>>>
>>>>> I've added some comments inline below:
>>>>>
>>>>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I would like to run this through the ML to gather feedback, as we are
>>>>> > contemplating making this happen.
>>>>> >
>>>>> > Currently, snapshots are just hard links in a snapshot directory
>>>>> > pointing to files in the live data directory. That is super handy, as
>>>>> > a snapshot occupies virtually zero disk space (as long as the
>>>>> > underlying SSTables are not compacted away; then their size would
>>>>> > "materialize").
>>>>> >
>>>>> > On the other hand, because it is a hard link, it is not possible to
>>>>> > make hard links across block devices (the infamous "Invalid
>>>>> > cross-device link" error). That means that snapshots can only ever be
>>>>> > located on the very same disk Cassandra has its data dirs on.
>>>>> >
>>>>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share)
>>>>> > mounted to a Cassandra node and they would like to use that as cheap /
>>>>> > cold storage for snapshots. They do not care about the speed of such
>>>>> > storage, nor do they care about how much space it occupies when it
>>>>> > comes to snapshots. On the other hand, they do not want snapshots
>>>>> > occupying disk space where Cassandra has its data, because they
>>>>> > consider that a waste of space. They would like to utilize the fast
>>>>> > disk and its space for production data to the max, and snapshots might
>>>>> > eat a lot of that space unnecessarily.
>>>>> >
>>>>> > There might be a configuration property like "snapshot_root_dir:
>>>>> > /mnt/nfs/cassandra", and if a snapshot is taken, it would just copy
>>>>> > SSTables there, but we need to be a little bit smart here. (By default,
>>>>> > it would all work as it does now - hard links to snapshot directories
>>>>> > located under Cassandra's data_file_directories.)
>>>>> >
>>>>> > Because it is a copy, it occupies disk space. But if we took 100
>>>>> > snapshots of the same SSTables, we would not want to copy the same
>>>>> > files 100 times. There is a very handy way to prevent this - unique
>>>>> > SSTable identifiers (under the already existing
>>>>> > uuid_sstable_identifiers_enabled property) - so we could have a flat
>>>>> > destination hierarchy where all SSTables would be located
>>>>>
>>>>> I have some questions around the flat destination hierarchy. For
>>>>> example, how do you keep track of TTLs for different snapshots? What if
>>>>> one snapshot doesn't have a TTL and the second does? Those details will
>>>>> need to be worked out. Of course, we can discuss these things during
>>>>> implementation of the feature.
>>>>
>>>> There would be a list of files a logical snapshot consists of in a
>>>> snapshot manifest. We would keep track of which SSTables are in which
>>>> snapshots.
>>>>
>>>> This is not tied to TTL; any two non-expiring snapshots could share the
>>>> same SSTables. When you remove one snapshot and are about to remove an
>>>> SSTable, you need to check whether that particular SSTable is part of any
>>>> other snapshot. If it is, then you cannot remove it while removing that
>>>> snapshot, because it is part of another one. If you removed it, you would
>>>> make the other snapshot corrupt, as it would be missing that SSTable.
>>>>
>>>> This logic is already implemented in Instaclustr Esop (1) (Esop as in
>>>> that Greek guy telling the fables (2)), the tooling we offer for backups
>>>> and restores against various cloud providers. This stuff is already
>>>> implemented and I feel confident it can be replicated here, but without a
>>>> ton of the baggage which comes from the fact that we need to accommodate
>>>> specific clouds. I am not saying at all that the code from that tool
>>>> would end up in Cassandra. No. What I am saying is that we have
>>>> implemented that logic already, and in Cassandra it would be just way
>>>> simpler.
>>>>
>>>> (1) https://github.com/instaclustr/esop
>>>> (2) https://en.wikipedia.org/wiki/Aesop
>>>>
>>>>> > in the same directory, and we would just check whether such an SSTable
>>>>> > is already there or not before copying it. Snapshot manifests
>>>>> > (currently under manifest.json) would then contain all SSTables a
>>>>> > logical snapshot consists of.
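>>>>> > In code, the idea is roughly the following (purely illustrative, no
>>>>> > actual API is implied here):
>>>>> >
>>>>> >     import java.io.IOException;
>>>>> >     import java.nio.file.Files;
>>>>> >     import java.nio.file.Path;
>>>>> >     import java.nio.file.StandardCopyOption;
>>>>> >
>>>>> >     public final class CopyIfAbsent
>>>>> >     {
>>>>> >         /**
>>>>> >          * Copies an SSTable component into the flat snapshot destination
>>>>> >          * only if a file with the same (unique) name is not there yet,
>>>>> >          * so 100 snapshots of the same SSTable copy it just once.
>>>>> >          */
>>>>> >         public static boolean copyIfAbsent(Path component, Path destinationDir) throws IOException
>>>>> >         {
>>>>> >             Path target = destinationDir.resolve(component.getFileName());
>>>>> >             if (Files.exists(target))
>>>>> >                 return false; // already backed up by a previous snapshot
>>>>> >
>>>>> >             Files.copy(component, target, StandardCopyOption.COPY_ATTRIBUTES);
>>>>> >             return true;
>>>>> >         }
>>>>> >     }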
>>>>> > This would be possible only for _user snapshots_. All snapshots taken
>>>>> > by Cassandra itself (diagnostic snapshots, snapshots upon repairs,
>>>>> > snapshots against all system tables, ephemeral snapshots) would
>>>>> > continue to be hard links, and it would not be possible to locate them
>>>>> > outside of the live data dirs.
>>>>> >
>>>>> > The advantages / characteristics of this approach for user snapshots:
>>>>> >
>>>>> > 1. Cassandra will be able to create snapshots located on different
>>>>> > devices.
>>>>> > 2. From an implementation perspective it would be totally transparent;
>>>>> > there would be no specific code about "where" we copy. We would just
>>>>> > copy, from the Java perspective, as we copy anywhere else.
>>>>> > 3. All the tooling would work as it does now - nodetool listsnapshots /
>>>>> > clearsnapshot / snapshot. Same outputs, same behavior.
>>>>> > 4. No need to use external tools copying SSTables to the desired
>>>>> > destination, custom scripts, manual synchronisation ...
>>>>> > 5. Snapshots located outside of Cassandra's live data dirs would behave
>>>>> > the same when it comes to snapshot TTL. (TTL on a snapshot means that
>>>>> > after such and such a period of time, it is automatically removed.)
>>>>> > This logic would be the same. Hence, there is no need to reinvent the
>>>>> > wheel when it comes to removing expired snapshots from the operator's
>>>>> > perspective.
>>>>> > 6. Such a solution would deduplicate SSTables, so it would be as
>>>>> > space-efficient as possible (though not as efficient as hard links, for
>>>>> > the obvious reasons mentioned above).
>>>>> >
>>>>> > It seems to me that there has recently been a "push" to add more logic
>>>>> > to Cassandra where it was previously delegated to external tooling; for
>>>>> > example, the CEP around automatic repairs is basically doing what
>>>>> > external tooling does, we just move it under Cassandra. We would love
>>>>> > to get rid of a lot of tooling and custom-written logic around copying
>>>>> > snapshot SSTables. From the implementation perspective it would be just
>>>>> > plain Java, without any external dependencies etc. There seems to be a
>>>>> > lot to gain from relatively straightforward additions to the
>>>>> > snapshotting code.
>>>>>
>>>>> Agree that there are things that need to move closer to the database
>>>>> process where it makes sense. Repair is an obvious one. This change seems
>>>>> beneficial as well, and for use cases that do not need to rely on this
>>>>> functionality the behavior would remain the same, so I see this as a win.
>>>>>
>>>>> > We did serious housekeeping in CASSANDRA-18111, where we consolidated
>>>>> > and centralized everything related to snapshot management, so we feel
>>>>> > comfortable building logic like this on top of that. In fact,
>>>>> > CASSANDRA-18111 was a prerequisite for this, because we did not want to
>>>>> > base this work on the pre-18111 state of things when it comes to
>>>>> > snapshots (it was all over the code base, with fragmented and
>>>>> > duplicated logic etc.).
>>>>> >
>>>>> > WDYT?
>>>>> >
>>>>> > Regards