If you ask specifically about how TTL snapshots are handled: there is a thread with a task scheduled every n seconds (I am not sure what the default is) which checks the "expired_at" field in the snapshot manifest to see whether the snapshot has expired. If it has, the snapshot is deleted like any other snapshot, and then the logic I have described above would be in place.
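Roughly, the shape of that check (just a sketch, the class names here are made up for illustration and are not the actual code):

    // Illustrative only: a periodic task that clears expired TTL snapshots.
    import java.time.Instant;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SnapshotTtlWatcher
    {
        // Hypothetical view of a snapshot manifest: name plus optional "expired_at".
        public record SnapshotManifest(String name, Instant expiredAt) {}

        public interface SnapshotStore
        {
            List<SnapshotManifest> listSnapshots();
            void clearSnapshot(String name); // same path as removing any other snapshot
        }

        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        public void start(SnapshotStore store, long periodSeconds)
        {
            scheduler.scheduleWithFixedDelay(() -> {
                Instant now = Instant.now();
                for (SnapshotManifest m : store.listSnapshots())
                {
                    // Only TTL snapshots carry an expiration; skip the rest.
                    if (m.expiredAt() != null && m.expiredAt().isBefore(now))
                        store.clearSnapshot(m.name());
                }
            }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        }
    }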
On Tue, Jan 21, 2025 at 10:01 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>
> On Tue, Jan 21, 2025 at 5:30 AM Francisco Guerrero <fran...@apache.org> wrote:
>
>> I think we should evaluate the benefits of the feature you are proposing independently of how it might be used by Sidecar or other tools. As it is, it already sounds like useful functionality to have in the core of the Cassandra process.
>>
>> Tooling around Cassandra, including Sidecar, can then leverage this functionality to create snapshots, and then add additional capabilities on top to perform backups.
>>
>> I've added some comments inline below:
>>
>> On 2025/01/12 18:25:07 Štefan Miklošovič wrote:
>> > Hi,
>> >
>> > I would like to run this through the ML to gather feedback as we are contemplating making this happen.
>> >
>> > Currently, snapshots are just hardlinks located in a snapshot directory pointing to the live data directory. That is super handy as it occupies virtually zero disk space etc. (as long as the underlying SSTables are not compacted away, then their size would "materialize").
>> >
>> > On the other hand, because it is a hardlink, it is not possible to make hard links across block devices (the infamous "Invalid cross-device link" error). That means that snapshots can only ever be located on the very same disk Cassandra has its data dirs on.
>> >
>> > Imagine there is a company ABC which has a 10 TiB disk (or NFS share) mounted to a Cassandra node and they would like to use that as cheap / cold storage for snapshots. They do not care about the speed of such storage nor about how much space it occupies when it comes to snapshots. On the other hand, they do not want snapshots occupying the disk space where Cassandra has its data because they consider that a waste of space. They would like to utilize the fast disk and its space for production data to the max, and snapshots might eat a lot of that space unnecessarily.
>> >
>> > There might be a configuration property like "snapshot_root_dir: /mnt/nfs/cassandra" and if a snapshot is taken, it would just copy the SSTables there, but we need to be a little bit smart here. (By default, it would all work as it does now - hard links to snapshot directories located under Cassandra's data_file_directories.)
>> >
>> > Because it is a copy, it occupies disk space. But if we took 100 snapshots of the same SSTables, we would not want to copy the same files 100 times. There is a very handy way to prevent this - unique SSTable identifiers (under the already existing uuid_sstable_identifiers_enabled property), so we could have a flat destination hierarchy where all SSTables would be located
>>
>> I have some questions around the flat destination hierarchy. For example, how do you keep track of TTLs for different snapshots? What if one snapshot doesn't have a TTL and the second does? Those details will need to be worked out. Of course, we can discuss these things during implementation of the feature.
>
> There would be a list of files a logical snapshot consists of in a snapshot manifest. We would keep track of which SSTables are in which snapshots.
>
> This is not tied to TTL; any two non-expiring snapshots could share the same SSTables.
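To make that a bit more concrete, the copy side could look roughly like this, assuming a flat snapshot_root_dir and unique (UUID-based) SSTable file names (just a sketch, the names are made up):

    // Illustrative only: deduplicated copy of snapshot SSTable components into a
    // flat destination directory shared by all user snapshots.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class FlatSnapshotCopier
    {
        /**
         * Copies each component into the flat destination unless an identically named
         * file is already there (a previous snapshot already copied it), and returns
         * the list of file names to record in this snapshot's manifest.
         */
        public static List<String> copySnapshot(List<Path> sstableComponents, Path snapshotRootDir) throws IOException
        {
            Files.createDirectories(snapshotRootDir);
            List<String> manifestEntries = new ArrayList<>();
            for (Path component : sstableComponents)
            {
                Path destination = snapshotRootDir.resolve(component.getFileName());
                if (!Files.exists(destination))
                    Files.copy(component, destination); // plain copy, so it may cross block devices
                manifestEntries.add(destination.getFileName().toString());
            }
            return manifestEntries;
        }
    }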
> If you go to remove one snapshot and you go to remove an SSTable, you need to check whether that particular SSTable is part of any other snapshot. If it is, then you cannot remove it while removing that snapshot, because that table is part of another one. If you removed it, you would make the other snapshot corrupt as it would be missing that SSTable.
>
> This logic is already implemented in Instaclustr Esop (1) (Esop as that Greek guy telling the fables (2)), the tooling we offer for backups and restores against various cloud providers. This stuff was already implemented and I feel confident it might be replicated here, but without a ton of baggage which comes from the fact that we need to accommodate specific clouds. I am not saying at all that the code from that tool would end up in Cassandra. No. What I am saying is that we have implemented that logic already, and in Cassandra it would be just way simpler.
>
> (1) https://github.com/instaclustr/esop
> (2) https://en.wikipedia.org/wiki/Aesop
>
>> > in the same directory and we would just check if such an SSTable is already there or not before copying it. Snapshot manifests (currently under manifest.json) would then contain all SSTables a logical snapshot consists of.
>> >
>> > This would be possible only for _user snapshots_. All snapshots taken by Cassandra itself (diagnostic snapshots, snapshots upon repairs, snapshots against all system tables, ephemeral snapshots) would continue to be hard links and it would not be possible to locate them outside of the live data dirs.
>> >
>> > The advantages / characteristics of this approach for user snapshots:
>> >
>> > 1. Cassandra will be able to create snapshots located on different devices.
>> > 2. From an implementation perspective it would be totally transparent; there will be no specific code about "where" we copy. We would just copy, from a Java perspective, as we copy anywhere else.
>> > 3. All the tooling would work as it does now - nodetool listsnapshots / clearsnapshot / snapshot. Same outputs, same behavior.
>> > 4. No need to use external tools copying SSTables to the desired destination, custom scripts, manual synchronisation ...
>> > 5. Snapshots located outside of Cassandra's live data dirs would behave the same when it comes to snapshot TTL. (TTL on a snapshot means that after such and such a period of time, it is automatically removed.) This logic would be the same. Hence, there is no need to re-invent the wheel when it comes to removing expired snapshots from the operator's perspective.
>> > 6. Such a solution would deduplicate SSTables, so it would be as space-efficient as possible (though not as efficient as hardlinks, for the obvious reasons mentioned above).
>> >
>> > It seems to me that there has recently been a "push" to add more logic to Cassandra where it was previously delegated to external tooling; for example, the CEP around automatic repairs is basically doing what external tooling does, we just move it under Cassandra. We would love to get rid of a lot of tooling and custom-written logic around copying snapshot SSTables. From the implementation perspective it would be just plain Java, without any external dependencies etc. There seems to be a lot to gain for relatively straightforward additions to the snapshotting code.
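The removal logic I described above, where an SSTable in the flat directory is deleted only when no other snapshot's manifest still references it, could look roughly like this (again just a sketch, the names are made up):

    // Illustrative only: remove a snapshot but keep any SSTable file that is still
    // referenced by another snapshot's manifest.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class SharedSnapshotRemover
    {
        /**
         * @param manifests snapshot name -> file names recorded in that snapshot's manifest
         */
        public static void removeSnapshot(String snapshotToRemove,
                                          Map<String, List<String>> manifests,
                                          Path snapshotRootDir) throws IOException
        {
            List<String> candidates = manifests.get(snapshotToRemove);
            if (candidates == null)
                return;

            // Files referenced by any *other* snapshot must survive.
            Set<String> stillReferenced = new HashSet<>();
            manifests.forEach((name, files) -> {
                if (!name.equals(snapshotToRemove))
                    stillReferenced.addAll(files);
            });

            for (String file : candidates)
            {
                if (!stillReferenced.contains(file))
                    Files.deleteIfExists(snapshotRootDir.resolve(file));
            }

            // Finally drop the removed snapshot's own manifest entry.
            manifests.remove(snapshotToRemove);
        }
    }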
>>
>> Agree that there are things that need to move closer to the database process where it makes sense. Repair is an obvious one. This change seems beneficial as well, and for use cases that do not need to rely on this functionality the behavior would remain the same, so I see this as a win.
>>
>> >
>> > We did serious housekeeping in CASSANDRA-18111 where we consolidated and centralized everything related to snapshot management, so we feel comfortable building logic like this on top of that. In fact, CASSANDRA-18111 was a prerequisite for this because we did not want to base this work on the pre-18111 state of things when it comes to snapshots (it was all over the code base, with fragmented and duplicated logic etc).
>> >
>> > WDYT?
>> >
>> > Regards
>> >