Scott,

What you wrote is all correct, but I have a feeling that both you and Jeff
are talking about something different - some other aspect of using object
storage.

It seems I still need to repeat that I don't consider object storage to be
useless; everybody keeps arguing that point as if I had claimed otherwise. I
agree with you already.

It seems to me that the use cases you want S3 (or any object storage, for
that matter) for are actively reading / writing to satisfy queries,
offloading data, and so on.

What I am talking about when it comes to S3 mounted locally is just using it
to copy the SSTables captured by a snapshot. In (1), I never wanted to use S3
for anything other than literally copying SSTables there as part of
snapshotting and being done with it. To exaggerate to make a point: where do
I need to "hurry" so that I care about speed in this particular case? I have
never said I consider speed to be important here.

What I consider important is that it is super easy to use - this approach
is cloud agnostic, we do not need to implement anything in Cassandra, and
there is no messing with dependencies. It is "future-proof" in the sense
that whatever cloud somebody wants to use for storing snapshots, all it takes
is to _somehow_ mount it locally and everything works out of the box.
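
To make it concrete, the entire "integration" on our side boils down to an
ordinary file copy. A minimal sketch, assuming the bucket is already mounted
at a made-up path such as /mnt/snapshot-bucket (the paths and the class name
are purely illustrative, nothing here is Cassandra-specific):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public final class SnapshotCopyExample
{
    // Copies every file of an already-taken snapshot into the mounted bucket.
    // The same code works whether the mount is backed by S3, GCS, Azure or a
    // plain slow disk, because all it ever sees is a local directory.
    public static void copySnapshot(Path snapshotDir, Path mountedBucketDir) throws IOException
    {
        Path target = mountedBucketDir.resolve(snapshotDir.getFileName());
        Files.createDirectories(target);
        try (Stream<Path> files = Files.list(snapshotDir))
        {
            files.filter(Files::isRegularFile).forEach(f -> {
                try
                {
                    Files.copy(f, target.resolve(f.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                }
                catch (IOException e)
                {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}

E.g. copySnapshot(snapshotDir, Path.of("/mnt/snapshot-bucket")) - that is all
there is to it.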

You want to leverage object storage for more "involved" use cases.

I do not see how mounting a dir and copying files there would "clash" with
your way of looking at it. Why can't we have both?

I keep repeating this, but I still don't know how it is actually going to be
done. If we want to support an object storage like, I don't know, GCP (I have
no clue whether that is even codeable), then there would need to be
_somebody_ who integrates with it. With mounting a dir for snapshotting
purposes, we do not need to deal with that. This additional complexity of
coding, integrating, deploying and so on seems to be repeatedly overlooked,
and I would really appreciate it if we spent a little more time expanding on
that area.
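
Just to illustrate what "somebody who integrates with it" implies, here is a
purely hypothetical sketch - no such interface exists in Cassandra today and
the names are made up - of what a native per-cloud integration would roughly
ask of us, each implementation dragging its own SDK onto the classpath:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;

// Hypothetical, for illustration only: the surface we would have to define,
// implement per provider, test and keep in sync with each cloud SDK.
public interface SnapshotUploader
{
    void upload(Path sstableComponent, String remoteKey) throws IOException;
    InputStream download(String remoteKey) throws IOException;
}

// One of these per cloud, each pulling e.g. aws-sdk-s3, google-cloud-storage
// or azure-storage-blob onto Cassandra's classpath:
//   class S3SnapshotUploader implements SnapshotUploader { ... }
//   class GcsSnapshotUploader implements SnapshotUploader { ... }
//   class AzureSnapshotUploader implements SnapshotUploader { ... }

With a mounted dir, none of that exists and nothing new ships with Cassandra.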

I do not have a problem with using object storage for the things you want,
but I do not get why that should automatically disqualify scenarios where a
user figures out that mounting the storage locally is sufficient.

(1) https://lists.apache.org/thread/8cz5fh835ojnxwtn1479q31smm5x7nxt

On Wed, Mar 5, 2025 at 6:22 AM C. Scott Andreas <sc...@paradoxica.net>
wrote:

> To Jeff’s point on tactical vs. strategic, here’s the big picture for me
> on object storage:
>
> *– Object storage is 70% cheaper:*
> Replicated flash block storage is extremely expensive, and more so with
> compute resources constantly attached. If one were to build a storage
> platform on top of a cloud provider’s compute and storage infrastructure,
> selective use of object storage is essential to even being in the ballpark
> of managed offerings on price. EBS is 8¢/GB. S3 is 2.3¢/GB. It’s over 70%
> cheaper.
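>
> As a quick sanity check on that number with those list prices:
> (8.0 - 2.3) / 8.0 = 0.7125, i.e. roughly 71% cheaper per GB-month.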
>
> *– Local/block storage is priced on storage *provisioned*. Object storage
> is priced on storage *consumed*:*
> It’s actually better than 70%. Local/block storage is priced based on the
> size of disks/volumes provisioned. While they may be resizable, resizing is
> generally inelastic. This typically produces a large gap between storage
> consumed vs. storage provisioned - and poor utilization. Object storage is
> typically priced on storage that is actually consumed.
>
> *– Object storage integration is the simplest path to complete decoupling
> of CPU and storage:*
> Block volumes are more fungible than local disk, but aren’t even close in
> flexibility to an SSTable that can be accessed by any function. Object is
> also the only sensible path to implementing a serverless database whose
> query facilities can be deployed on a function-as-a-service platform. That
> enables one to reach an idle compute cost of zero and an idle storage cost
> of 2.3¢/GB/month (S3).
>
> *– Object storage enables scale-to-zero:*
> Object storage integration is the only path for most databases to provide
> a scale-to-zero offering that doesn’t rely on keeping hot NVMe or block
> storage attached 24/7 while a database receives zero queries per second.
>
> *– Scale to zero is one of the easiest paths to zero marginal cost (the
> other is multitenancy - and not mutually exclusive):*
> Database platforms operated in a cluster-as-a-service model incur a
> constant fixed cost of provisioned resources regardless of whether they are
> in use. That’s fine for platforms that pass the full cost of resources
> consumed back to someone — but it produces poor economics and resource
> waste. Ability to scale to zero dramatically reduces the cost of
> provisioning and maintaining an un/underutilized database.
>
> *– It’s not all or nothing:*
> There are super sensible ways to pair local/block storage and object
> storage. One might be to store upper-level SSTable data components in
> object storage; and all other SSTable components (TOC, CompressionInfo,
> primary index, etc) on local flash. This gives you a way to rapidly
> enumerate and navigate SSTable metadata while only paying the cost of reads
> when fetching data (and possibly from upper-level SSTables only).
> Alternately, one could offload only older partitions of data in TWCS - time
> series data older than 30 days, a year, etc.
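>
> To make that split concrete, here is a purely hypothetical sketch of the
> routing decision - none of these names exist in Cassandra, and the level
> threshold is arbitrary; it only illustrates the idea above:
>
> enum StorageTier { LOCAL_FLASH, OBJECT_STORE }
>
> final class ComponentPlacement
> {
>     // Only the large -Data.db component of upper-level SSTables goes to the
>     // object store; TOC, CompressionInfo, indexes etc. stay on local flash
>     // so SSTable metadata can still be enumerated quickly.
>     static StorageTier tierFor(String componentName, int sstableLevel)
>     {
>         boolean isData = componentName.endsWith("-Data.db");
>         return (isData && sstableLevel >= 3) ? StorageTier.OBJECT_STORE
>                                              : StorageTier.LOCAL_FLASH;
>     }
> }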
>
> *– Mounting a bucket as a filesystem is unusable:*
> Others have made this point. Naively mounting an object storage bucket as
> a filesystem produces uniformly *terrible* results. S3 time to first byte
> is in the 20-30ms+ range. S3 one-zone is closer to the 5-10ms range which
> is on par with the seek latency of a spinning disk. Despite C* originally
> being designed to operate well on spinning disks, a filesystem-like
> abstraction backed by object storage today will result in awful surprises
> due to IO patterns like those found by Jon H. and Jordan recently.
>
> *– Object Storage can unify “database” and “lakehouse”:*
> One could imagine Iceberg integration that enables manifest/snapshot-based
> querying of SSTables in an object store via Spark or similar platforms,
> with zero ETL or light cone contact with a production database process.
>
> The reason people care about object is that it’s 70%+ cheaper than flash -
> and 90%+ cheaper if the software querying it isn’t always running, too.
>
> – Scott
>
> —
> Mobile
>
> On Mar 4, 2025, at 12:29 PM, Štefan Miklošovič <smikloso...@apache.org>
> wrote:
>
>
> Jeff,
>
> when it comes to snapshots, there was already a discussion in another
> thread (1) which I am not sure you are aware of; in (2) I talk about
> Sidecar + snapshots specifically. One "caveat" of Sidecar is that you
> actually _need_ Sidecar deployed if we ever contemplate it doing the
> upload / backup (by whatever means).
>
> "s3 bucket as a mounted directory" bypasses the necessity of having
> Sidecar deployed. I do not want to repeat what I wrote in (2), all the
> reasoning is there.
>
> My primary motivation for the mounting approach is to 1) not require
> Sidecar if it is not needed and 2) just save snapshot SSTables outside of
> table data dirs (which is imho a completely fine requirement on its own,
> e.g. putting snapshots on a different / slower disk; why does it have to be
> on the same disk as the data?)
>
> The vibe I got from that thread was that what I was proposing is generally
> acceptable, and I wanted to figure out the details since Blake mentioned
> incremental backups as well. Other than that, I was under the impression
> that we were pretty much settled on how it should work; that thread was up
> for a long time, so I assumed people in general did not have a problem with
> it.
>
> How I read your email, specifically this part:
>
> "This is the same feedback I gave the sidecar with the rsync to another
> machine proposal. If we stop doing one off tactical projects and map out
> the actual problem we’re solving, we can get the right interfaces."
>
> it seems to me that you are categorizing "s3 bucket mounted locally" as
> one of these "one off tactical projects", which in translation means that
> you would like to see that approach not implemented and that we should
> rather focus on doing everything via proxies?
>
> One big downside of that, which nobody has answered yet (as far as I can
> tell), is how this would actually look in terms of deployment and delivery.
> As I said in (1), we would need to code up a proxy for every remote storage
> - Azure, S3, GCP, to name a few. We would need to implement all of these
> for each and every cloud.
>
> Secondly, who would implement that, and where would that code live? Is it
> up to individuals to code it up internally? When we want to talk to S3, we
> need to put all the S3 dependencies on the classpath - who is going to
> integrate that, and is that even possible? Similarly for other clouds.
>
> By mounting a dir, we do not touch Cassandra's classpath at all. It stays
> as it was - simple, easy - and we interact with it as we are used to.
>
> I see that a proxy might be viable for some applications, but I think it
> also has non-trivial operational disadvantages.
>
> (1) https://lists.apache.org/thread/8cz5fh835ojnxwtn1479q31smm5x7nxt
> (2) https://lists.apache.org/thread/mttg75ps49qkob6km4l74fmp879v76qs
>
> On Tue, Mar 4, 2025 at 5:13 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Mounted dirs give up the opportunity to change the IO model to account
>> for different behaviors. The configurable channel proxy may suffer from the
>> same IO constraints depending on implementation, too. But it may also
>> become viable.
>>
>> The snapshot outside of the mounted file system seems like you’re
>> implicitly implementing a one off in process backup. This is the same
>> feedback I gave the sidecar with the rsync to another machine proposal. If
>> we stop doing one off tactical projects and map out the actual problem
>> we’re solving, we can get the right interfaces.  Also, you can probably
>> just have the sidecar rsync thing do your snapshot to another directory on
>> host.
>>
>> But if every SSTable makes its way to S3, things like native backup,
>> restoring from backup, and recovering from local volumes can look VERY
>> different.
>>
>>
>>
>> On Mar 4, 2025, at 3:57 PM, Štefan Miklošovič <smikloso...@apache.org>
>> wrote:
>>
>>
>> For what it's worth - since it might seem to somebody that I am rejecting
>> this altogether (which is not the case; all I am trying to say is that we
>> should think about it more) - it would be good to know more about others'
>> experience with this. Maybe somebody has already tried mounting and it did
>> not work as expected?
>>
>> On the other hand, there is this "snapshots outside data dir" effort I am
>> working on, and if we combined it with this, I can imagine we could say
>> "and if you deal with snapshots, use this proxy instead", which would
>> transparently upload them to S3.
>>
>> Then we would not need to do anything at all, code-wise. We would not
>> need to store snapshots "outside of data dir" just to be able to place
>> them in a directory which is mounted as an S3 bucket.
>>
>> I don't know if it is possible to do it like that. Worth exploring, I
>> guess.
>>
>> I like mounted dirs for their simplicity, and I guess that for copying
>> files it might be just enough. Plus we would not need to add all the S3
>> jars to the classpath either, etc.
>>
>> On Tue, Mar 4, 2025 at 2:46 PM Štefan Miklošovič <smikloso...@apache.org>
>> wrote:
>>
>>> I am not saying that using remote object storage is useless.
>>>
>>> I am just saying that I don't see the difference. I have not measured it,
>>> but I can imagine that a mounted S3 dir would use, under the hood, the
>>> same calls to the S3 API. How else would it be done? You need to talk to
>>> the remote S3 storage eventually anyway. So why does it matter whether we
>>> call the S3 API from Java or by some other means, through some "S3
>>> driver"? It ends up using the same thing, no?
>>>
>>> On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Mounting an S3 bucket as a directory is an easy but poor implementation
>>>> of object-backed storage for databases.
>>>>
>>>> Object storage is durable (most data loss is due to bugs, not concurrent
>>>> hardware failures), cheap (it can be 5-10x cheaper) and ubiquitous. A huge
>>>> number of modern systems are object-storage-only because the approximately
>>>> infinite scale / cost / throughput tradeoffs often make up for the latency.
>>>>
>>>> Outright dismissing object storage for Cassandra is short-sighted - it
>>>> needs to be done in a way that makes sense, not just blindly copying over
>>>> the block access patterns to object.
>>>>
>>>>
>>>> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič <smikloso...@apache.org>
>>>> wrote:
>>>>
>>>>
>>>> I do not think we need this CEP, honestly. I don't want to diss this
>>>> unnecessarily, but if you mount remote storage locally (e.g. mounting an
>>>> S3 bucket as if it were any other directory on the node's machine), then
>>>> what is this CEP good for?
>>>>
>>>> Not to mention the necessity of putting all the dependencies needed to
>>>> talk to the respective remote storage onto Cassandra's classpath,
>>>> introducing potential problems with dependency incompatibilities /
>>>> conflicting versions, etc.
>>>>
>>>> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas <sc...@paradoxica.net>
>>>> wrote:
>>>>
>>>>> I’d love to see this implemented — where “this” is a proxy for some
>>>>> notion of support for remote object storage, perhaps usable by compaction
>>>>> strategies like TWCS to migrate data older than a threshold from a local
>>>>> filesystem to remote object.
>>>>>
>>>>> It’s not an area where I can currently dedicate engineering effort.
>>>>> But if others are interested in contributing a feature like this, I’d see
>>>>> it as valuable for the project and would be happy to collaborate on
>>>>> design/architecture/goals.
>>>>>
>>>>> – Scott
>>>>>
>>>>> On Feb 26, 2025, at 6:56 AM, guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Is anyone else interested in continuing to discuss this topic?
>>>>>
>>>>> guo Maxwell <cclive1...@gmail.com> wrote on Fri, Sep 20, 2024 at 09:44:
>>>>>
>>>>>> I discussed this offline with Claude; he is no longer working on
>>>>>> this.
>>>>>>
>>>>>> It's a pity - I think this is a very valuable thing. Commitlog
>>>>>> archiving and restore may be able to use the relevant code if it is
>>>>>> completed.
>>>>>>
>>>>>> Patrick McFadin <pmcfa...@gmail.com> wrote on Fri, Sep 20, 2024 at 2:01 AM:
>>>>>>
>>>>>>> Thanks for reviving this one!
>>>>>>>
>>>>>>> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell <cclive1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is there any update on this topic? It seems that things could make
>>>>>>>> big progress if Jake Luciani can find someone who can make the
>>>>>>>> FileSystemProvider code accessible.
>>>>>>>>
>>>>>>>> Jon Haddad <j...@jonhaddad.com> wrote on Sat, Dec 16, 2023 at 05:29:
>>>>>>>>
>>>>>>>>> At a high level I really like the idea of being able to better
>>>>>>>>> leverage cheaper storage especially object stores like S3.
>>>>>>>>>
>>>>>>>>> One important thing though - I feel pretty strongly that there's a
>>>>>>>>> big, deal-breaking downside. Backups, disk failure policies,
>>>>>>>>> snapshots and possibly repairs - which haven't been particularly
>>>>>>>>> great in the past - would get more complicated, and of course
>>>>>>>>> there's the issue of failure recovery being only partially possible
>>>>>>>>> if you're looking at a durable block store paired with an ephemeral
>>>>>>>>> one, with some of your data not replicated to the cold side. That
>>>>>>>>> introduces a failure case that's unacceptable for most teams, which
>>>>>>>>> results in needing to implement potentially 2 different backup
>>>>>>>>> solutions. This is operationally complex with a lot of surface area
>>>>>>>>> for headaches. I think a lot of teams would probably have an issue
>>>>>>>>> with the big question mark around durability, and I probably would
>>>>>>>>> avoid it myself.
>>>>>>>>>
>>>>>>>>> On the other hand, I'm +1 if we approach it slightly differently -
>>>>>>>>> where _all_ the data is located on the cold storage, with the local
>>>>>>>>> hot storage used as a cache. This means we can use the cold
>>>>>>>>> directories for the complete dataset, simplifying backups and node
>>>>>>>>> replacements.
>>>>>>>>>
>>>>>>>>> For a little background, we had a ticket several years ago where I
>>>>>>>>> pointed out it was possible to do this *today* at the operating
>>>>>>>>> system level as long as you're using block devices (vs an object
>>>>>>>>> store) and LVM [1]. For example, this works well with GP3 EBS w/ low
>>>>>>>>> IOPS provisioning + local NVMe to get a nice balance of great read
>>>>>>>>> performance without going nuts on the cost for IOPS. I also wrote
>>>>>>>>> about this in a little more detail in my blog [2]. There's also the
>>>>>>>>> new mountpoint tech in AWS, which pretty much does exactly what I've
>>>>>>>>> suggested above [3] and is probably worth evaluating just to get a
>>>>>>>>> feel for it.
>>>>>>>>>
>>>>>>>>> I'm not insisting we require LVM or the AWS S3 filesystem, since
>>>>>>>>> that would rule out other cloud providers, but I am pretty confident
>>>>>>>>> that the entire dataset should reside on the "cold" side of things
>>>>>>>>> for the practical and technical reasons I listed above. I don't
>>>>>>>>> think it massively changes the proposal, and it should simplify
>>>>>>>>> things for everyone.
>>>>>>>>>
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
>>>>>>>>> [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
>>>>>>>>> [3]
>>>>>>>>> https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Dec 14, 2023 at 1:56 AM Claude Warren <cla...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Is there still interest in this?  Can we get some points down on
>>>>>>>>>> electrons so that we all understand the issues?
>>>>>>>>>>
>>>>>>>>>> While it is fairly simple to redirect the read/write to something
>>>>>>>>>> other than the local system for a single node, this will not solve
>>>>>>>>>> the problem for tiered storage.
>>>>>>>>>>
>>>>>>>>>> Tiered storage will require that, on read/write, the primary key
>>>>>>>>>> be assessed to determine whether the read/write should be
>>>>>>>>>> redirected. My reasoning for this statement is that in a cluster
>>>>>>>>>> with a replication factor greater than 1, a node will store data
>>>>>>>>>> for the keys that would be allocated to it in a cluster with a
>>>>>>>>>> replication factor of 1, as well as some keys from nodes earlier in
>>>>>>>>>> the ring.
>>>>>>>>>>
>>>>>>>>>> Even if we can get the primary keys for all the data we want to
>>>>>>>>>> write to "cold storage" to map to a single node, a replication
>>>>>>>>>> factor > 1 means that data will also be placed in "normal storage"
>>>>>>>>>> on subsequent nodes.
>>>>>>>>>>
>>>>>>>>>> To overcome this, we have to explore ways to route data to
>>>>>>>>>> different storage based on the keys, and that different storage may
>>>>>>>>>> have to be available on _all_ the nodes.
>>>>>>>>>>
>>>>>>>>>> Have any of the partial solutions mentioned in this email chain
>>>>>>>>>> (or others) solved this problem?
>>>>>>>>>>
>>>>>>>>>> Claude
>>>>>>>>>>
>>>>>>>>>
