I'm excited to hear about the interest in this feature! I'm scheduling a Google Meet for Tuesday at 9 AM PST, for one hour, to discuss ZSTD with dictionary compression in Cassandra. I will send the meeting details closer to the time. Please send me an email if you would like to participate.
- Yifan

On Fri, Aug 1, 2025 at 12:58 PM Štefan Miklošovič <smikloso...@apache.org> wrote:

> Sure! Please share the link to the call if possible. I will be glad to
> participate in this in whatever way I can.
>
> Regards
>
> On Fri, Aug 1, 2025 at 6:53 PM Dinesh Joshi <djo...@apache.org> wrote:
>
>> We have explored compressing using trained dictionaries at various levels
>> - component, table, and keyspace. Component-level dictionary compression
>> is obviously best, but it results in a _lot_ of dictionaries. Anyway,
>> this really needs a bit of thought. Since there is a lot of interest and
>> prior work that each of us may have done, I would suggest we discuss the
>> various approaches in this thread, or get on a quick call and bring the
>> summary back to this list. Happy to organize a call if y'all are
>> interested.
>>
>> On Fri, Aug 1, 2025 at 9:07 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>>> Looking into my prototype (I think it is not doing anything yet, just
>>> WIP), I am training the dictionary on flush, so that is in line with
>>> what Jon is trying to do as well / what he suggests would be optimal.
>>>
>>> I do not have a dedicated dictionary component; what I tried instead
>>> was to put the dictionary directly into COMPRESSION_INFO and then bump
>>> the SSTable version with a boolean saying whether it supports
>>> dictionaries. So there is at least one component less.
>>>
>>> On Fri, Aug 1, 2025 at 5:59 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>>
>>>> Yeah. I have built 2 POCs and have initial benchmark data comparing
>>>> with and without a dictionary. Unfortunately, the work went to the
>>>> backlog. I can pick it up again if there is demand for the feature.
>>>> There are some discussions in the Jira that Stefan linked. (Thanks,
>>>> Stefan!)
>>>> - Yifan
>>>>
>>>> ------------------------------
>>>> *From:* Štefan Miklošovič <smikloso...@apache.org>
>>>> *Sent:* Friday, August 1, 2025 8:54:07 AM
>>>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>>> *Subject:* Re: zstd dictionaries
>>>>
>>>> There is already a ticket for this:
>>>> https://issues.apache.org/jira/browse/CASSANDRA-17021
>>>>
>>>> I would love to see this in action. I was investigating this a few
>>>> years ago, when zstd first landed in 4.0 I think, and was discussing
>>>> it with Yifan, if my memory serves me well, but, as with other things,
>>>> it just went nowhere and was probably forgotten. I think there might
>>>> be some POC around already. I started to work on this a few years ago
>>>> and abandoned it because ... I still have a branch around, and it
>>>> would be great to compare what you have, etc.
>>>>
>>>> On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> I'm working with a team that's interested in seeing zstd dictionaries
>>>> for SSTable compression implemented, due to the potential space and
>>>> cost savings. I wanted to share my initial thoughts and get the dev
>>>> list's thoughts as well.
>>>>
>>>> According to the zstd documentation [1], dictionaries can provide
>>>> approximately a 3x improvement in compression of small data compared
>>>> to non-dictionary compression, along with roughly 4x faster
>>>> compression and decompression. The site notes that "training works if
>>>> there is some correlation in a family of small data samples. The more
>>>> data-specific a dictionary is, the more efficient it is (there is no
>>>> universal dictionary). Hence, deploying one dictionary per type of
>>>> data will provide the greatest benefits."
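[Editor's note: the effect described above is easy to demonstrate. The sketch below is illustrative only - it uses the standard library's zlib preset dictionary (`zdict`) rather than a trained zstd dictionary, since zstd bindings are not in the Python stdlib, and the sample "rows" are made up, not Cassandra data. The principle is the same: small correlated records compress far better against a shared dictionary.]

```python
import zlib

# Small correlated "rows" standing in for chunks of one table's data.
# The JSON shape here is invented purely for illustration.
samples = [
    ('{"user_id": %d, "status": "active", "region": "us-west-2"}' % i).encode()
    for i in range(1000)
]

# zstd trains a real dictionary from many samples (ZDICT_trainFromBuffer);
# here the "dictionary" is simply one representative row fed to zlib's
# preset-dictionary mechanism.
dictionary = samples[0]

def compressed_size(data, zdict=None):
    # Fresh compressor per record, mimicking independent small chunks.
    c = zlib.compressobj(zdict=zdict) if zdict is not None else zlib.compressobj()
    return len(c.compress(data) + c.flush())

plain = sum(compressed_size(s) for s in samples)
with_dict = sum(compressed_size(s, dictionary) for s in samples)
ratio = plain / with_dict
print(f"plain={plain} with_dict={with_dict} ratio={ratio:.1f}x")
```

Running this shows the dictionary-compressed total coming in at a fraction of the plain total, because each record is mostly a match against the shared dictionary.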
>>>> The implementation appears straightforward from a code perspective,
>>>> but there are some architectural considerations I'd like to discuss:
>>>>
>>>> *Dictionary Management* One critical aspect is that the dictionary
>>>> becomes essential for data recovery - if you lose the dictionary, you
>>>> lose access to the compressed data, similar to losing an encryption
>>>> key. (Please correct me if I'm misunderstanding this dependency.)
>>>>
>>>> *Storage Approach* I'm considering two options for storing the
>>>> dictionary:
>>>>
>>>> 1. *SSTable Component*: Save the dictionary as a separate SSTable
>>>> component alongside the existing files. My hesitation here is that
>>>> we've traditionally maintained that Data.db is the only essential
>>>> component.
>>>>
>>>> 2. *Data.db Header*: Embed the dictionary directly in the Data.db
>>>> file header.
>>>>
>>>> I'm strongly leaning toward the component approach because it avoids
>>>> modifications to the Data.db file format and can leverage our
>>>> existing streaming infrastructure. I spoke with Blake about this, and
>>>> it sounds like some of the newer features already depend on
>>>> components other than Data, so I think this is acceptable.
>>>>
>>>> *Dictionary Generation*
>>>>
>>>> We currently default to flushing using LZ4, although I think that's
>>>> only an optimization to avoid the high overhead of zstd. Using the
>>>> memtable data to create a dictionary prior to flush could remove the
>>>> need for this optimization entirely.
>>>>
>>>> During compaction, my plan is to generate dictionaries by either
>>>> sampling chunks from existing files (similar overhead to reading
>>>> random rows) or using just the first pages of data from each SSTable.
>>>> I'd need to do some testing to see what the optimal setup is here.
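[Editor's note: the chunk-sampling idea described above might look like the following sketch. This is not Cassandra code - the chunk size, sample cap, and per-file sample count are all invented parameters, and real SSTable readers would go through the compressed-chunk offsets rather than raw byte offsets.]

```python
import os
import random

CHUNK_SIZE = 4096            # roughly one compression chunk (assumed)
MAX_SAMPLE_BYTES = 1 << 20   # cap on training input; zstd's docs suggest
                             # training data ~100x the dictionary size

def sample_chunks(paths, chunks_per_file=4, rng=None):
    """Collect random fixed-size chunks from existing files to use as
    dictionary-training samples - overhead comparable to reading a few
    random rows per file."""
    rng = rng or random.Random()
    samples, total = [], 0
    for path in paths:
        size = os.path.getsize(path)
        if size < CHUNK_SIZE:
            continue  # too small to yield a full chunk
        with open(path, "rb") as f:
            for _ in range(chunks_per_file):
                if total >= MAX_SAMPLE_BYTES:
                    return samples
                # Pick a random chunk start anywhere in the file.
                f.seek(rng.randrange(0, size - CHUNK_SIZE + 1))
                samples.append(f.read(CHUNK_SIZE))
                total += CHUNK_SIZE
    return samples
```

The alternative mentioned above - taking just the first pages of each SSTable - would replace the random seeks with a single sequential read per file, trading sample diversity for lower I/O cost.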
>>>> *Opt-in*: I think the initial version of this should be opt-in via a
>>>> flag on compression, but assuming it delivers on the performance and
>>>> space gains, I think we'd want to remove the flag and make it the
>>>> default. Assuming this feature lands in 6.0, I'd be looking to make
>>>> it on by default in 7.0 when using zstd. The performance table still
>>>> lists LZ4 as more performant, so I think we'd probably leave that as
>>>> the default compression strategy, although performance benchmarks
>>>> should be our guide here.
>>>>
>>>> *Questions for the Community*
>>>>
>>>> - Has anyone already explored zstd dictionaries for Cassandra?
>>>> - If so, are there existing performance tests or benchmarks?
>>>> - Any thoughts on the storage approach or dictionary generation
>>>> strategy?
>>>> - Other considerations I might be missing?
>>>>
>>>> This seems like a fairly easy win for improving density in clusters
>>>> that are limited by disk space per node. It should also improve
>>>> overall performance by reducing compression and decompression
>>>> overhead. For the team I'm working with, we'd be reducing node count
>>>> in AWS by several hundred nodes. We started with about 1K nodes at
>>>> 4 TB / node, were able to remove roughly 700 with the introduction of
>>>> CASSANDRA-15452 (now at approximately 13 TB / node), and are looking
>>>> to cut the number at least in half again.
>>>>
>>>> Looking forward to hearing your thoughts.
>>>>
>>>> Thanks,
>>>>
>>>> Jon
>>>>
>>>> [1] https://facebook.github.io/zstd/
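[Editor's note: an opt-in compression flag as proposed above might surface to users as a table-level compression parameter. `ZstdCompressor` is a real Cassandra compressor class, but the `use_dictionary` option below is purely hypothetical - no such parameter exists today.]

```sql
-- Hypothetical opt-in: dictionary compression off unless requested.
ALTER TABLE ks.events WITH compression = {
    'class': 'ZstdCompressor',
    'use_dictionary': 'true'   -- illustrative parameter, not a real option
};
```

If the feature later becomes the default for zstd, the flag would flip to an escape hatch (`'use_dictionary': 'false'`) rather than an enabler.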