Looking into my prototype (it is still just WIP and not doing anything useful yet), I am training the dictionary on flush, so that is in line with what Jon is trying to do as well / what he suggests would be optimal.
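For context, the flush-time training I have in mind is essentially what the sketch below does with the zstd-jni trainer that Cassandra's zstd dependency already ships; the class name, field names and sizes are made-up placeholders for illustration, not code from my branch.

import com.github.luben.zstd.ZstdDictTrainer;

// Illustrative sketch only: serialized partitions from the memtable being flushed are
// fed to the trainer, and the trained dictionary is handed to the compressor writing
// the new SSTable.
public final class FlushDictionarySketch
{
    // Upper bound on how much sample data we are willing to buffer for training.
    private static final int SAMPLE_BUDGET = 16 * 1024 * 1024;
    // Dictionary size is a tunable; 64 KB here is an arbitrary starting point.
    private static final int DICT_SIZE = 64 * 1024;

    private final ZstdDictTrainer trainer = new ZstdDictTrainer(SAMPLE_BUDGET, DICT_SIZE);

    public void addSample(byte[] serializedPartition)
    {
        // Returns false once the sample budget is exhausted; further samples are ignored.
        trainer.addSample(serializedPartition);
    }

    public byte[] train()
    {
        // Produces the dictionary bytes that would be persisted next to the compression metadata.
        return trainer.trainSamples();
    }
}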
I do not have a dedicated dictionary component; what I tried to do was put the dictionary directly into COMPRESSION_INFO and bump the SSTable version, with a boolean saying whether it supports a dictionary or not. So that is at least one component fewer.

On Fri, Aug 1, 2025 at 5:59 PM Yifan Cai <yc25c...@gmail.com> wrote:

> Yeah. I have built 2 POCs and have initial benchmark data comparing w/ and w/o dictionary. Unfortunately, the work went to the backlog. I can pick it up again if there is demand for the feature.
>
> There are some discussions in the Jira that Stefan linked. (thanks Stefan!)
>
> - Yifan
>
> ------------------------------
> *From:* Štefan Miklošovič <smikloso...@apache.org>
> *Sent:* Friday, August 1, 2025 8:54:07 AM
> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
> *Subject:* Re: zstd dictionaries
>
> There is already a ticket for this:
> https://issues.apache.org/jira/browse/CASSANDRA-17021
>
> I would love to see this in action. I was investigating this a few years ago, when zstd landed for the first time (in 4.0, I think), and I was discussing it with Yifan, if my memory serves me well, but, as with other things, it just went nowhere and was probably forgotten. I think there might be some POC around already. I started working on this a few years ago and abandoned it because ... I still have a branch around and it would be great to compare it with what you have etc.
>
> On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> Hi folks,
>
> I'm working with a team that's interested in seeing zstd dictionaries for SSTable compression implemented due to the potential space and cost savings. I wanted to share my initial thoughts and get the dev list's thoughts as well.
>
> According to the zstd documentation [1], dictionaries can provide approximately 3x improvement in space savings compared to non-dictionary compression, along with roughly 4x faster compression and decompression performance. The site notes that "training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary). Hence, deploying one dictionary per type of data will provide the greatest benefits."
>
> The implementation appears straightforward from a code perspective, but there are some architectural considerations I'd like to discuss:
>
> *Dictionary Management* One critical aspect is that the dictionary becomes essential for data recovery - if you lose the dictionary, you lose access to the compressed data, similar to losing an encryption key. (Please correct me if I'm misunderstanding this dependency.)
>
> *Storage Approach* I'm considering two options for storing the dictionary:
>
> 1. *SSTable Component*: Save the dictionary as a separate SSTable component alongside the existing files. My hesitation here is that we've traditionally maintained that Data.db is the only essential component.
>
> 2. *Data.db Header*: Embed the dictionary directly in the Data.db file header.
>
> I'm strongly leaning toward the component approach because it avoids modifications to the Data.db file format and can leverage our existing streaming infrastructure. I spoke with Blake about this and it sounds like some of the newer features are more dependent on components other than Data, so I think this is acceptable.
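To the dictionary-management point above: that matches my understanding - a chunk compressed against a dictionary cannot be read back without that exact dictionary, so it has to be treated as essential as the data file itself. A minimal roundtrip with zstd-jni would look roughly like the sketch below, assuming the dictionary convenience overloads in a recent zstd-jni; names are illustrative, not taken from either POC.

import com.github.luben.zstd.Zstd;
import com.github.luben.zstd.ZstdDictCompress;
import com.github.luben.zstd.ZstdDictDecompress;

// Illustration only: a chunk compressed with a trained dictionary can only be
// decompressed with the same dictionary bytes, hence the recovery concern above.
public final class DictionaryRoundTripSketch
{
    public static byte[] compress(byte[] chunk, byte[] dictionary)
    {
        ZstdDictCompress dict = new ZstdDictCompress(dictionary, 3); // level 3 is a placeholder
        return Zstd.compress(chunk, dict);
    }

    public static byte[] decompress(byte[] compressed, byte[] dictionary, int originalLength)
    {
        // Decompressing without exactly these dictionary bytes fails with a dictionary mismatch.
        ZstdDictDecompress dict = new ZstdDictDecompress(dictionary);
        return Zstd.decompress(compressed, dict, originalLength);
    }
}

In practice the digested dictionary objects would be created once per SSTable reader/writer and reused, since rebuilding them per chunk would defeat the speed benefit.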
> Dictionary Generation
>
> We currently default to flushing using LZ4, although I think that's only an optimization to avoid the high overhead of zstd. Using the memtable data to create a dictionary prior to flush could remove the need for this optimization entirely.
>
> During compaction, my plan is to generate dictionaries by either sampling chunks from existing files (similar overhead to reading random rows) or using just the first pages of data from each SSTable. I'd need to do some testing to see what the optimal setup is here.
>
> Opt-in: I think the initial version of this should be opt-in via a flag on compression, but assuming it delivers on the performance and space gains, I think we'd want to remove the flag and make it the default. Assuming this feature lands in 6.0, I'd be looking to make it on by default in 7.0 when using zstd. The performance table lists LZ4 as still more performant, so I think we'd probably leave it as the default compression strategy, although performance benchmarks should be our guide here.
>
> Questions for the Community
>
> - Has anyone already explored zstd dictionaries for Cassandra?
> - If so, are there existing performance tests or benchmarks?
> - Any thoughts on the storage approach or dictionary generation strategy?
> - Other considerations I might be missing?
>
> It seems like this would be a fairly easy win for improving density in clusters that are limited by disk space per node. It should also improve overall performance by reducing compression and decompression overhead. For the team I'm working with, we'd be reducing node count in AWS by several hundred nodes. We started with about 1K nodes at 4TB / node, were able to remove roughly 700 with the introduction of CASSANDRA-15452 (now at approximately 13TB / node), and are looking to cut the number at least in half again.
>
> Looking forward to hearing your thoughts.
>
> Thanks,
> Jon
>
> [1] https://facebook.github.io/zstd/
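P.S. For the compaction-time generation Jon describes above, the chunk-sampling variant could take roughly the shape below; ChunkSource, the sampling rate and the sizes are made-up placeholders just to show the idea, and the "first pages" alternative would simply read the leading chunks instead of random ones.

import java.nio.ByteBuffer;
import java.util.List;
import java.util.Random;

import com.github.luben.zstd.ZstdDictTrainer;

// Sketch of the "sample chunks from existing files" idea, not working Cassandra code:
// ChunkSource is a stand-in for whatever hands us decompressed chunks from the
// SSTables selected for compaction.
public final class CompactionDictionarySketch
{
    public interface ChunkSource
    {
        int chunkCount();
        ByteBuffer readDecompressedChunk(int index); // e.g. one 16 KB compression chunk
    }

    public static byte[] trainFromSamples(List<ChunkSource> sstables, int samplesPerSSTable)
    {
        ZstdDictTrainer trainer = new ZstdDictTrainer(16 * 1024 * 1024, 64 * 1024);
        Random random = new Random();
        for (ChunkSource sstable : sstables)
        {
            // Random chunks spread the samples across each file, at the cost of random seeks.
            for (int i = 0; i < samplesPerSSTable; i++)
            {
                ByteBuffer chunk = sstable.readDecompressedChunk(random.nextInt(sstable.chunkCount()));
                byte[] sample = new byte[chunk.remaining()];
                chunk.get(sample);
                trainer.addSample(sample);
            }
        }
        return trainer.trainSamples();
    }
}

Either way, the trained bytes would then be written next to the chunk offsets, whether that ends up being COMPRESSION_INFO or a separate component.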