Looking into my prototype (it is still just WIP and not doing anything useful yet), I am training the dictionary on flush, so that is in line with what Jon is trying to do as well / what he suggests would be optimal.
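For context, the flush-time training I have in mind is essentially what the sketch below does with the zstd-jni trainer that Cassandra's zstd dependency already ships; the class name, field names and sizes are made-up placeholders for illustration, not code from my branch.

import com.github.luben.zstd.ZstdDictTrainer;

// Illustrative sketch only: serialized partitions from the memtable being flushed are
// fed to the trainer, and the trained dictionary is handed to the compressor writing
// the new SSTable.
public final class FlushDictionarySketch
{
    // Upper bound on how much sample data we are willing to buffer for training.
    private static final int SAMPLE_BUDGET = 16 * 1024 * 1024;
    // Dictionary size is a tunable; 64 KB here is an arbitrary starting point.
    private static final int DICT_SIZE = 64 * 1024;

    private final ZstdDictTrainer trainer = new ZstdDictTrainer(SAMPLE_BUDGET, DICT_SIZE);

    public void addSample(byte[] serializedPartition)
    {
        // Returns false once the sample budget is exhausted; further samples are ignored.
        trainer.addSample(serializedPartition);
    }

    public byte[] train()
    {
        // Produces the dictionary bytes that would be persisted next to the compression metadata.
        return trainer.trainSamples();
    }
}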
I do not have a dedicated dictionary component; what I tried to do was put the dictionary directly into COMPRESSION_INFO and bump the SSTable version, with a boolean saying whether it supports a dictionary or not. So that is at least one component fewer.

On Fri, Aug 1, 2025 at 5:59 PM Yifan Cai <yc25c...@gmail.com> wrote:

> Yeah. I have built 2 POCs and have initial benchmark data comparing w/ and w/o dictionary. Unfortunately, the work went to the backlog. I can pick it up again if there is demand for the feature.
>
> There are some discussions in the Jira that Stefan linked. (thanks Stefan!)
>
> - Yifan
>
> ------------------------------
> *From:* Štefan Miklošovič <smikloso...@apache.org>
> *Sent:* Friday, August 1, 2025 8:54:07 AM
> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
> *Subject:* Re: zstd dictionaries
>
> There is already a ticket for this:
> https://issues.apache.org/jira/browse/CASSANDRA-17021
>
> I would love to see this in action. I was investigating this a few years ago, when zstd landed for the first time (in 4.0, I think), and I was discussing it with Yifan, if my memory serves me well, but, as with other things, it just went nowhere and was probably forgotten. I think there might be some POC around already. I started working on this a few years ago and abandoned it because ... I still have a branch around and it would be great to compare it with what you have etc.
>
> On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> Hi folks,
>
> I'm working with a team that's interested in seeing zstd dictionaries for SSTable compression implemented due to the potential space and cost savings. I wanted to share my initial thoughts and get the dev list's thoughts as well.
>
> According to the zstd documentation [1], dictionaries can provide approximately 3x improvement in space savings compared to non-dictionary compression, along with roughly 4x faster compression and decompression performance. The site notes that "training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary). Hence, deploying one dictionary per type of data will provide the greatest benefits."
>
> The implementation appears straightforward from a code perspective, but there are some architectural considerations I'd like to discuss:
>
> *Dictionary Management* One critical aspect is that the dictionary becomes essential for data recovery - if you lose the dictionary, you lose access to the compressed data, similar to losing an encryption key. (Please correct me if I'm misunderstanding this dependency.)
>
> *Storage Approach* I'm considering two options for storing the dictionary:
>
> 1. *SSTable Component*: Save the dictionary as a separate SSTable component alongside the existing files. My hesitation here is that we've traditionally maintained that Data.db is the only essential component.
>
> 2. *Data.db Header*: Embed the dictionary directly in the Data.db file header.
>
> I'm strongly leaning toward the component approach because it avoids modifications to the Data.db file format and can leverage our existing streaming infrastructure. I spoke with Blake about this and it sounds like some of the newer features are more dependent on components other than Data, so I think this is acceptable.
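To the dictionary-management point above: that matches my understanding - a chunk compressed against a dictionary cannot be read back without that exact dictionary, so it has to be treated as essential as the data file itself. A minimal roundtrip with zstd-jni would look roughly like the sketch below, assuming the dictionary convenience overloads in a recent zstd-jni; names are illustrative, not taken from either POC.

import com.github.luben.zstd.Zstd;
import com.github.luben.zstd.ZstdDictCompress;
import com.github.luben.zstd.ZstdDictDecompress;

// Illustration only: a chunk compressed with a trained dictionary can only be
// decompressed with the same dictionary bytes, hence the recovery concern above.
public final class DictionaryRoundTripSketch
{
    public static byte[] compress(byte[] chunk, byte[] dictionary)
    {
        ZstdDictCompress dict = new ZstdDictCompress(dictionary, 3); // level 3 is a placeholder
        return Zstd.compress(chunk, dict);
    }

    public static byte[] decompress(byte[] compressed, byte[] dictionary, int originalLength)
    {
        // Decompressing without exactly these dictionary bytes fails with a dictionary mismatch.
        ZstdDictDecompress dict = new ZstdDictDecompress(dictionary);
        return Zstd.decompress(compressed, dict, originalLength);
    }
}

In practice the digested dictionary objects would be created once per SSTable reader/writer and reused, since rebuilding them per chunk would defeat the speed benefit.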
> Dictionary Generation
>
> We currently default to flushing using LZ4, although I think that's only an optimization to avoid the high overhead of zstd. Using the memtable data to create a dictionary prior to flush could remove the need for this optimization entirely.
>
> During compaction, my plan is to generate dictionaries by either sampling chunks from existing files (similar overhead to reading random rows) or using just the first pages of data from each SSTable. I'd need to do some testing to see what the optimal setup is here.
>
> Opt-in: I think the initial version of this should be opt-in via a flag on compression, but assuming it delivers on the performance and space gains, I think we'd want to remove the flag and make it the default. Assuming this feature lands in 6.0, I'd be looking to make it on by default in 7.0 when using zstd. The performance table lists LZ4 as still more performant, so I think we'd probably leave it as the default compression strategy, although performance benchmarks should be our guide here.
>
> Questions for the Community
>
> - Has anyone already explored zstd dictionaries for Cassandra?
> - If so, are there existing performance tests or benchmarks?
> - Any thoughts on the storage approach or dictionary generation strategy?
> - Other considerations I might be missing?
>
> It seems like this would be a fairly easy win for improving density in clusters that are limited by disk space per node. It should also improve overall performance by reducing compression and decompression overhead. For the team I'm working with, we'd be reducing node count in AWS by several hundred nodes. We started with about 1K nodes at 4TB / node, were able to remove roughly 700 with the introduction of CASSANDRA-15452 (now at approximately 13TB / node), and are looking to cut the number at least in half again.
>
> Looking forward to hearing your thoughts.
>
> Thanks,
> Jon
>
> [1] https://facebook.github.io/zstd/
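P.S. For the compaction-time generation Jon describes above, the chunk-sampling variant could take roughly the shape below; ChunkSource, the sampling rate and the sizes are made-up placeholders just to show the idea, and the "first pages" alternative would simply read the leading chunks instead of random ones.

import java.nio.ByteBuffer;
import java.util.List;
import java.util.Random;

import com.github.luben.zstd.ZstdDictTrainer;

// Sketch of the "sample chunks from existing files" idea, not working Cassandra code:
// ChunkSource is a stand-in for whatever hands us decompressed chunks from the
// SSTables selected for compaction.
public final class CompactionDictionarySketch
{
    public interface ChunkSource
    {
        int chunkCount();
        ByteBuffer readDecompressedChunk(int index); // e.g. one 16 KB compression chunk
    }

    public static byte[] trainFromSamples(List<ChunkSource> sstables, int samplesPerSSTable)
    {
        ZstdDictTrainer trainer = new ZstdDictTrainer(16 * 1024 * 1024, 64 * 1024);
        Random random = new Random();
        for (ChunkSource sstable : sstables)
        {
            // Random chunks spread the samples across each file, at the cost of random seeks.
            for (int i = 0; i < samplesPerSSTable; i++)
            {
                ByteBuffer chunk = sstable.readDecompressedChunk(random.nextInt(sstable.chunkCount()));
                byte[] sample = new byte[chunk.remaining()];
                chunk.get(sample);
                trainer.addSample(sample);
            }
        }
        return trainer.trainSamples();
    }
}

Either way, the trained bytes would then be written next to the chunk offsets, whether that ends up being COMPRESSION_INFO or a separate component.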