I'm excited to hear about the interest in this feature! I'm scheduling a Google Meet for Tuesday at 9 AM PST, for one hour, to discuss ZSTD with dictionary compression in Cassandra. I will send the meeting details closer to the time. Please send me an email if you would like to participate.
- Yifan

On Fri, Aug 1, 2025 at 12:58 PM Štefan Miklošovič <smikloso...@apache.org> wrote:

> Sure! Please share the link to the call if possible. I will be glad to
> participate in this in whatever way I can.
>
> Regards
>
> On Fri, Aug 1, 2025 at 6:53 PM Dinesh Joshi <djo...@apache.org> wrote:
>
>> We have explored compressing using trained dictionaries at various levels
>> - component, table, and keyspace. Component-level dictionary compression
>> is obviously best, but it results in a _lot_ of dictionaries. Anyway,
>> this really needs a bit of thought. Since there is a lot of interest and
>> prior work that each of us may have done, I would suggest we discuss the
>> various approaches in this thread, or get on a quick call and bring the
>> summary back to this list. Happy to organize a call if y'all are
>> interested.
>>
>> On Fri, Aug 1, 2025 at 9:07 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>>> Looking into my prototype (I think it is not doing anything yet, just
>>> WIP), I am training the dictionary on flush, so that is in line with
>>> what Jon is trying to do as well / what he suggests would be optimal.
>>>
>>> I do not have a dedicated dictionary component; what I tried instead
>>> was to put the dictionary directly into COMPRESSION_INFO and then bump
>>> the SSTable version with a boolean saying whether it supports
>>> dictionaries. So there is at least one component less.
>>>
>>> On Fri, Aug 1, 2025 at 5:59 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>>
>>>> Yeah. I have built 2 POCs and have initial benchmark data comparing
>>>> with and without a dictionary. Unfortunately, the work went to the
>>>> backlog. I can pick it up again if there is demand for the feature.
>>>> There are some discussions in the Jira that Stefan linked. (Thanks,
>>>> Stefan!)
>>>> - Yifan
>>>>
>>>> ------------------------------
>>>> *From:* Štefan Miklošovič <smikloso...@apache.org>
>>>> *Sent:* Friday, August 1, 2025 8:54:07 AM
>>>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>>> *Subject:* Re: zstd dictionaries
>>>>
>>>> There is already a ticket for this:
>>>> https://issues.apache.org/jira/browse/CASSANDRA-17021
>>>>
>>>> I would love to see this in action. I was investigating this a few
>>>> years ago, when zstd first landed in 4.0 I think, and was discussing
>>>> it with Yifan, if my memory serves me well, but, as with other things,
>>>> it just went nowhere and was probably forgotten. I think there might
>>>> be some POC around already. I started to work on this a few years ago
>>>> and abandoned it because ... I still have a branch around, and it
>>>> would be great to compare what you have, etc.
>>>>
>>>> On Fri, Aug 1, 2025 at 5:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> I'm working with a team that's interested in seeing zstd dictionaries
>>>> for SSTable compression implemented, due to the potential space and
>>>> cost savings. I wanted to share my initial thoughts and get the dev
>>>> list's thoughts as well.
>>>>
>>>> According to the zstd documentation [1], dictionaries can provide
>>>> approximately a 3x improvement in compression of small data compared
>>>> to non-dictionary compression, along with roughly 4x faster
>>>> compression and decompression. The site notes that "training works if
>>>> there is some correlation in a family of small data samples. The more
>>>> data-specific a dictionary is, the more efficient it is (there is no
>>>> universal dictionary). Hence, deploying one dictionary per type of
>>>> data will provide the greatest benefits."
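[Editor's note: the effect described above is easy to demonstrate. The sketch below is illustrative only - it uses the standard library's zlib preset dictionary (`zdict`) rather than a trained zstd dictionary, since zstd bindings are not in the Python stdlib, and the sample "rows" are made up, not Cassandra data. The principle is the same: small correlated records compress far better against a shared dictionary.]

```python
import zlib

# Small correlated "rows" standing in for chunks of one table's data.
# The JSON shape here is invented purely for illustration.
samples = [
    ('{"user_id": %d, "status": "active", "region": "us-west-2"}' % i).encode()
    for i in range(1000)
]

# zstd trains a real dictionary from many samples (ZDICT_trainFromBuffer);
# here the "dictionary" is simply one representative row fed to zlib's
# preset-dictionary mechanism.
dictionary = samples[0]

def compressed_size(data, zdict=None):
    # Fresh compressor per record, mimicking independent small chunks.
    c = zlib.compressobj(zdict=zdict) if zdict is not None else zlib.compressobj()
    return len(c.compress(data) + c.flush())

plain = sum(compressed_size(s) for s in samples)
with_dict = sum(compressed_size(s, dictionary) for s in samples)
ratio = plain / with_dict
print(f"plain={plain} with_dict={with_dict} ratio={ratio:.1f}x")
```

Running this shows the dictionary-compressed total coming in at a fraction of the plain total, because each record is mostly a match against the shared dictionary.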
>>>> The implementation appears straightforward from a code perspective,
>>>> but there are some architectural considerations I'd like to discuss:
>>>>
>>>> *Dictionary Management* One critical aspect is that the dictionary
>>>> becomes essential for data recovery - if you lose the dictionary, you
>>>> lose access to the compressed data, similar to losing an encryption
>>>> key. (Please correct me if I'm misunderstanding this dependency.)
>>>>
>>>> *Storage Approach* I'm considering two options for storing the
>>>> dictionary:
>>>>
>>>> 1. *SSTable Component*: Save the dictionary as a separate SSTable
>>>> component alongside the existing files. My hesitation here is that
>>>> we've traditionally maintained that Data.db is the only essential
>>>> component.
>>>>
>>>> 2. *Data.db Header*: Embed the dictionary directly in the Data.db
>>>> file header.
>>>>
>>>> I'm strongly leaning toward the component approach because it avoids
>>>> modifications to the Data.db file format and can leverage our
>>>> existing streaming infrastructure. I spoke with Blake about this, and
>>>> it sounds like some of the newer features already depend on
>>>> components other than Data, so I think this is acceptable.
>>>>
>>>> *Dictionary Generation*
>>>>
>>>> We currently default to flushing using LZ4, although I think that's
>>>> only an optimization to avoid the high overhead of zstd. Using the
>>>> memtable data to create a dictionary prior to flush could remove the
>>>> need for this optimization entirely.
>>>>
>>>> During compaction, my plan is to generate dictionaries by either
>>>> sampling chunks from existing files (similar overhead to reading
>>>> random rows) or using just the first pages of data from each SSTable.
>>>> I'd need to do some testing to see what the optimal setup is here.
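[Editor's note: the chunk-sampling idea described above might look like the following sketch. This is not Cassandra code - the chunk size, sample cap, and per-file sample count are all invented parameters, and real SSTable readers would go through the compressed-chunk offsets rather than raw byte offsets.]

```python
import os
import random

CHUNK_SIZE = 4096            # roughly one compression chunk (assumed)
MAX_SAMPLE_BYTES = 1 << 20   # cap on training input; zstd's docs suggest
                             # training data ~100x the dictionary size

def sample_chunks(paths, chunks_per_file=4, rng=None):
    """Collect random fixed-size chunks from existing files to use as
    dictionary-training samples - overhead comparable to reading a few
    random rows per file."""
    rng = rng or random.Random()
    samples, total = [], 0
    for path in paths:
        size = os.path.getsize(path)
        if size < CHUNK_SIZE:
            continue  # too small to yield a full chunk
        with open(path, "rb") as f:
            for _ in range(chunks_per_file):
                if total >= MAX_SAMPLE_BYTES:
                    return samples
                # Pick a random chunk start anywhere in the file.
                f.seek(rng.randrange(0, size - CHUNK_SIZE + 1))
                samples.append(f.read(CHUNK_SIZE))
                total += CHUNK_SIZE
    return samples
```

The alternative mentioned above - taking just the first pages of each SSTable - would replace the random seeks with a single sequential read per file, trading sample diversity for lower I/O cost.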
>>>> *Opt-in*: I think the initial version of this should be opt-in via a
>>>> flag on compression, but assuming it delivers on the performance and
>>>> space gains, I think we'd want to remove the flag and make it the
>>>> default. Assuming this feature lands in 6.0, I'd be looking to make
>>>> it on by default in 7.0 when using zstd. The performance table still
>>>> lists LZ4 as more performant, so I think we'd probably leave that as
>>>> the default compression strategy, although performance benchmarks
>>>> should be our guide here.
>>>>
>>>> *Questions for the Community*
>>>>
>>>> - Has anyone already explored zstd dictionaries for Cassandra?
>>>> - If so, are there existing performance tests or benchmarks?
>>>> - Any thoughts on the storage approach or dictionary generation
>>>> strategy?
>>>> - Other considerations I might be missing?
>>>>
>>>> This seems like a fairly easy win for improving density in clusters
>>>> that are limited by disk space per node. It should also improve
>>>> overall performance by reducing compression and decompression
>>>> overhead. For the team I'm working with, we'd be reducing node count
>>>> in AWS by several hundred nodes. We started with about 1K nodes at
>>>> 4 TB / node, were able to remove roughly 700 with the introduction of
>>>> CASSANDRA-15452 (now at approximately 13 TB / node), and are looking
>>>> to cut the number at least in half again.
>>>>
>>>> Looking forward to hearing your thoughts.
>>>>
>>>> Thanks,
>>>>
>>>> Jon
>>>>
>>>> [1] https://facebook.github.io/zstd/
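[Editor's note: an opt-in compression flag as proposed above might surface to users as a table-level compression parameter. `ZstdCompressor` is a real Cassandra compressor class, but the `use_dictionary` option below is purely hypothetical - no such parameter exists today.]

```sql
-- Hypothetical opt-in: dictionary compression off unless requested.
ALTER TABLE ks.events WITH compression = {
    'class': 'ZstdCompressor',
    'use_dictionary': 'true'   -- illustrative parameter, not a real option
};
```

If the feature later becomes the default for zstd, the flag would flip to an escape hatch (`'use_dictionary': 'false'`) rather than an enabler.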