Hi Tom,

On Thu, Mar 6, 2025 at 11:33 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmh...@gmail.com> writes:
> > On Thu, Mar 6, 2025 at 12:43 AM Nikhil Kumar Veldanda
> > <veldanda.nikhilkuma...@gmail.com> wrote:
> >> Notably, this is the first compression algorithm for Postgres that can
> >> make use of a dictionary to provide higher levels of compression, but
> >> dictionaries have to be generated and maintained,
>
> > I think that solving the problems around using a dictionary is going
> > to be really hard. Can we see some evidence that the results will be
> > worth it?
>
> BTW, this is hardly the first such attempt. See [1] for a prior
> attempt at something fairly similar, which ended up going nowhere.
> It'd be wise to understand why that failed before pressing forward.
>
> Note that the thread title for [1] is pretty misleading, as the
> original discussion about JSONB-specific compression soon migrated
> to discussion of compressing TOAST data using dictionaries. At
> least from a ten-thousand-foot viewpoint, that seems like exactly
> what you're proposing here. I see that you dismissed [1] as
> irrelevant upthread, but I think you'd better look closer.
>
> regards, tom lane
>
> [1]
> https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com
Thank you for highlighting the previous discussion; I reviewed [1] closely.
While both methods involve dictionary-based compression, the approach I'm
proposing differs significantly. The previous method explicitly extracted
string values from JSONB and assigned a unique OID to each entry, resulting
in a distinct dictionary entry for every unique value. In contrast, this
approach leverages Zstandard's dictionary training API directly: we feed raw
data samples to Zstd, which produces a dictionary of a specified size. That
dictionary is then stored in a catalog table and used to compress subsequent
inserts for the specific attribute it was trained on (a minimal sketch of
this train/compress/decompress flow is at the end of this mail).

Key differences include:

1. No new data types are required.
2. An attribute can optionally have multiple dictionaries; compression always
   uses the latest one, and decompression retrieves and applies the exact
   dictionary that was used to compress the datum.
3. Compression uses Zstandard's trained dictionaries when available.

Additionally, I have provided an option for users to define custom sampling
and training logic, as directly passing raw buffers to the training API may
not always yield optimal results, especially for certain custom
variable-length data types. That flexibility is what motivates the
adjustments to `pg_type`.

I would greatly appreciate your feedback or any additional suggestions you
might have.

[1] https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com

Best regards,
Nikhil Veldanda
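To make that concrete, below is a minimal standalone sketch of the Zstd
dictionary cycle the proposal relies on: train a dictionary from raw samples,
compress a value with it, and decompress with the same dictionary. The sample
data, buffer sizes, and error handling are illustrative only and are not
taken from the patch; in the patch the trained dictionary lives in a catalog
table and decompression looks up the exact dictionary that was used. Note
also that real training needs far more sample data than shown here (zdict
recommends total sample size around 100x the target dictionary size), so a
toy run like this may simply report a training error.

/*
 * Standalone sketch: ZDICT_trainFromBuffer() + ZSTD_compress_usingDict()
 * + ZSTD_decompress_usingDict().  Build with: cc demo.c -lzstd
 */
#include <stdio.h>
#include <string.h>
#include <zstd.h>
#include <zdict.h>

int
main(void)
{
    /* A few raw samples, concatenated, with their individual sizes. */
    const char *samples[] = {
        "{\"user\": \"alice\", \"event\": \"login\"}",
        "{\"user\": \"bob\", \"event\": \"logout\"}",
        "{\"user\": \"carol\", \"event\": \"login\"}",
    };
    size_t      sample_sizes[3];
    char        sample_buf[4096];
    size_t      off = 0;

    for (int i = 0; i < 3; i++)
    {
        sample_sizes[i] = strlen(samples[i]);
        memcpy(sample_buf + off, samples[i], sample_sizes[i]);
        off += sample_sizes[i];
    }

    /* Train a dictionary of at most 1 kB from the samples. */
    char        dict[1024];
    size_t      dict_size = ZDICT_trainFromBuffer(dict, sizeof(dict),
                                                  sample_buf, sample_sizes, 3);

    if (ZDICT_isError(dict_size))
    {
        fprintf(stderr, "training failed: %s\n",
                ZDICT_getErrorName(dict_size));
        return 1;
    }

    /* Compress one new value using the trained dictionary. */
    const char *src = "{\"user\": \"dave\", \"event\": \"login\"}";
    char        dst[256];
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    size_t      csize = ZSTD_compress_usingDict(cctx, dst, sizeof(dst),
                                                src, strlen(src),
                                                dict, dict_size, 3);

    if (ZSTD_isError(csize))
    {
        fprintf(stderr, "compression failed\n");
        return 1;
    }

    /* Decompress with the exact same dictionary. */
    char        out[256];
    ZSTD_DCtx  *dctx = ZSTD_createDCtx();
    size_t      dsize = ZSTD_decompress_usingDict(dctx, out, sizeof(out),
                                                  dst, csize,
                                                  dict, dict_size);

    if (ZSTD_isError(dsize))
    {
        fprintf(stderr, "decompression failed\n");
        return 1;
    }

    printf("dict %zu bytes, %zu -> %zu -> %zu bytes\n",
           dict_size, strlen(src), csize, dsize);

    ZSTD_freeCCtx(cctx);
    ZSTD_freeDCtx(dctx);
    return 0;
}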