Hi Robert,

Thanks for raising that question. The idea behind including a 24-bit length field alongside the 1-byte algorithm ID is to make each compressed datum self-describe the size of its metadata. This allows any compression algorithm to embed variable-length metadata (up to 16 MB) without hard-coding header sizes. For instance, a future algorithm might need a different metadata length for each datum, in which case a fixed table of header sizes wouldn't work. Storing the length in the header keeps the design generic and future-proof.

I would greatly appreciate any feedback on this design.

Thanks!
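For concreteness, here is a minimal sketch of the 4-byte header being discussed: an 8-bit algorithm ID and a 24-bit metadata length packed into one 32-bit word. The names (cmp_hdr_pack and friends) and the choice to put the algorithm ID in the high byte are assumptions made for illustration only; they are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names and bit placement; not taken from the patch. */
#define CMP_HDR_LEN_BITS 24
#define CMP_HDR_LEN_MASK ((UINT32_C(1) << CMP_HDR_LEN_BITS) - 1)

/* Pack an 8-bit algorithm ID and a 24-bit metadata length into 32 bits. */
static inline uint32_t
cmp_hdr_pack(uint8_t alg_id, uint32_t meta_len)
{
    /* The length must fit in 24 bits (just under 16 MB). */
    assert(meta_len <= CMP_HDR_LEN_MASK);
    return ((uint32_t) alg_id << CMP_HDR_LEN_BITS) | meta_len;
}

/* Extract the algorithm ID from the high 8 bits. */
static inline uint8_t
cmp_hdr_alg_id(uint32_t hdr)
{
    return (uint8_t) (hdr >> CMP_HDR_LEN_BITS);
}

/* Extract the per-datum metadata length from the low 24 bits. */
static inline uint32_t
cmp_hdr_meta_len(uint32_t hdr)
{
    return hdr & CMP_HDR_LEN_MASK;
}
```

With this layout, a zstd-with-dict datum would carry a length of 4 (sizeof(Oid)) in every header, while an algorithm with no per-datum metadata would carry 0.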

On Mon, Apr 28, 2025 at 7:50 AM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Fri, Apr 25, 2025 at 11:15 AM Nikhil Kumar Veldanda
> <veldanda.nikhilkuma...@gmail.com> wrote:
> > a. 24 bits for length → per-datum compression algorithm metadata is
> > capped at 16 MB, which is far more than any realistic compression
> > header.
> > b. 8 bits for algorithm id → up to 256 algorithms.
> > c. Zero-overhead when unused if an algorithm needs no per-datum
> > metadata (e.g., ZSTD-nodict),
>
> I don't understand why we need to spend 24 bits on a length header
> here. I agree with the idea of adding a 1-byte quantity for algorithm
> here, but I don't see why we need anything more than that. If the
> compression method is zstd-with-a-dict, then the payload data
> presumably needs to start with the OID of the dictionary, but it seems
> like in your schema every single datum would use these 3 bytes to
> store the fact that sizeof(Oid) = 4. The code that interprets
> zstd-with-dict datums should already know the header length. Even if
> generic code that works with all types of compression needs to be able
> to obtain the header length on a per-compression-type basis, there can
> be some kind of callback or table for that, rather than storing it in
> every single datum.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

--
Nikhil Veldanda