I've struggled to determine whether this is already a feature, in development, 
or neither (possibly because of overloading of the term "dictionary"), so I 
apologise in advance if the following brief is redundant:

Compressors like LZ4, zstd, and even gzip talk about "dictionary compression", 
meaning to pre-load the history window of the compressor and decompressor with 
pre-arranged patterns before the file is processed, so that back references 
can be made to text the first time it appears in the file, rather than having 
to build that window up from nothing at the start of the file by encoding 
everything as literals.

This can lead to an improvement in compression ratio.
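For a concrete (if contrived) illustration of that ratio improvement: Python's 
stdlib zlib exposes exactly this preloading via its `zdict` parameter. The 
sample dictionary and message below are made up, and chosen so the message's 
text already appears in the dictionary, which is the best case for preloading:

```python
import zlib

# Hypothetical sample dictionary and message, chosen so that the message's
# text already appears in the dictionary (the best case for preloading).
dictionary = b"the quick brown fox jumps over the lazy dog. " * 4
message = b"the quick brown fox jumps over the lazy dog."

# Plain compression: the window starts empty, so the first occurrence of
# every phrase must be spent mostly as literals.
plain = zlib.compress(message, 9)

# Dictionary compression: the window is preloaded with `dictionary`, so the
# whole message can be emitted as a back reference on first sight.
comp = zlib.compressobj(level=9, zdict=dictionary)
primed = comp.compress(message) + comp.flush()

# The decompressor must preload the same dictionary to resolve the references.
decomp = zlib.decompressobj(zdict=dictionary)
assert decomp.decompress(primed) + decomp.flush() == message
assert len(primed) < len(plain)
```

The same idea is what zstd's `--train`/`-D` options and LZ4's 
`LZ4_loadDict()` provide, just with better tooling for building the 
dictionary.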

It's generally only useful for small files, because in a larger file the 
back-reference window is established early and remains full of reference 
material for the rest of the file; but it should also benefit block-based 
compression, which suffers a loss of history at every entry point.
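That block-boundary loss can be sketched with the same zlib `zdict` mechanism. 
The file layout below is invented (records sharing a common boilerplate 
header, each landing in its own independently compressed block), but it shows 
how a shared dictionary recovers some of the history that block boundaries 
throw away:

```python
import zlib

# Hypothetical file: every record starts with the same boilerplate header,
# but records land in different compression blocks.
header = b"<record version='1.0' encoding='utf-8' schema='example'>\n"
blocks = [header + (b"payload-%d\n" % i) * 64 for i in range(8)]

def compress_blocks(blocks, zdict=None):
    # Each block is compressed independently, as a block-based format
    # (or seekable archive) would do -- no history crosses a boundary.
    total = 0
    for block in blocks:
        if zdict is None:
            c = zlib.compressobj(level=9)
        else:
            c = zlib.compressobj(level=9, zdict=zdict)
        total += len(c.compress(block) + c.flush())
    return total

cold = compress_blocks(blocks)                  # every block rebuilds history from scratch
primed = compress_blocks(blocks, zdict=header)  # shared boilerplate becomes a back reference
assert primed < cold
```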

So that's what I'm talking about; and my question, simply, is: is this a 
feature (or a planned feature) of erofs?  Something involving storing a set of 
uncompressed dictionary preload chunks within the filesystem, which are then 
used as the starting dictionary when compressing and decompressing the small 
chunks of each file?

In my imagination such a filesystem might provide a palette of uncompressed, 
page-aligned dictionaries, and each file (or each cluster?) would carry an 
index to the entry it uses.  Typically that choice might be implied by the 
file type, but files can have different dispositions as you seek through 
them, and a .txt file may contain English or Chinese or ASCII art, each 
demanding a different dictionary.  Making the right choice is an external 
optimisation problem.
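To be clear, I'm not claiming erofs does any of this; purely as a sketch of 
what I mean, the palette idea might look something like the following, where 
the `PALETTE` table, its entries, and the per-cluster index byte are all 
invented for illustration:

```python
import zlib

# A hypothetical on-disk palette: a small table of page-aligned,
# uncompressed dictionaries, selected by a per-file (or per-cluster) index.
PALETTE = [
    b"",                                                        # 0: no dictionary
    b"the and of to in is that it was for ",                    # 1: English prose
    b"<html><head><title></title></head><body></body></html>",  # 2: markup
]

def compress_cluster(data, dict_index):
    """Compress one cluster against palette entry `dict_index`.

    Returns (dict_index, payload) so the decompressor knows which
    palette entry to preload.
    """
    zdict = PALETTE[dict_index]
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return dict_index, c.compress(data) + c.flush()

def decompress_cluster(dict_index, payload):
    zdict = PALETTE[dict_index]
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(payload) + d.flush()

# The choice of entry is the external optimisation problem; here it's
# simply picked by eye: the markup dictionary for an HTML fragment.
data = b"<html><head><title>hi</title></head><body>x</body></html>"
idx, blob = compress_cluster(data, 2)
assert decompress_cluster(idx, blob) == data
```

The choosing could live entirely in mkfs, with the kernel side only ever 
preloading whichever palette entry the index names.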
