Hi Simon,
On 2025/4/24 03:24, Simon Hosie wrote:
I've struggled to determine whether this is already a feature or in
development (possibly because of overloading of the term "dictionary"),
so I apologise in advance if the following brief is redundant:
Compressors like LZ4, zstd, and even gzip talk about "dictionary
compression", meaning pre-loading the history window of the compressor
and decompressor with pre-arranged patterns before the file is
processed, so that back-references can be made for text the first time
it appears in the file, rather than having to build that window up from
nothing at the start of the file by encoding everything as literals.
This can improve the compression ratio.
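For a concrete picture, zstd exposes exactly this in its simple
dictionary API; a minimal sketch (error handling elided, and the
dictionary bytes assumed to come from somewhere pre-arranged):

#include <zstd.h>

/* Both sides pre-load the same raw dictionary bytes, so the very first
 * occurrence of a pre-arranged pattern can be coded as a back-reference
 * instead of literals. */
static size_t compress_with_dict(void *dst, size_t dstCap,
				 const void *src, size_t srcSize,
				 const void *dict, size_t dictSize)
{
	ZSTD_CCtx *cctx = ZSTD_createCCtx();
	size_t n = ZSTD_compress_usingDict(cctx, dst, dstCap, src, srcSize,
					   dict, dictSize, 3 /* level */);
	ZSTD_freeCCtx(cctx);
	return n;	/* check with ZSTD_isError() in real code */
}

static size_t decompress_with_dict(void *dst, size_t dstCap,
				   const void *src, size_t srcSize,
				   const void *dict, size_t dictSize)
{
	ZSTD_DCtx *dctx = ZSTD_createDCtx();
	size_t n = ZSTD_decompress_usingDict(dctx, dst, dstCap, src, srcSize,
					     dict, dictSize);
	ZSTD_freeDCtx(dctx);
	return n;
}

The key point is that the same dictionary bytes must be available,
uncompressed, to both sides.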
It's generally only useful for small files, because in a larger file
the back-reference window is established early and remains full of
reference material for the rest of the file; but it should also benefit
block-based compression, which loses its history at every entry point.
So that's what I'm talking about; and my question, simply, is: is this
a feature (or a planned feature) of erofs? Something involving storing
a set of uncompressed dictionary preload chunks within the filesystem,
which are then used as the starting dictionary when compressing and
decompressing the small chunks of each file?
In my imagination such a filesystem might provide a palette of
uncompressed, page-aligned dictionaries, and each file (or each
cluster?) would carry an index of the palette entry it uses. Typically
that choice might be implied by the file type, but sometimes files have
different dispositions as you seek through them, or a .txt file may
contain English or Chinese or ASCII art, each demanding a different
dictionary. Making the right choice is an external optimisation
problem.
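Purely for illustration, the on-disk side of that palette might look
something like the sketch below; every structure, name and field width
here is invented for the sake of the example, nothing of the sort
exists today:

#include <stdint.h>

#define DICT_NONE 0xffff	/* no dictionary: start from an empty window */

/* One palette entry: an uncompressed, page-aligned dictionary blob. */
struct dict_palette_entry {
	uint32_t blkaddr;	/* start block of the dictionary data */
	uint32_t size;		/* dictionary length in bytes */
};

/* Per-file (or per-cluster?) selection: an index into the palette,
 * typically implied by the file type, but overridable where a file's
 * disposition changes as you seek through it. */
struct dict_selection {
	uint16_t dict_index;	/* palette index, or DICT_NONE */
};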
Thanks for your interest.
I'm aware of dictionary compression (and its benefit for small
compression units, e.g. 4KiB, as you said), and it's been on our
TODO list for years.
Actually I made an erofs-utils dictionary compression demo 4 years
ago (but EROFS didn't implement compressed data deduplication at
that time):
https://github.com/erofs/erofs-utils/tree/experimental-dictdemo
The part of this topic still under discussion is the dictionary
granularity:
 1) per-filesystem? I think it's almost useless, but it has the
    least extra dictionary I/O;
 2) per-inode?
 3) per-(sub)inode?
Since EROFS also supports compressed data deduplication (which means
a pcluster can be shared by different parts of an inode or by
different inodes), it makes the design of both dictionary generation
(since some uncompressed data can be deduplicated) and dictionary
selection harder.
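On the generation side, one could imagine mkfs.erofs sampling the
(already deduplicated) uncompressed chunks and training a dictionary
offline, e.g. with zstd's zdict API; a rough sketch under that
assumption, unrelated to the old demo above:

#include <zdict.h>

/* Train one dictionary from sampled uncompressed chunks.  The samples
 * are concatenated back-to-back in `samples'; sampleSizes[] gives the
 * length of each one.  Returns the dictionary size written into
 * dictBuf, or an error code checkable with ZDICT_isError(). */
static size_t train_dict(void *dictBuf, size_t dictCap,
			 const void *samples, const size_t *sampleSizes,
			 unsigned nbSamples)
{
	return ZDICT_trainFromBuffer(dictBuf, dictCap, samples,
				     sampleSizes, nbSamples);
}

Selection looks like the harder half to me: once a pcluster is shared,
all of its users have to agree on which dictionary it was compressed
against.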
If you have more ideas about the dictionary granularity and the
whole process, I'd be very interested in hearing them.
Thanks,
Gao Xiang