24 Apr 2025, 05:02 by hsiang...@linux.alibaba.com:

> On 2025/4/24 13:14, Simon Hosie wrote:
>
>> 23 Apr 2025, 17:27 by hsiang...@linux.alibaba.com:
>>
>>> Hi Simon,
>>>
>>> On 2025/4/24 03:24, Simon Hosie wrote:
>>>
>>>> I've struggled to determine if this is already a feature or in development 
>>>> or not (possibly because of overloading of the term "dictionary"), so I 
>>>> apologise in advance if the following brief is redundant:
>>>>
>>>> Compressors like LZ4, zstd, and even gzip talk about "dictionary 
>>>> compression", meaning that the history windows of the compressor and 
>>>> decompressor are pre-loaded with pre-arranged patterns before the file is 
>>>> processed, so that back references can be made for text the first time it 
>>>> appears in the file, rather than having to build up that window from an 
>>>> empty state at the start of the file by encoding everything as literals.
>>>>
>>>> This can lead to an improvement in compression ratio.
>>>>
>>>> It's generally only useful for small files, because in a larger file the 
>>>> back-reference window is established early and remains full of reference 
>>>> material for the rest of the file; but it should also benefit block-based 
>>>> compression, which loses its history at every entry point.
>>>>
>>>> So that's what I'm talking about; and my question, simply, is: is this a 
>>>> feature (or a planned feature) of erofs?  Something involving storing a 
>>>> set of uncompressed dictionary preload chunks within the filesystem which 
>>>> are then used as the starting dictionary when compressing and 
>>>> decompressing the small chunks of each file?
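
(Annotating my own question for concreteness: this is roughly what "starting
dictionary" means in terms of the stock libzstd one-shot API.  It's only a
sketch, not EROFS code; the wrapper function names are made up and error
handling is omitted.)

#include <zstd.h>

/* Compress one small chunk against a preset raw dictionary.  Both sides
 * must agree on the same dictionary bytes out of band. */
size_t compress_chunk(void *dst, size_t dst_cap,
                      const void *chunk, size_t chunk_len,
                      const void *dict, size_t dict_len)
{
        ZSTD_CCtx *cctx = ZSTD_createCCtx();
        size_t n = ZSTD_compress_usingDict(cctx, dst, dst_cap,
                                           chunk, chunk_len,
                                           dict, dict_len, 3);
        ZSTD_freeCCtx(cctx);
        return n;               /* check ZSTD_isError(n) in real code */
}

/* Decompress it again: the decompressor is handed the same dictionary. */
size_t decompress_chunk(void *dst, size_t dst_cap,
                        const void *src, size_t src_len,
                        const void *dict, size_t dict_len)
{
        ZSTD_DCtx *dctx = ZSTD_createDCtx();
        size_t n = ZSTD_decompress_usingDict(dctx, dst, dst_cap,
                                             src, src_len,
                                             dict, dict_len);
        ZSTD_freeDCtx(dctx);
        return n;
}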
>>>>
>>>> In my imagination such a filesystem might provide a palette of 
>>>> uncompressed, and page-aligned, dictionaries and each file (or each 
>>>> cluster?) would give an index to the entry which it will use.  Typically 
>>>> that choice might be implied by the file type, but sometimes files can 
>>>> have different dispositions as you seek through them, or a .txt file may 
>>>> contain English or Chinese or ASCII art, each demanding different 
>>>> dictionaries.  Making the right choice is an external optimisation problem.
>>>>
>>>
>>> Thanks for your interest.
>>>
>>> I know about dictionary compression (and its benefit for small
>>> compression units such as the 4KiB compression you mentioned),
>>> and it's been on our TODO list for years.
>>>
>>> Actually I made an erofs-utils dictionary compression demo 4
>>> years ago (but EROFS didn't implement compressed data deduplication
>>> at that time):
>>> https://github.com/erofs/erofs-utils/tree/experimental-dictdemo
>>>
>>> The part of this topic still open for discussion is the dictionary
>>> granularity:
>>>  1) per-filesystem?  I think it's almost useless, but it
>>>     has the least extra dictionary I/O.
>>>  2) per-inode?
>>>  3) per-(sub)inode?
>>>
>>> Since EROFS also supports compressed data deduplication (which
>>> means a pcluster can be shared by different parts of an inode, or
>>> by different inodes), it makes the design of dictionary generation
>>> (since some uncompressed data can be deduplicated) and dictionary
>>> selection harder.
>>>
>>> If you have more ideas about the dictionary granularity and
>>> the whole process, I'm very interested in hearing more.
>>>
>>> Thanks,
>>> Gao Xiang
>>>
>> Ok, well, the model I had in mind was to allocate maybe a few hundred kB of 
>> the filesystem image to various dictionaries optimised for different file 
>> types: some plain-text dictionaries, a generic JSON dictionary, a generic 
>> XML dictionary, a shared-object dictionary, and so on.  You'd enumerate 
>> those so that each file can choose the right dictionary using just a 
>> one-byte index into the table.
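
(To make that concrete, the kind of on-disk layout I'm picturing is roughly
the following.  This is purely hypothetical and illustrative; none of these
structure or field names exist in EROFS today.)

#include <linux/types.h>        /* __le16 / __le32 for on-disk endianness */

/* Hypothetical dictionary "palette", stored once per filesystem image. */
#define DICT_PALETTE_MAX        256     /* addressable by a one-byte index */

struct dict_entry {                     /* one slot in the palette */
        __le32 blkaddr;                 /* start block of the raw dictionary */
        __le32 size;                    /* dictionary size in bytes */
};

struct dict_table {
        __le16 count;                   /* number of slots actually in use */
        __le16 reserved;
        struct dict_entry entries[];    /* entries[dict_id] */
};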
>>
>> Hopefully an app will only work intensively with a couple of file types at a 
>> time so only a couple of dictionaries need be paged in at a time.
>>
>> My intuition tells me that diminishing returns will have set in well before 
>> 256 different dictionaries, and that having more than that would be mostly 
>> harmful; but I guess that's something which needs testing.
>>
>> I understand that right now every file can specify its own compression 
>> method, so it makes sense to also record the dictionary index in that same 
>> data structure whenever the chosen compression method uses a dictionary.
>>
>> By default mkfs would probably just look at the input files and a directory 
>> full of pre-cooked dictionaries, and, based on each file's extension and the 
>> availability of a matching dictionary, it would put that dictionary in the 
>> next slot and mark all matching files as being compressed against that 
>> dictionary number.
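
(Roughly, in pseudo-C, the default policy I mean is just an extension lookup;
the rule table and the load_dict_once() helper are invented for illustration.)

#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a)   (sizeof(a) / sizeof((a)[0]))

struct dict_rule {
        const char *ext;        /* file extension to match */
        const char *dict_file;  /* pre-cooked dictionary in the dictionary dir */
};

static const struct dict_rule default_rules[] = {
        { ".txt",  "text.dict" },
        { ".json", "json.dict" },
        { ".xml",  "xml.dict"  },
        { ".so",   "elf.dict"  },
};

/* Hypothetical helper: loads the named dictionary into the next free palette
 * slot the first time it's referenced and returns that slot index. */
int load_dict_once(const char *dict_file);

/* Return the dictionary slot for this file, or -1 for "no dictionary". */
static int pick_dict_slot(const char *filename)
{
        const char *ext = strrchr(filename, '.');

        if (!ext)
                return -1;
        for (size_t i = 0; i < ARRAY_SIZE(default_rules); i++)
                if (!strcmp(ext, default_rules[i].ext))
                        return load_dict_once(default_rules[i].dict_file);
        return -1;
}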
>>
>> Then the user can look at which sets of file types missed out on a 
>> dictionary and how much space they're using, and they can decide if they 
>> want to make another dictionary for those files as well, or just make them 
>> share a dictionary with another file type. Or maybe they want to split a 
>> group of files because they'll benefit from having different versions of a 
>> dictionary.  Or maybe they'll write their own optimiser to decide the 
>> optimal file groups by grinding on the problem with hundreds of GPUs for 
>> weeks on end.
>>
>> If one file is particularly large it might warrant a dedicated dictionary, 
>> but I wouldn't plan for that as the general case.
>>
>> Once those decisions have been finalised the dictionaries can be 
>> re-optimised against the actual files to get the best compression.
>>
>> There's also the question of whether a file would want to select two or 
>> three small dictionaries to concatenate.  Like, for an XML file containing 
>> English tags and Chinese text, or something like that.  Or it might want to 
>> switch dictionaries halfway through the file.  That's probably 
>> over-engineering, though.
>>
>> I think that's more or less all the things I've thought about the problem so 
>> far.
>>
>
> Ok, I see your idea on this. My overall thought is to get the
> numbers and a demo first, and then think more about how to land this.
> If you're interested in developing this, that's awesome!
>
> BTW, one part I'm not sure you noticed is that small files
> (or even whole files) can be moved to the virtual fragment
> inode and compressed together by using the `-Efragments` (or even
> `-Eall-fragments`) option.
>
> So I think we have to develop sub-file dictionary support,
> otherwise it will conflict with the fragment compression feature.
>
> Thanks,
> Gao Xiang
>
I'm not sure if I can find the time to do the research myself, but I think it's 
at least important to note all my assumptions and open questions anyway.  That 
might make it easier to formalise into a set of research tasks for an 
interested volunteer.

Then again, I might find time to experiment with it, though I should focus on 
my day job.  Just in case, is there a test corpus for benchmarking filesystem 
compression that I should run tests on?

* How big do the dictionaries need to be?  Do they all have to be the same 
size?  I think they certainly have to be multiples of the MMU page size so they 
can be page-aligned on disk.

* If a small dictionary suffices, is it ok to pack two unrelated dictionaries 
together in the same slot so that two different file types can use different 
parts of the same dictionary?

* Is it true that the needs of all realistic filesystems can be met with fewer 
than 256 dictionaries for the whole system?  How many is a reasonable goal or a 
reasonable upper limit?

* Are there cases where multiple dictionaries per file have enough impact to 
justify the complexity?

Of course, it might actually be easier to implement if the dictionary number is 
specified separately on every cluster.  In that case, it's definitely better 
to allow that flexibility.  Even if the default behaviour is just to use the 
same dictionary for the whole file, it's a tiny overhead which future revisions 
of mkfs could put to better use.
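
(As a sketch of what "per cluster" could mean on disk, one extra byte next to
whatever already records the compression format per pcluster would be enough.
Again, this is purely hypothetical and not the real EROFS layout.)

#include <linux/types.h>

/* Hypothetical per-pcluster descriptor; field names are invented. */
struct pcluster_desc {
        __le32 blkaddr;         /* where the compressed data lives */
        __u8   algorithm;       /* compression format for this pcluster */
        __u8   dict_id;         /* index into the dictionary palette, 0xff = none */
        __le16 reserved;
};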

I think for the tail-merging cases, the natural thing to do is to merge only 
tails of files of the same type, or only files using the same compression 
dictionary.  Even without a dictionary involved, it should always be preferable 
to merge tails of files of the same type, because they're much more likely to 
share strings which compress well together.  It's not optimal to merge the tail 
of an HTML file with the tail of a PNG file and try to share the same 
compression, for example; merge all the HTML tails together first.
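
(As a sketch, the grouping could be as simple as ordering tail fragments by
their chosen dictionary, falling back to the file extension, before packing
them into the fragment inode; all names below are invented for illustration.)

#include <stdlib.h>
#include <string.h>

struct tail_frag {
        const char *ext;        /* file extension, e.g. ".html" */
        int dict_id;            /* chosen dictionary slot, -1 if none */
        /* ... data pointer, length, owning inode ... */
};

/* Order fragments so that tails sharing a dictionary (or at least a file
 * type) end up adjacent and can compress against each other. */
static int cmp_tail(const void *a, const void *b)
{
        const struct tail_frag *x = a, *y = b;

        if (x->dict_id != y->dict_id)
                return x->dict_id - y->dict_id;     /* ids are small (< 256) */
        return strcmp(x->ext, y->ext);
}

/* qsort(frags, nr, sizeof(*frags), cmp_tail); then pack in this order. */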
