On 2025/4/25 12:20, Simon Hosie wrote:
24 Apr 2025, 05:02 by hsiang...@linux.alibaba.com:
...
I'm not sure I can find the time to do the research myself, but I think it's at least important to note all my assumptions and open questions anyway. That might make it easier to formalise them into a set of research tasks for an interested volunteer. I might still find time to experiment with it myself, although I should really be focusing on my day job. Just in case, is there a test corpus for benchmarking filesystem compression that I should run tests on?
There is no golden test corpus, but dictionary compression should have real users if implemented, e.g. the Android system image case: currently most vendors use 4k lz4hc compression for Android, and I think dictionary compression should help reduce the image sizes. So you could benchmark with Android system image data.
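If it helps, here is roughly the kind of measurement I mean, as a quick userspace sketch rather than anything EROFS-specific: it trains one dictionary with zstd's ZDICT trainer (used only because libzstd bundles a trainer; the trained dictionary could just as well be fed to lz4hc via LZ4_loadDictHC) and then compresses the image in independent 4k blocks with and without it. The 64k dictionary budget, level 19 and the "system.img" filename are arbitrary assumptions:

/*
 * Rough benchmark sketch (not EROFS code): how much does a trained
 * dictionary help when an image is compressed in independent 4k
 * blocks, which roughly approximates 4k pclusters?
 * Build with: cc bench.c -lzstd
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>
#include <zdict.h>

#define BLK 4096
#define DICT_CAP (64 * 1024)	/* assumed dictionary budget: 64k */

int main(int argc, char **argv)
{
	FILE *f = fopen(argc > 1 ? argv[1] : "system.img", "rb");
	if (!f) { perror("open"); return 1; }
	fseek(f, 0, SEEK_END);
	long sz = ftell(f);
	fseek(f, 0, SEEK_SET);
	/* fine for a sketch; a real run would mmap or stream the image */
	char *buf = malloc(sz);
	if (fread(buf, 1, sz, f) != (size_t)sz) { perror("read"); return 1; }
	fclose(f);

	unsigned nblk = sz / BLK;
	size_t *lens = malloc(nblk * sizeof(*lens));
	for (unsigned i = 0; i < nblk; i++)
		lens[i] = BLK;

	/* train one dictionary from all 4k blocks of the image */
	char *dict = malloc(DICT_CAP);
	size_t dictsz = ZDICT_trainFromBuffer(dict, DICT_CAP, buf, lens, nblk);
	if (ZDICT_isError(dictsz)) {
		fprintf(stderr, "training failed: %s\n",
			ZDICT_getErrorName(dictsz));
		return 1;
	}

	ZSTD_CCtx *cctx = ZSTD_createCCtx();
	char *dst = malloc(ZSTD_compressBound(BLK));
	size_t plain = 0, with_dict = 0;

	for (unsigned i = 0; i < nblk; i++) {
		const char *src = buf + (size_t)i * BLK;

		size_t a = ZSTD_compress(dst, ZSTD_compressBound(BLK),
					 src, BLK, 19);
		size_t b = ZSTD_compress_usingDict(cctx, dst,
					 ZSTD_compressBound(BLK),
					 src, BLK, dict, dictsz, 19);
		/* count incompressible blocks as stored uncompressed */
		plain += ZSTD_isError(a) ? BLK : a;
		with_dict += ZSTD_isError(b) ? BLK : b;
	}
	printf("4k blocks: %u\nno dict:   %zu bytes\nwith dict: %zu bytes (+%zuk dictionary)\n",
	       nblk, plain, with_dict, dictsz >> 10);
	ZSTD_freeCCtx(cctx);
	return 0;
}

If the with-dictionary total plus the (amortised) dictionary size is clearly smaller than the plain total on real Android system data, that would be a strong argument for the feature.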
* How big do the dictionaries need to be? Do they have to all be the same size? I think they certainly have to be multiples of the MMU page size so they can be page-aligned on disk.
I think it'd be much better to align with the fs block size (typically 4k), but multiples of 4k are also acceptable.
* If a small dictionary suffices, is it ok to pack two unrelated dictionaries together in the same slot so that two different file types can use different parts of the same dictionary?
I think it's ok and can be implemented.
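For example (just a sketch, nothing like this exists on disk today), a sub-dictionary could be referenced by an (offset, length) pair within a 4k-aligned slot instead of by the slot alone, so two file types can share one slot:

#include <linux/types.h>

/* sketch only: addressing two unrelated sub-dictionaries that share
 * one 4k-aligned on-disk dictionary slot */
struct subdict_ref {
	__le16 slot;	/* which 4k-aligned dictionary slot on disk */
	__le16 offset;	/* byte offset of the sub-dictionary inside the slot */
	__le16 length;	/* sub-dictionary length in bytes */
};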
* Is it true that the needs of all realistic filesystems can be met with fewer than 256 dictionaries for the whole system? How many is a reasonable goal or a reasonable upper limit?
I'm not sure.
* Are there cases where multiple dictionaries per file have enough impact to justify the complexity?
But currently files consist of EROFS pclusters, and if compressed data deduplication is on, one file could reuse a pcluster from another file. So we have to support multiple dictionaries per file, otherwise the on-disk format would be flawed.
Of course, it might actually be easier to implement if the dictionary number is specified separately on every cluster? In which case, it's definitely better to allow that flexibility. Even if the default behaviour is to just use the same dictionary for the whole file, it's a tiny overhead which could be used better in future revisions of mkfs.
I think so, but we still have to work out how many bits are enough to represent the dictionary ID, and also how to train those dictionaries efficiently.
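Just to make the open question concrete, here is a sketch (not the real EROFS on-disk format; all struct and field names are made up) of what a per-pcluster dictionary reference could look like if 8 bits turn out to be enough, i.e. at most 255 dictionaries per image plus 0 for "no dictionary":

#include <linux/types.h>

/* sketch only: one global table of dictionary slots per image ... */
struct dict_slot {
	__le32 blkaddr;		/* fs-block-aligned start of the dictionary */
	__le16 nblocks;		/* dictionary size in 4k blocks */
	__le16 reserved;
};

/* ... and a small dictionary ID carried by each pcluster; whether 8 bits
 * (or 16) are enough is exactly the question above */
struct pcluster_dict_ref {
	__le32 pcluster_blkaddr;	/* compressed data location */
	__u8   algorithm;		/* lz4hc, lzma, deflate, zstd, ... */
	__u8   dict_id;			/* 0 == no dictionary, else dict_slot index */
	__le16 reserved;
};

Even a 16-bit ID would only cost two bytes per pcluster, so keeping the format flexible is cheap; training the dictionaries efficiently is probably the harder problem.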
I think for the tail merging cases, the natural thing to do is to only merge tails of files of the same type.
Currently there is no such strategy. Also, since EROFS supports compressed data deduplication, duplicated tails will be deduplicated regardless of the order anyway. But I understand your point is about dictionary efficiency.
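If someone wants to experiment, a trivial grouping pass in mkfs would be enough to try it. Something like the sketch below (erofs-utils has no such knob today, and the struct and helper names here are made up) simply orders tail-packing candidates by filename suffix so that tails of the same type end up merged together:

#include <stdlib.h>
#include <string.h>

/* sketch only: a hypothetical per-file tail-packing candidate in mkfs */
struct tail_candidate {
	const char *path;	/* source path of the file */
	/* ... tail data, inode, etc. ... */
};

static const char *suffix(const char *path)
{
	const char *dot = strrchr(path, '.');
	return dot ? dot + 1 : "";
}

static int cmp_by_suffix(const void *a, const void *b)
{
	const struct tail_candidate *ta = a, *tb = b;
	return strcmp(suffix(ta->path), suffix(tb->path));
}

/* sort candidates so that e.g. all *.html tails become neighbours and
 * get merged into the same tail pcluster */
static void group_tails(struct tail_candidate *c, size_t n)
{
	qsort(c, n, sizeof(*c), cmp_by_suffix);
}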
Or only files using the same compression dictionary. Even without a dictionary involved it should always be preferable to merge tails of files of the same type because they're much more likely to share strings which can compress together. It's not optimal to merge the tail of an HTML file with the tail of a PNG file and try to share the same compression, for example. Merge all the HTML tails together first.
Yeah, agreed here.

Thanks,
Gao Xiang