On 2025/4/25 12:20, Simon Hosie wrote:
24 Apr 2025, 05:02 by hsiang...@linux.alibaba.com:
...
I'm not sure I can find the time to do the research myself, but I think it's at least important to note all my assumptions and open questions anyway. That might make it easier to formalise them into a set of research tasks for an interested volunteer. I might still find time to experiment with it myself, although I should really be focusing on my day job. Just in case, is there a test corpus for benchmarking filesystem compression that I should run tests on?
There is no golden test corpus, but dictionary compression should have real users if implemented, e.g. the Android system image case: currently most vendors use 4k lz4hc compression for Android, and I think dictionary compression should help reduce the image sizes. So you could benchmark with Android system image data.
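If it helps, here is roughly the kind of measurement I mean, as a quick userspace sketch rather than anything EROFS-specific: it trains one dictionary with zstd's ZDICT trainer (used only because libzstd bundles a trainer; the trained dictionary could just as well be fed to lz4hc via LZ4_loadDictHC) and then compresses the image in independent 4k blocks with and without it. The 64k dictionary budget, level 19 and the "system.img" filename are arbitrary assumptions:

/*
 * Rough benchmark sketch (not EROFS code): how much does a trained
 * dictionary help when an image is compressed in independent 4k
 * blocks, which roughly approximates 4k pclusters?
 * Build with: cc bench.c -lzstd
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>
#include <zdict.h>

#define BLK 4096
#define DICT_CAP (64 * 1024)	/* assumed dictionary budget: 64k */

int main(int argc, char **argv)
{
	FILE *f = fopen(argc > 1 ? argv[1] : "system.img", "rb");
	if (!f) { perror("open"); return 1; }
	fseek(f, 0, SEEK_END);
	long sz = ftell(f);
	fseek(f, 0, SEEK_SET);
	/* fine for a sketch; a real run would mmap or stream the image */
	char *buf = malloc(sz);
	if (fread(buf, 1, sz, f) != (size_t)sz) { perror("read"); return 1; }
	fclose(f);

	unsigned nblk = sz / BLK;
	size_t *lens = malloc(nblk * sizeof(*lens));
	for (unsigned i = 0; i < nblk; i++)
		lens[i] = BLK;

	/* train one dictionary from all 4k blocks of the image */
	char *dict = malloc(DICT_CAP);
	size_t dictsz = ZDICT_trainFromBuffer(dict, DICT_CAP, buf, lens, nblk);
	if (ZDICT_isError(dictsz)) {
		fprintf(stderr, "training failed: %s\n",
			ZDICT_getErrorName(dictsz));
		return 1;
	}

	ZSTD_CCtx *cctx = ZSTD_createCCtx();
	char *dst = malloc(ZSTD_compressBound(BLK));
	size_t plain = 0, with_dict = 0;

	for (unsigned i = 0; i < nblk; i++) {
		const char *src = buf + (size_t)i * BLK;

		size_t a = ZSTD_compress(dst, ZSTD_compressBound(BLK),
					 src, BLK, 19);
		size_t b = ZSTD_compress_usingDict(cctx, dst,
					 ZSTD_compressBound(BLK),
					 src, BLK, dict, dictsz, 19);
		/* count incompressible blocks as stored uncompressed */
		plain += ZSTD_isError(a) ? BLK : a;
		with_dict += ZSTD_isError(b) ? BLK : b;
	}
	printf("4k blocks: %u\nno dict:   %zu bytes\nwith dict: %zu bytes (+%zuk dictionary)\n",
	       nblk, plain, with_dict, dictsz >> 10);
	ZSTD_freeCCtx(cctx);
	return 0;
}

If the with-dictionary total plus the (amortised) dictionary size is clearly smaller than the plain total on real Android system data, that would be a strong argument for the feature.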
* How big do the dictionaries need to be? Do they have to all be the same size? I think they certainly have to be multiples of the MMU page size so they can be page-aligned on disk.
I think it'd be much better to align with the fs block size (typically 4k), but multiples of 4k are also acceptable.
* If a small dictionary suffices, is it ok to pack two unrelated dictionaries together in the same slot so that two different file types can use different parts of the same dictionary?
I think it's ok and can be implemented.
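For example (just a sketch, nothing like this exists on disk today), a sub-dictionary could be referenced by an (offset, length) pair within a 4k-aligned slot instead of by the slot alone, so two file types can share one slot:

#include <linux/types.h>

/* sketch only: addressing two unrelated sub-dictionaries that share
 * one 4k-aligned on-disk dictionary slot */
struct subdict_ref {
	__le16 slot;	/* which 4k-aligned dictionary slot on disk */
	__le16 offset;	/* byte offset of the sub-dictionary inside the slot */
	__le16 length;	/* sub-dictionary length in bytes */
};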
* Is it true that the needs of all realistic filesystems can be met with fewer than 256 dictionaries for the whole system? How many is a reasonable goal or a reasonable upper limit?
I'm not sure.
* Are there cases where multiple dictionaries per file have enough impact to justify the complexity?
But currently files consist of EROFS pclusters, and if compressed data deduplication is on, one file could reuse a pcluster from another file. So we have to support multiple dictionaries per file, otherwise the on-disk format would be flawed.
Of course, it might actually be easier to implement if the dictionary number is specified separately on every cluster? In which case, it's definitely better to allow that flexibility. Even if the default behaviour is to just use the same dictionary for the whole file, it's a tiny overhead which could be used better in future revisions of mkfs.
I think so, but we still have to work out how many bits are enough to represent the dictionary ID, and also how to train those dictionaries efficiently.
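Just to make the open question concrete, here is a sketch (not the real EROFS on-disk format; all struct and field names are made up) of what a per-pcluster dictionary reference could look like if 8 bits turn out to be enough, i.e. at most 255 dictionaries per image plus 0 for "no dictionary":

#include <linux/types.h>

/* sketch only: one global table of dictionary slots per image ... */
struct dict_slot {
	__le32 blkaddr;		/* fs-block-aligned start of the dictionary */
	__le16 nblocks;		/* dictionary size in 4k blocks */
	__le16 reserved;
};

/* ... and a small dictionary ID carried by each pcluster; whether 8 bits
 * (or 16) are enough is exactly the question above */
struct pcluster_dict_ref {
	__le32 pcluster_blkaddr;	/* compressed data location */
	__u8   algorithm;		/* lz4hc, lzma, deflate, zstd, ... */
	__u8   dict_id;			/* 0 == no dictionary, else dict_slot index */
	__le16 reserved;
};

Even a 16-bit ID would only cost two bytes per pcluster, so keeping the format flexible is cheap; training the dictionaries efficiently is probably the harder problem.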
I think for the tail merging cases, the natural thing to do is to only merge tails of files of the same type.
Currently there is no such strategy. Also, since EROFS supports compressed data deduplication, duplicated tails will be deduplicated regardless of the order anyway. But I understand your point is about dictionary efficiency.
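If someone wants to experiment, a trivial grouping pass in mkfs would be enough to try it. Something like the sketch below (erofs-utils has no such knob today, and the struct and helper names here are made up) simply orders tail-packing candidates by filename suffix so that tails of the same type end up merged together:

#include <stdlib.h>
#include <string.h>

/* sketch only: a hypothetical per-file tail-packing candidate in mkfs */
struct tail_candidate {
	const char *path;	/* source path of the file */
	/* ... tail data, inode, etc. ... */
};

static const char *suffix(const char *path)
{
	const char *dot = strrchr(path, '.');
	return dot ? dot + 1 : "";
}

static int cmp_by_suffix(const void *a, const void *b)
{
	const struct tail_candidate *ta = a, *tb = b;
	return strcmp(suffix(ta->path), suffix(tb->path));
}

/* sort candidates so that e.g. all *.html tails become neighbours and
 * get merged into the same tail pcluster */
static void group_tails(struct tail_candidate *c, size_t n)
{
	qsort(c, n, sizeof(*c), cmp_by_suffix);
}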
Or only files using the same compression dictionary. Even without a dictionary involved it should always be preferable to merge tails of files of the same type because they're much more likely to share strings which can compress together. It's not optimal to merge the tail of an HTML file with the tail of a PNG file and try to share the same compression, for example. Merge all the HTML tails together first.
Yeah, agreed here.

Thanks,
Gao Xiang