Hi Nithurshen,

On 2026/3/2 15:32, Nithurshen wrote:
This is a Proof of Concept to introduce a scalable, multi-threaded
decompression framework into fsck.erofs to reduce extraction time.

Baseline Profiling:
Using the Linux 6.7 kernel source packed with LZ4HC (4K pclusters),
perf showed a strictly synchronous execution path. The main thread
spent ~52% of its time in LZ4_decompress_safe, heavily blocked by
synchronous I/O (~32% in el0_svc/vfs_read).

First Iteration (Naive Workqueue):
A standard producer-consumer workqueue overlapping compute with pwrite()
suffered massive scheduling overhead. For 4KB LZ4 clusters, workers
spent ~44% of CPU time spinning on __arm64_sys_futex and try_to_wake_up.

Current PoC (Dynamic Pcluster Batching):
To eliminate lock contention, this patch introduces a batching context.
Instead of queuing 1 pcluster per task, the main thread collects an
array of sequential pclusters (Z_EROFS_PCLUSTER_BATCH_SIZE = 32) before
submitting a single erofs_work unit.

Results:
- Scheduling overhead (futex) dropped significantly.
- Workers stay cache-hot, decompressing 32 blocks per wakeup.
- LZ4_decompress_safe is successfully offloaded to background cores
   (~18.8% self-execution time), completely decoupled from main thread I/O.

Glad to see the improvements, I think there are more room
to be improved anyway.

Also there are still some follow-up works, I'm busy these
two days, but I will release a formal gsoc page later.

Thanks,
Gao Xiang

Reply via email to