Hi Lee,
A quick search shows that x86 kernel implementation also use `div` instruction
under Linux v6.19 and GCC 15.2.1, add GCC correctly generate shift instruction
in my arm64 machine with GCC 14.2.0. Could you also consider evaluate this optimization in kernel? Thanks, Yifan On 2/26/2026 1:30 AM, Ashley Lee wrote:
perf on fsck.erofs in gcc reports that z_erofs_load_compact_lcluster was spending 20% of its time doing the div instruction. While the function itself is ~40% of user runtime. In the source code, it seems that dividing by vcnt doesn't optimize to a shift despite the two possible states being powers of 2. Changing the division into a ilog2() function call encourages the compiler to recognize it as a power of 2. Thus performing a shift. Running a benchmark on lzma compressed freebsd code on x86, shows there is a ~4% increase in performance in gcc. While clang shows virtually no regression in performance. The tradeoff is slightly obfuscated source code. The following command was run locally on x86. $ hyperfine -w 5 -m 30 "./fsck.erofs ./bsd.erofs.lzma" patch on gcc 15.2.1 Time (mean ± σ): 354.8 ms ± 6.0 ms \ [User: 227.8 ms, System: 126.1 ms] Range (min … max): 345.8 ms … 366.2 ms 30 runs dev on gcc 15.2.1 Time (mean ± σ): 370.7 ms ± 6.7 ms \ [User: 246.5 ms, System: 123.4 ms] Range (min … max): 362.7 ms … 390.7 ms 30 runs patch on clang 21.1.8 Time (mean ± σ): 371.9 ms ± 2.4 ms \ [User: 247.2 ms, System: 123.9 ms] Range (min … max): 369.1 ms … 380.0 ms 30 runs dev on clang 21.1.8 Time (mean ± σ): 371.0 ms ± 1.9 ms \ [User: 245.5 ms, System: 124.5 ms] Range (min … max): 368.4 ms … 377.7 ms 30 runs Signed-off-by: Ashley Lee <[email protected]> --- v2: changed vdiv to ilog2 call lib/zmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/zmap.c b/lib/zmap.c index baec278..3ac7fe9 100644 --- a/lib/zmap.c +++ b/lib/zmap.c @@ -160,7 +160,7 @@ static int z_erofs_load_compact_lcluster(struct z_erofs_maprecorder *m, m->nextpackoff = round_down(pos, vcnt << amortizedshift) + (vcnt << amortizedshift); lobits = max(lclusterbits, ilog2(Z_EROFS_LI_D0_CBLKCNT) + 1U); - encodebits = ((vcnt << amortizedshift) - sizeof(__le32)) * 8 / vcnt; + encodebits = (((vcnt << amortizedshift) - sizeof(__le32)) * 8) >> ilog2(vcnt); bytes = pos & ((vcnt << amortizedshift) - 1); in -= bytes; i = bytes >> amortizedshift;
