The attachment "Small patch to inline ZSTD_copy(4|8|16)" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu- reviewers, unsubscribe the team.
[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

** Tags added: patch

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to zfs-linux in Ubuntu.
https://bugs.launchpad.net/bugs/2099745

Title:
  Poor code generation in shipped zfs.ko resulting in ~5x zstd decompression (and likely compression too) slowdown. Freshly built zfs.ko from zfs-dkms is fine

Status in zfs-linux package in Ubuntu:
  New

Bug description:
  1. ********* System info *********:

  - lscpu | grep -E "Model name|Flags"
  Model name: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities

  - lsb_release -rd
  Description: Ubuntu 24.04.2 LTS
  Release: 24.04

  - apt-cache policy linux-modules-6.8.0-53-generic zfs-dkms
  linux-modules-6.8.0-53-generic:
    Installed: 6.8.0-53.55
    Candidate: 6.8.0-53.55
    Version table:
   *** 6.8.0-53.55 500
          500 http://gb.archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages
          100 /var/lib/dpkg/status
  zfs-dkms:
    Installed: 2.2.2-0ubuntu9.1
    Candidate: 2.2.2-0ubuntu9.1
    Version table:
   *** 2.2.2-0ubuntu9.1 500
          500 http://gb.archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages
          100 /var/lib/dpkg/status
       2.2.2-0ubuntu9 500
          500 http://gb.archive.ubuntu.com/ubuntu noble/universe amd64 Packages

  2. ************ Summary *************:

  - Poor code generation in the main zstd decompression loop of the shipped zfs.ko results in a ~5x decompression slowdown compared to a freshly built zfs.ko from zfs-dkms. Zstd compression is likely affected as well.

  3. *********** Reproduction *********:

  - Create a test dataset with encryption disabled (to rule out encryption as a cause):
  sudo zfs create -o encryption=off -o recordsize=128k -o compression=zstd rpool/test

  - Get a compressible file of roughly 500MB (anything will do as long as it is compressible). For example:
  wget https://download.qt.io/archive/qt/4.8/4.8.2/qt-everywhere-opensource-src-4.8.2.tar.gz
  gzip -d qt-everywhere-opensource-src-4.8.2.tar.gz

  - Benchmark sequential reads with fio:
  mv -v qt-everywhere-opensource-src-4.8.2.tar qt-everywhere-opensource-src-4.8.2.tar.0.0
  zpool sync
  # Run the fio command a few times to ensure the data is in the ARC.
  fio --name=qt-everywhere-opensource-src-4.8.2.tar --bs=128k --rw=read --numjobs=1 --iodepth=1 --size=500M --group_reporting --loops=10

  4. ********** Expected results ********:

  - The fio command in step 3 benchmarks sequential cached reads from the ARC. We are therefore measuring single-threaded zstd decompression performance in zfs.ko.
  - With the shipped zfs.ko:

  qt-everywhere-opensource-src-4.8.2.tar: (g=0): rw=randread, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
  fio-3.36
  Starting 1 process
  Jobs: 1 (f=1): [r(1)][100.0%][r=205MiB/s][r=1642 IOPS][eta 00m:00s]
  qt-everywhere-opensource-src-4.8.2.tar: (groupid=0, jobs=1): err= 0: pid=4354: Fri Feb 21 13:22:43 2025
    read: IOPS=1651, BW=206MiB/s (216MB/s)(5000MiB/24223msec)
      clat (usec): min=12, max=2677, avg=604.97, stdev=459.39
       lat (usec): min=12, max=2677, avg=605.02, stdev=459.39
      clat percentiles (usec):
       |  1.00th=[   23],  5.00th=[   25], 10.00th=[   25], 20.00th=[   27],
       | 30.00th=[  202], 40.00th=[  519], 50.00th=[  660], 60.00th=[  758],
       | 70.00th=[  881], 80.00th=[ 1029], 90.00th=[ 1156], 95.00th=[ 1352],
       | 99.00th=[ 1680], 99.50th=[ 1745], 99.90th=[ 2057], 99.95th=[ 2114],
       | 99.99th=[ 2180]
     bw (  KiB/s): min=199936, max=219648, per=99.95%, avg=211264.00, stdev=5629.40, samples=48
     iops        : min= 1562, max= 1716, avg=1650.50, stdev=43.98, samples=48
    lat (usec)   : 20=0.02%, 50=27.77%, 100=0.62%, 250=2.70%, 500=7.67%
    lat (usec)   : 750=20.08%, 1000=19.07%
    lat (msec)   : 2=21.91%, 4=0.17%
    cpu          : usr=0.19%, sys=99.72%, ctx=79, majf=0, minf=41
    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       issued rwts: total=40000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=1

  Run status group 0 (all jobs):
     READ: bw=206MiB/s (216MB/s), 206MiB/s-206MiB/s (216MB/s-216MB/s), io=5000MiB (5243MB), run=24223-24223msec

  - With zfs.ko from zfs-dkms:

  qt-everywhere-opensource-src-4.8.2.tar: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
  fio-3.36
  Starting 1 process
  Jobs: 1 (f=1): [R(1)][100.0%][r=1152MiB/s][r=9219 IOPS][eta 00m:00s]
  qt-everywhere-opensource-src-4.8.2.tar: (groupid=0, jobs=1): err= 0: pid=692842: Fri Feb 21 13:11:39 2025
    read: IOPS=8871, BW=1109MiB/s (1163MB/s)(5000MiB/4509msec)
      clat (usec): min=9, max=770, avg=112.26, stdev=96.85
       lat (usec): min=9, max=770, avg=112.29, stdev=96.85
      clat percentiles (usec):
       |  1.00th=[   11],  5.00th=[   12], 10.00th=[   12], 20.00th=[   16],
       | 30.00th=[   23], 40.00th=[   25], 50.00th=[  122], 60.00th=[  151],
       | 70.00th=[  174], 80.00th=[  200], 90.00th=[  229], 95.00th=[  249],
       | 99.00th=[  322], 99.50th=[  553], 99.90th=[  644], 99.95th=[  676],
       | 99.99th=[  725]
     bw (  MiB/s): min=  936, max= 1466, per=100.00%, avg=1109.14, stdev=148.75, samples=9
     iops        : min= 7488, max=11734, avg=8873.11, stdev=1190.03, samples=9
    lat (usec)   : 10=0.10%, 20=20.52%, 50=22.47%, 100=3.51%, 250=48.45%
    lat (usec)   : 500=4.35%, 750=0.60%, 1000=0.01%
    cpu          : usr=0.67%, sys=99.29%, ctx=15, majf=0, minf=42
    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       issued rwts: total=40000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=1

  Run status group 0 (all jobs):
     READ: bw=1109MiB/s (1163MB/s), 1109MiB/s-1109MiB/s (1163MB/s-1163MB/s), io=5000MiB (5243MB), run=4509-4509msec

  - There is therefore a ~5x decompression performance difference (206 MiB/s vs 1109 MiB/s) between the shipped zfs.ko and the zfs.ko from zfs-dkms.
  - As a baseline, I have built and benchmarked zstd 1.4.5 (the version used in ZFS) on the same file:

  $ ./zstd-git/programs/zstd --version
  *** zstd command line interface 64-bits v1.4.5, by Yann Collet ***
  $ ./zstd-git/programs/zstd -b3 -B128KB qt-everywhere-opensource-src-4.8.2.tar.0.0
  3#src-4.8.2.tar.0.0 : 647249920 -> 247589078 (2.614), 370.5 MB/s, 1861.8 MB/s

  5. ********** Investigation ************:

  - I used perf to profile the fio command, both with the shipped zfs.ko and with the freshly built zfs.ko from zfs-dkms. The two flame graphs are attached.

  - With both versions of zfs.ko, the majority of the time is spent inside ZSTD_decompressSequences_bmi2.constprop.0. As expected, the BMI2 version of ZSTD_decompressSequences is selected in both cases.

  - However, in the shipped zfs.ko the main loop of ZSTD_decompressSequences_bmi2.constprop.0 contains many calls to small functions that should have been inlined (MEM_64bits, MEM_32bits, BIT_reloadDStream, BIT_readBits, BIT_readBitsFast, ZSTD_copy16). MEM_64bits and MEM_32bits are particularly bad, since their definitions are:

  MEM_STATIC unsigned MEM_32bits(void) { return sizeof(size_t)==4; }
  MEM_STATIC unsigned MEM_64bits(void) { return sizeof(size_t)==8; }

  - In the freshly built zfs.ko from zfs-dkms these small functions have been inlined, which has allowed the compiler to actually make use of BMI2 instructions (shlx and shrx).

  - As far as I can tell, all shipped zfs.ko builds are affected by this problem. I have looked at zfs.ko in:
    - 5.15.0-133-generic
    - 6.8.0-56-generic
    - 6.11.0-18-generic
    - 6.12.0-15-generic
  and they all appear to have the same issue.

  - Zstd compression is likely affected as well, since the main loop of ZSTD_encodeSequences_bmi2 appears to have the same issue. I have not tested it, however.

  - There remain a few calls to ZSTD_copy4, ZSTD_copy8 and ZSTD_copy16 that should also be inlined (a sketch of what such a change looks like follows below).
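  For context, below is a minimal sketch of the situation, assuming the definitions found in the zstd 1.4.5 sources bundled by ZFS; the exact MEM_STATIC expansion and file layout in the shipped tree may differ slightly, and the attached patch remains authoritative:

  #include <string.h>   /* memcpy */

  /* MEM_STATIC is zstd's "static inline" macro; in zstd 1.4.5 it expands
   * roughly as follows for GCC/Clang. */
  #if defined(__GNUC__)
  #  define MEM_STATIC static __inline __attribute__((unused))
  #else
  #  define MEM_STATIC static inline
  #endif

  /* These helpers already carry MEM_STATIC upstream (quoted above). */
  MEM_STATIC unsigned MEM_32bits(void) { return sizeof(size_t)==4; }
  MEM_STATIC unsigned MEM_64bits(void) { return sizeof(size_t)==8; }

  /* The copy helpers, by contrast, are declared plain "static" upstream.
   * The attached patch switches them to MEM_STATIC as well, so that the
   * compiler treats them as inlining candidates, roughly like this: */
  MEM_STATIC void ZSTD_copy4 (void* dst, const void* src) { memcpy(dst, src, 4); }
  MEM_STATIC void ZSTD_copy8 (void* dst, const void* src) { memcpy(dst, src, 8); }
  MEM_STATIC void ZSTD_copy16(void* dst, const void* src) { memcpy(dst, src, 16); }

  Note that MEM_STATIC only makes these helpers inlining candidates; whether the compiler actually inlines them still depends on the build options, which is presumably why the shipped and dkms builds generate such different code from the same sources.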
  - A small patch (attached) adding MEM_STATIC to ZSTD_copy(4,8,16) results in a further 12% decompression performance improvement (1249 MiB/s vs 1109 MiB/s for the unpatched zfs-dkms build):

  qt-everywhere-opensource-src-4.8.2.tar: (g=0): rw=randread, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
  fio-3.36
  Starting 1 process
  Jobs: 1 (f=1): [r(1)][100.0%][r=1253MiB/s][r=10.0k IOPS][eta 00m:00s]
  qt-everywhere-opensource-src-4.8.2.tar: (groupid=0, jobs=1): err= 0: pid=4694: Fri Feb 21 16:52:40 2025
    read: IOPS=9990, BW=1249MiB/s (1309MB/s)(5000MiB/4004msec)
      clat (usec): min=18, max=671, avg=99.67, stdev=66.00
       lat (usec): min=18, max=671, avg=99.70, stdev=66.00
      clat percentiles (usec):
       |  1.00th=[   21],  5.00th=[   23], 10.00th=[   23], 20.00th=[   25],
       | 30.00th=[   48], 40.00th=[   91], 50.00th=[  108], 60.00th=[  119],
       | 70.00th=[  133], 80.00th=[  153], 90.00th=[  167], 95.00th=[  188],
       | 99.00th=[  231], 99.50th=[  494], 99.90th=[  578], 99.95th=[  603],
       | 99.99th=[  627]
     bw (  MiB/s): min= 1234, max= 1258, per=100.00%, avg=1249.69, stdev= 7.86, samples=8
     iops        : min= 9876, max=10066, avg=9997.50, stdev=62.86, samples=8
    lat (usec)   : 20=0.17%, 50=30.22%, 100=15.52%, 250=53.30%, 500=0.32%
    lat (usec)   : 750=0.47%
    cpu          : usr=0.95%, sys=99.00%, ctx=7, majf=0, minf=41
    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       issued rwts: total=40000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=1

  Run status group 0 (all jobs):
     READ: bw=1249MiB/s (1309MB/s), 1249MiB/s-1249MiB/s (1309MB/s-1309MB/s), io=5000MiB (5243MB), run=4004-4004msec

  6. ********** Attached files *********:

  - Flame graphs of the fio command described in step 3, with both the shipped zfs.ko and the freshly built zfs.ko from zfs-dkms
  - zfs-shipped.ko and zfs-dkms.ko
  - patch0-MEM_STATIC-on-ZSTD_copy4-8-16.patch

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2099745/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp