https://bugs.launchpad.net/bugs/2099745

Title:
  Poor code generation in shipped zfs.ko resulting in ~5x zstd
  decompression (and likely compression too) slowdown. Freshly built
  zfs.ko from zfs-dkms is fine

Status in zfs-linux package in Ubuntu:
  New

Bug description:
  1. *********  System info *********:
    - lscpu | grep -E "Model name|Flags"

  Model name:                           Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  Flags:                                fpu vme de pse tsc msr pae mce cx8 apic 
sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good 
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl 
vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe 
popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch 
cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid 
ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap 
clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp 
hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities

    - lsb_release -rd

  Description:  Ubuntu 24.04.2 LTS
  Release:  24.04

  apt-cache policy linux-modules-6.8.0-53-generic zfs-dkms

  linux-modules-6.8.0-53-generic:
    Installed: 6.8.0-53.55
    Candidate: 6.8.0-53.55
    Version table:
   *** 6.8.0-53.55 500
          500 http://gb.archive.ubuntu.com/ubuntu noble-updates/main amd64 
Packages
          100 /var/lib/dpkg/status
  zfs-dkms:
    Installed: 2.2.2-0ubuntu9.1
    Candidate: 2.2.2-0ubuntu9.1
    Version table:
   *** 2.2.2-0ubuntu9.1 500
          500 http://gb.archive.ubuntu.com/ubuntu noble-updates/universe amd64 
Packages
          100 /var/lib/dpkg/status
       2.2.2-0ubuntu9 500
          500 http://gb.archive.ubuntu.com/ubuntu noble/universe amd64 Packages

  2. ************ Summary *************:
    - Poor code generation in the main zstd decompression loop of the shipped
      zfs.ko results in a ~5x decompression slowdown compared to a freshly
      built zfs.ko from zfs-dkms. The issue likely affects zstd compression as
      well.

  3. *********** Reproduction *********:
    - Create a test dataset with encryption disabled (to rule encryption out
      as a possible cause):
      sudo zfs create -o encryption=off -o recordsize=128k -o compression=zstd 
rpool/test

    - Get a compressible file of roughly 500MB (anything will do as long as it
      is compressible).
      For example: wget 
https://download.qt.io/archive/qt/4.8/4.8.2/qt-everywhere-opensource-src-4.8.2.tar.gz
      gzip -d qt-everywhere-opensource-src-4.8.2.tar.gz

    - Benchmark sequential reads with fio:
      mv -v qt-everywhere-opensource-src-4.8.2.tar 
qt-everywhere-opensource-src-4.8.2.tar.0.0
      zpool sync
      # Run the fio command a few times to ensure the data is in the ARC.
      fio --name=qt-everywhere-opensource-src-4.8.2.tar --bs=128k --rw=read 
--numjobs=1 --iodepth=1 --size=500M --group_reporting --loops=10

  4. ********** Results **********:
    - The fio command in step 3 benchmarks sequential cached reads from the ARC.
      We are therefore measuring single-threaded zstd decompression performance 
in zfs.ko.

    - With the shipped zfs.ko:

  qt-everywhere-opensource-src-4.8.2.tar: (g=0): rw=read, bs=(R) 128KiB-128KiB, 
(W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
  fio-3.36
  Starting 1 process
  Jobs: 1 (f=1): [R(1)][100.0%][r=236MiB/s][r=1886 IOPS][eta 00m:00s]
  qt-everywhere-opensource-src-4.8.2.tar: (groupid=0, jobs=1): err= 0: 
pid=4121: Thu Feb 27 15:50:38 2025
    read: IOPS=1625, BW=203MiB/s (213MB/s)(5000MiB/24610msec)
      clat (usec): min=18, max=4338, avg=613.58, stdev=469.32
       lat (usec): min=18, max=4339, avg=613.62, stdev=469.32
      clat percentiles (usec):
       |  1.00th=[   21],  5.00th=[   23], 10.00th=[   23], 20.00th=[   25],
       | 30.00th=[  208], 40.00th=[  523], 50.00th=[  668], 60.00th=[  766],
       | 70.00th=[  889], 80.00th=[ 1037], 90.00th=[ 1172], 95.00th=[ 1385],
       | 99.00th=[ 1713], 99.50th=[ 1811], 99.90th=[ 2180], 99.95th=[ 2409],
       | 99.99th=[ 2966]
     bw (  KiB/s): min=111360, max=355072, per=100.00%, avg=208216.82, 
stdev=83337.72, samples=49
     iops        : min=  870, max= 2774, avg=1626.69, stdev=651.08, samples=49
    lat (usec)   : 20=0.16%, 50=26.87%, 100=1.18%, 250=2.73%, 500=7.59%
    lat (usec)   : 750=19.73%, 1000=18.87%
    lat (msec)   : 2=22.64%, 4=0.25%, 10=0.01%
    cpu          : usr=0.20%, sys=99.50%, ctx=178, majf=15, minf=42
    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
       issued rwts: total=40000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=1

  Run status group 0 (all jobs):
     READ: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), 
io=5000MiB (5243MB), run=24610-24610msec

    - With zfs.ko from zfs-dkms:

  qt-everywhere-opensource-src-4.8.2.tar: (g=0): rw=read, bs=(R) 128KiB-128KiB, 
(W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
  fio-3.36
  Starting 1 process
  Jobs: 1 (f=1): [R(1)][100.0%][r=1152MiB/s][r=9219 IOPS][eta 00m:00s]
  qt-everywhere-opensource-src-4.8.2.tar: (groupid=0, jobs=1): err= 0: 
pid=692842: Fri Feb 21 13:11:39 2025
    read: IOPS=8871, BW=1109MiB/s (1163MB/s)(5000MiB/4509msec)
      clat (usec): min=9, max=770, avg=112.26, stdev=96.85
       lat (usec): min=9, max=770, avg=112.29, stdev=96.85
      clat percentiles (usec):
       |  1.00th=[   11],  5.00th=[   12], 10.00th=[   12], 20.00th=[   16],
       | 30.00th=[   23], 40.00th=[   25], 50.00th=[  122], 60.00th=[  151],
       | 70.00th=[  174], 80.00th=[  200], 90.00th=[  229], 95.00th=[  249],
       | 99.00th=[  322], 99.50th=[  553], 99.90th=[  644], 99.95th=[  676],
       | 99.99th=[  725]
     bw (  MiB/s): min=  936, max= 1466, per=100.00%, avg=1109.14, 
stdev=148.75, samples=9
     iops        : min= 7488, max=11734, avg=8873.11, stdev=1190.03, samples=9
    lat (usec)   : 10=0.10%, 20=20.52%, 50=22.47%, 100=3.51%, 250=48.45%
    lat (usec)   : 500=4.35%, 750=0.60%, 1000=0.01%
    cpu          : usr=0.67%, sys=99.29%, ctx=15, majf=0, minf=42
    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
       issued rwts: total=40000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=1

  Run status group 0 (all jobs):
     READ: bw=1109MiB/s (1163MB/s), 1109MiB/s-1109MiB/s (1163MB/s-1163MB/s), 
io=5000MiB (5243MB), run=4509-4509msec

    - There is therefore a ~5x decompression performance difference between the 
shipped zfs.ko and zfs.ko from zfs-dkms.
      As a baseline, I have built and benchmarked zstd 1.4.5 (the version used 
in ZFS) on the same file:

      $ ./zstd-git/programs/zstd --version
      *** zstd command line interface 64-bits v1.4.5, by Yann Collet ***

      $ ./zstd-git/programs/zstd -b3 -B128KB 
qt-everywhere-opensource-src-4.8.2.tar.0.0
      3#src-4.8.2.tar.0.0 : 647249920 -> 247589078 (2.614), 370.5 MB/s ,1861.8 
MB/s

  5. ********** Investigation ************:
    - I used perf to profile the fio command, both with the shipped zfs.ko and 
the freshly built zfs.ko from zfs-dkms.
      The two flame graphs are attached.

    - With both versions of zfs.ko, the majority of the time is spent inside 
ZSTD_decompressSequences_bmi2.constprop.0.
      As expected, in both cases the BMI2 version of ZSTD_decompressSequences 
is selected.

    - However, in the shipped zfs.ko the main loop of
      ZSTD_decompressSequences_bmi2.constprop.0 contains many calls to small
      functions that should have been inlined (MEM_64bits, MEM_32bits,
      BIT_reloadDStream, BIT_readBits, BIT_readBitsFast, ZSTD_copy16).
      MEM_64bits and MEM_32bits are particularly egregious, since their
      definitions are simply:

      MEM_STATIC unsigned MEM_32bits(void) { return sizeof(size_t)==4; }
      MEM_STATIC unsigned MEM_64bits(void) { return sizeof(size_t)==8; }
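
      For context, MEM_STATIC is zstd's marker for small always-inline
      helpers. Paraphrased from lib/common/mem.h in zstd 1.4.5 (the exact
      preprocessor guards differ slightly, so treat this as an approximation):

      /* MEM_STATIC makes the helper a static inline function; under GCC it
         also carries __attribute__((unused)) to silence warnings in files
         that do not use it. */
      #if defined(__GNUC__)
      #  define MEM_STATIC static __inline __attribute__((unused))
      #else
      #  define MEM_STATIC static inline
      #endif

      Once these helpers are inlined, MEM_32bits()/MEM_64bits() reduce to the
      constants 0 and 1, and the branches that test them disappear entirely.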

    - In the freshly built zfs.ko from zfs-dkms these small functions have been 
inlined.
      This has allowed the compiler to actually make use of BMI2 instructions 
(shlx and shrx).
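
      To illustrate where those helpers sit on the hot path: the sequence
      decode loop guards its bit-stream reloads with them, roughly as follows
      (simplified from zstd's ZSTD_decodeSequence(); treat the exact reads and
      conditions as illustrative, not a verbatim quote):

      /* Each bit-stream read is followed by a reload that is only needed on
         32-bit targets. With MEM_32bits() inlined it is the constant 0 on
         x86-64, so the compiler deletes both the branch and the reload; left
         as an out-of-line call, it is executed for every decoded sequence. */
      seq.matchLength = ML_base[mlCode] + BIT_readBitsFast(&seqState->DStream, mlBits);
      if (MEM_32bits())
          BIT_reloadDStream(&seqState->DStream);
      seq.litLength = LL_base[llCode] + BIT_readBitsFast(&seqState->DStream, llBits);
      if (MEM_32bits())
          BIT_reloadDStream(&seqState->DStream);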

    - As far as I can tell, all shipped zfs.ko builds are affected by this
      problem. I have looked at zfs.ko in:
      - 5.15.0-133-generic
      - 6.8.0-56-generic
      - 6.11.0-18-generic
      - 6.12.0-15-generic
      and they all appear to have the same issue.

    - Zstd compression is also likely affected, since the main loop of
      ZSTD_encodeSequences_bmi2 appears to have the same issue. I have not
      tested it, however.

    - There remain a few calls to ZSTD_copy4, ZSTD_copy8 and ZSTD_copy16 that
      should be inlined. A small patch (attached) adding MEM_STATIC to
      ZSTD_copy4/8/16 results in a 12% decompression performance improvement
      (a sketch of the change is shown after the fio output below):

  qt-everywhere-opensource-src-4.8.2.tar: (g=0): rw=read, bs=(R) 128KiB-128KiB, 
(W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
  fio-3.36
  Starting 1 process
  Jobs: 1 (f=1): [R(1)][100.0%][r=1304MiB/s][r=10.4k IOPS][eta 00m:00s]
  qt-everywhere-opensource-src-4.8.2.tar: (groupid=0, jobs=1): err= 0: 
pid=31764: Thu Feb 27 15:45:04 2025
    read: IOPS=10.1k, BW=1257MiB/s (1318MB/s)(5000MiB/3977msec)
      clat (usec): min=17, max=686, avg=98.69, stdev=66.19
       lat (usec): min=17, max=686, avg=98.72, stdev=66.19
      clat percentiles (usec):
       |  1.00th=[   21],  5.00th=[   22], 10.00th=[   22], 20.00th=[   23],
       | 30.00th=[   47], 40.00th=[   89], 50.00th=[  105], 60.00th=[  118],
       | 70.00th=[  133], 80.00th=[  153], 90.00th=[  165], 95.00th=[  190],
       | 99.00th=[  237], 99.50th=[  482], 99.90th=[  570], 99.95th=[  586],
       | 99.99th=[  627]
     bw (  MiB/s): min= 1172, max= 1383, per=100.00%, avg=1266.07, stdev=87.32, 
samples=7
     iops        : min= 9380, max=11064, avg=10128.57, stdev=698.54, samples=7
    lat (usec)   : 20=0.62%, 50=29.98%, 100=16.23%, 250=52.35%, 500=0.40%
    lat (usec)   : 750=0.44%
    cpu          : usr=0.73%, sys=99.22%, ctx=31, majf=15, minf=41
    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
       issued rwts: total=40000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=1

  Run status group 0 (all jobs):
     READ: bw=1257MiB/s (1318MB/s), 1257MiB/s-1257MiB/s (1318MB/s-1318MB/s), 
io=5000MiB (5243MB), run=3977-3977msec
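
    - For reference, the attached patch is tiny; the change is essentially of
      the following shape (a sketch only; the exact definitions and file
      locations are taken from the upstream zstd 1.4.5 sources and may differ
      slightly in the ZFS-vendored copy):

      /* Before: plain static, which the compiler is free to keep as
         out-of-line calls inside the copy loops. */
      static void ZSTD_copy8 (void* dst, const void* src) { memcpy(dst, src, 8); }
      static void ZSTD_copy16(void* dst, const void* src) { memcpy(dst, src, 16); }

      /* After: MEM_STATIC (i.e. static inline), matching the other small
         helpers, so each 4/8/16-byte copy collapses into a couple of plain
         loads and stores. */
      MEM_STATIC void ZSTD_copy8 (void* dst, const void* src) { memcpy(dst, src, 8); }
      MEM_STATIC void ZSTD_copy16(void* dst, const void* src) { memcpy(dst, src, 16); }

      ZSTD_copy4, defined next to the block-decompression code, gets the same
      treatment.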

  6. ********** Attached files *********:
    - Flame graphs of the fio command from step 3, with both the shipped
      zfs.ko and the freshly built zfs.ko from zfs-dkms
    - zfs-shipped.ko and zfs-dkms.ko
    - patch0-MEM_STATIC-on-ZSTD_copy4-8-16.patch
