I completely agree that the test I ran is not suitable for testing Ceph performance. I merely ran the same command as the OP and got very different results. I suspect the performance difference is due much more to things like network, OS config, memory constraints, etc. But that needs a rigorous set of tests on the infrastructure before looking at Ceph itself.

For example, we found that many 10g NICs don't handle full-sized jumbo frames (9000) without a performance hit. We got a huge performance improvement by finding the exact optimal MTU (5139 on a cluster I am looking at - at 5140 it falls off a cliff). Just one example.
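
As a rough sketch of the first step in such an MTU sweep: before measuring throughput at a given MTU it helps to confirm that frames of that size actually pass unfragmented. The helper below only does the arithmetic (IPv4 header 20 bytes, ICMP header 8 bytes) and builds a Linux iputils ping command with Don't Fragment set. The hostname is a made-up placeholder, and this checks reachability only; the performance cliff itself would still have to be found by benchmarking at each MTU.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(host, mtu, count=3):
    """Build a Linux iputils ping command with Don't Fragment set (-M do)."""
    payload = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-c", str(count), "-s", str(payload), host]


if __name__ == "__main__":
    # "ceph-node1.example" is a hypothetical host name, not from the thread.
    for mtu in (1500, 5139, 5140, 9000):
        print(mtu, " ".join(ping_command("ceph-node1.example", mtu)))
```

If the ping at a given size fails with "message too long", some hop on the path has a smaller MTU than the one configured on the interface.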

Chris

On 25/03/2025 13:35, Konold, Martin wrote:
On 2025-03-20 15:15, Chris Palmer wrote:

Hi,

 * Ceph cluster 19.2.1 with 3 nodes, 4 x SATA disks with shared NVMe
   DB/WAL, single 10g NICs
 * Proxmox 8.3.5 cluster with 2 nodes (separate nodes to Ceph), single
   10g NICs, single 1g NICs for corosync
 * Test VM was using KRBD R3 pool on HDD, iothread=1, aio=io_uring,
   cache=writeback

With "cache=writeback" the test basically measures the memory bandwidth of the KVM host, not Ceph performance. The effect gets even stronger with --size=1G instead of a more realistic 10G.

Please remove the writeback cache and increase the file size to at least 10G in order to measure Ceph RBD performance.

Last but not least, please drop the caches before each run so that the benchmark is independent of previous runs.

Overview Results for blocksize=4K
=================================

NVMe with cache=none (10 GB Fibre)      |  read: IOPS = 12.9k, BW =  50.3 MiB/s
NVMe with cache=writeback (10 GB Fibre) |  read: IOPS = 58.3k, BW = 228.0 MiB/s
HDD with cache=none (10 GB Fibre)       |  read: IOPS = 0.34k, BW =   1.3 MiB/s
HDD with cache=writeback (10 GB Fibre)  |  read: IOPS = 53.8k, BW = 210.0 MiB/s
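
As a quick sanity check on how the IOPS and BW columns relate at a fixed 4K block size, here is a minimal sketch. The read count and runtime are copied from the "issued rwts" and run time lines of the NVMe cache=none report in the detailed section; the function name is just an illustrative choice.

```python
BLOCK_SIZE = 4096      # bytes, from --bs=4k
MIB = 1024 * 1024


def bandwidth_mib_s(iops, block_size=BLOCK_SIZE):
    """Bandwidth in MiB/s implied by an IOPS figure at a fixed block size."""
    return iops * block_size / MIB


if __name__ == "__main__":
    # NVMe cache=none: 772457 reads completed in 60001 msec.
    iops = 772457 / 60.001
    print(f"IOPS {iops:.0f}, BW {bandwidth_mib_s(iops):.1f} MiB/s")

    # Ratio between the two HDD rows of the overview (53.8k vs 335 IOPS).
    print(f"HDD writeback / none: {53.8e3 / 335:.0f}x")
```

A ratio of that size between two runs on the same HDD pool is itself a strong hint that the writeback numbers are not coming from the disks.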


Detailed reports
================
NVMe with cache=none
# echo 3 > /proc/sys/vm/drop_caches ; fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --size=10G --runtime=60 --group_reporting --iodepth=16
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.36
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=39.0MiB/s][r=9981 IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=443550: Tue Mar 25 13:25:10 2025
  read: IOPS=12.9k, BW=50.3MiB/s (52.7MB/s)(3017MiB/60001msec)
    slat (usec): min=2, max=13134, avg=307.13, stdev=154.43
    clat (usec): min=2, max=28147, avg=4662.15, stdev=1048.17
     lat (usec): min=384, max=28432, avg=4969.28, stdev=1106.45
    clat percentiles (usec):
     |  1.00th=[ 2737],  5.00th=[ 3261], 10.00th=[ 3589], 20.00th=[ 4080],
     | 30.00th=[ 4293], 40.00th=[ 4424], 50.00th=[ 4555], 60.00th=[ 4621],
     | 70.00th=[ 4817], 80.00th=[ 5080], 90.00th=[ 5932], 95.00th=[ 6325],
     | 99.00th=[ 8225], 99.50th=[ 9634], 99.90th=[12911], 99.95th=[14353],
     | 99.99th=[17957]
   bw (  KiB/s): min=26816, max=58136, per=100.00%, avg=51622.66, stdev=1587.61, samples=476
   iops        : min= 6704, max=14534, avg=12905.66, stdev=396.90, samples=476
  lat (usec)   : 4=0.01%, 10=0.01%, 500=0.01%, 1000=0.01%
  lat (msec)   : 2=0.03%, 4=18.10%, 10=81.47%, 20=0.40%, 50=0.01%
  cpu          : usr=1.52%, sys=8.43%, ctx=726528, majf=0, minf=109
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=772457,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=50.3MiB/s (52.7MB/s), 50.3MiB/s-50.3MiB/s (52.7MB/s-52.7MB/s), io=3017MiB (3164MB), run=60001-60001msec

Disk stats (read/write):
  sdc: ios=725474/0, sectors=5803792/0, merge=0/0, ticks=223185/0, in_queue=223185, util=99.89%

NVMe with cache=writeback
# echo 3 > /proc/sys/vm/drop_caches ; fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --size=10G --runtime=60 --group_reporting --iodepth=16
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.36
Starting 4 processes
Jobs: 3 (f=3): [r(3),_(1)][65.9%][r=31.9MiB/s][r=8165 IOPS][eta 00m:31s]
registry-read: (groupid=0, jobs=4): err= 0: pid=442595: Tue Mar 25 13:19:11 2025
  read: IOPS=58.3k, BW=228MiB/s (239MB/s)(13.3GiB/60001msec)
    slat (usec): min=2, max=5185, avg=58.95, stdev=131.23
    clat (nsec): min=1440, max=15294k, avg=908789.46, stdev=1764896.35
     lat (usec): min=3, max=15367, avg=967.74, stdev=1880.61
    clat percentiles (usec):
     |  1.00th=[   55],  5.00th=[   58], 10.00th=[   67], 20.00th=[  117],
     | 30.00th=[  133], 40.00th=[  165], 50.00th=[  196], 60.00th=[  239],
     | 70.00th=[  347], 80.00th=[  873], 90.00th=[ 4146], 95.00th=[ 6063],
     | 99.00th=[ 6652], 99.50th=[ 6980], 99.90th=[ 8717], 99.95th=[ 9503],
     | 99.99th=[11076]
   bw (  KiB/s): min=203632, max=602000, per=100.00%, avg=387963.51, stdev=30229.81, samples=420
   iops        : min=50908, max=150500, avg=96990.60, stdev=7557.47, samples=420
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=15.15%, 250=46.78%, 500=10.58%, 750=5.17%, 1000=6.62%
  lat (msec)   : 2=3.90%, 4=1.46%, 10=10.29%, 20=0.03%
  cpu          : usr=2.96%, sys=15.72%, ctx=1029435, majf=0, minf=108
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=3499461,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=228MiB/s (239MB/s), 228MiB/s-228MiB/s (239MB/s-239MB/s), io=13.3GiB (14.3GB), run=60001-60001msec

Disk stats (read/write):
  sdb: ios=1028680/0, sectors=8229440/0, merge=0/0, ticks=180767/0, in_queue=180766, util=99.60%

HDD with cache=none
# echo 3 > /proc/sys/vm/drop_caches ; fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --size=10G --runtime=60 --group_reporting --iodepth=16
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.36
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=1033KiB/s][r=258 IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=442579: Tue Mar 25 13:15:51 2025
  read: IOPS=335, BW=1340KiB/s (1373kB/s)(78.6MiB/60011msec)
    slat (usec): min=160, max=361240, avg=11930.20, stdev=15819.52
    clat (usec): min=4, max=957176, avg=178747.09, stdev=89162.82
     lat (msec): min=5, max=977, avg=190.68, stdev=93.50
    clat percentiles (msec):
     |  1.00th=[   86],  5.00th=[  101], 10.00th=[  109], 20.00th=[  121],
     | 30.00th=[  130], 40.00th=[  140], 50.00th=[  150], 60.00th=[  163],
     | 70.00th=[  180], 80.00th=[  218], 90.00th=[  300], 95.00th=[  372],
     | 99.00th=[  514], 99.50th=[  558], 99.90th=[  726], 99.95th=[  802],
     | 99.99th=[  885]
   bw (  KiB/s): min=  248, max= 2160, per=99.89%, avg=1339.50, stdev=117.47, samples=476
   iops        : min=   62, max=  540, avg=334.87, stdev=29.37, samples=476
  lat (usec)   : 10=0.02%
  lat (msec)   : 10=0.01%, 20=0.01%, 50=0.01%, 100=4.87%, 250=79.56%
  lat (msec)   : 500=14.36%, 750=1.08%, 1000=0.08%
  cpu          : usr=0.06%, sys=0.27%, ctx=20110, majf=0, minf=110
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=99.7%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=20110,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=1340KiB/s (1373kB/s), 1340KiB/s-1340KiB/s (1373kB/s-1373kB/s), io=78.6MiB (82.4MB), run=60011-60011msec

Disk stats (read/write):
  sdd: ios=20096/0, sectors=160768/0, merge=0/0, ticks=238953/0, in_queue=238952, util=99.89%

HDD with cache=writeback
# echo 3 > /proc/sys/vm/drop_caches ; fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --size=10G --runtime=60 --group_reporting --iodepth=16
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.36
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=210MiB/s][r=53.7k IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=943730: Thu Mar 20 14:51:33 2025
  read: IOPS=53.8k, BW=210MiB/s (220MB/s)(12.3GiB/60001msec)
    slat (usec): min=26, max=4995, avg=71.31, stdev=21.64
    clat (usec): min=3, max=8707, avg=1116.26, stdev=141.55
     lat (usec): min=79, max=8769, avg=1187.56, stdev=148.56
    clat percentiles (usec):
     |  1.00th=[  938],  5.00th=[  979], 10.00th=[ 1004], 20.00th=[ 1029],
     | 30.00th=[ 1045], 40.00th=[ 1074], 50.00th=[ 1090], 60.00th=[ 1106],
     | 70.00th=[ 1139], 80.00th=[ 1188], 90.00th=[ 1254], 95.00th=[ 1336],
     | 99.00th=[ 1582], 99.50th=[ 1811], 99.90th=[ 2474], 99.95th=[ 2802],
     | 99.99th=[ 3982]
   bw (  KiB/s): min=167800, max=230352, per=100.00%, avg=215502.18, stdev=2114.67, samples=476
   iops        : min=41950, max=57588, avg=53875.55, stdev=528.67, samples=476
  lat (usec)   : 4=0.01%, 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=10.02%
  lat (msec)   : 2=89.68%, 4=0.28%, 10=0.01%
  cpu          : usr=4.83%, sys=37.00%, ctx=3232089, majf=0, minf=101
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=3230027,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=210MiB/s (220MB/s), 210MiB/s-210MiB/s (220MB/s-220MB/s), io=12.3GiB (13.2GB), run=60001-60001msec

Disk stats (read/write):
  sdd: ios=3224017/2, sectors=25792136/3, merge=0/0, ticks=168114/14, in_queue=168141, util=99.03%

This was hdd (3/2 replication).
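
One way to cross-check the reports above is Little's law: with a constant number of I/Os in flight, mean IOPS should be roughly the queue depth divided by the mean total latency. The sketch below assumes all 4 jobs keep 16 requests outstanding for the whole run (64 in flight) and uses the mean "lat" and IOPS values copied from each report; the function and variable names are illustrative only.

```python
IN_FLIGHT = 4 * 16  # --numjobs=4 times --iodepth=16


def predicted_iops(in_flight, mean_latency_s):
    """Little's law: throughput = requests in flight / mean latency."""
    return in_flight / mean_latency_s


# (label, mean "lat" in seconds, IOPS reported by fio)
RUNS = [
    ("NVMe cache=none", 4969.28e-6, 12.9e3),
    ("NVMe cache=writeback", 967.74e-6, 58.3e3),
    ("HDD cache=none", 190.68e-3, 335),
    ("HDD cache=writeback", 1187.56e-6, 53.8e3),
]

if __name__ == "__main__":
    for label, latency, reported in RUNS:
        pred = predicted_iops(IN_FLIGHT, latency)
        diff = (pred - reported) / reported * 100
        print(f"{label:22s} predicted {pred:8.0f}  reported {reported:8.0f}  ({diff:+.1f}%)")
```

Three of the four runs land within about 1% of the reported IOPS. The NVMe cache=writeback run is the outlier, which would be consistent with the queue not staying full for the whole 60 seconds (its status line shows only 3 of 4 jobs still running), though that is a guess from the output rather than something measured.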


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io