We have deployed a small test cluster consisting of three nodes. Each node runs 
a mon/mgr and two OSDs (a Samsung PM983 3.84TB NVMe split into two partitions), 
so six OSDs in total. We started with Ceph 14.2.7 some weeks ago (later upgraded 
to 14.2.9) and ran various fio tests against some rbd volumes to get an overview 
of the performance we could expect. The configuration is unchanged from the 
defaults, apart from setting several debugging options to 0/0.
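
For illustration, disabling the debug output via the centralized config looks 
roughly like this; the subsystems shown are only examples, not necessarily the 
exact set we changed:

  ceph config set global debug_ms 0/0        # messenger logging off
  ceph config set osd debug_osd 0/0          # OSD logging off
  ceph config set osd debug_bluestore 0/0    # BlueStore logging off
  ceph config set client debug_rbd 0/0       # librbd logging off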

Yesterday we upgraded the whole cluster to Ceph 15.2.3 following the upgrade 
guidelines, which worked without any problems so far. However, when running the 
same tests as before on 14.2.9, we are seeing clear degradations in write 
performance (alongside some improvements, which are also listed below).

Here are the results of concern, each with the relevant fio settings used (a 
full fio command line for the usecase1 test is sketched after the list):

Test "read-latency-max"
(rw=randread, iodepth=64, bs=4k)
read_iops: 32500 -> 87000

Test "write-latency-max"
(rw=randwrite, iodepth=64, bs=4k)
write_iops: 22500 -> 11500

Test "write-throughput-iops-max"
(rw=write, iodepth=64, bs=4k)
write_iops: 7000 -> 14000

Test "usecase1"
(rw=randrw, 
bssplit=4k/40:8k/5:16k/20:32k/5:64k/10:128k/10:256k/,4k/50:8k/20:16k/20:32k/5:64k/2:128k/:256k/,
 rwmixread=1, rate_process=poisson, iodepth=64)
write_iops: 21000 -> 8500

Test "usecase1-readonly"
(rw=randread, bssplit=4k/40:8k/5:16k/20:32k/5:64k/10:128k/10:256k/, 
rate_process=poisson, iodepth=64)
read_iops: 28000 -> 58000
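
For completeness, the usecase1 job above corresponds to roughly the following 
fio command line against the kernel-mapped volume. Only the parameters quoted 
above are exact; the device path, job name, runtime and the libaio/direct 
settings are assumptions based on our setup:

  # usecase1: write-heavy mixed workload (rwmixread=1 => ~1% reads)
  fio --name=usecase1 --filename=/dev/rbd0 \
      --ioengine=libaio --direct=1 \
      --rw=randrw --rwmixread=1 --rate_process=poisson --iodepth=64 \
      --bssplit=4k/40:8k/5:16k/20:32k/5:64k/10:128k/10:256k/,4k/50:8k/20:16k/20:32k/5:64k/2:128k/:256k/ \
      --runtime=300 --time_based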

The last two tests represent a typical use case on our systems. Therefore we 
are especially concerned about the drop from 21000 to 8500 write IOPS (about 
60%) after upgrading to Ceph 15.2.3.

We ran all tests several times; the values are averaged over all iterations and 
are fairly consistent and reproducible. We even tried wiping the whole cluster, 
downgrading to Ceph 14.2.9, setting up a new cluster/pool, running the tests, 
and then upgrading to Ceph 15.2.3 again. The tests were performed on one of the 
three cluster nodes against a 50G rbd volume that had been prefilled with 
random data before each test run.
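
For reference, prefilling the mapped volume with random data can be done with a 
sequential fio write along these lines (a sketch only; the device path, block 
size and queue depth are placeholders and may not match our exact prefill job):

  # Fill the 50G volume once with (pseudo-)random buffer contents
  fio --name=prefill --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=write --bs=4M --iodepth=16 --refill_buffers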

Have any changes been introduced with Octopus that could explain the observed 
changes in performance?

What we already tried (roughly via the commands sketched after this list):

- Disabling rbd cache
- Reverting the rbd cache policy to writeback (default in 14.2)
- Setting rbd io scheduler to none
- Deploying a fresh cluster starting with Ceph 15.2.3
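
Roughly, those steps looked like this (the client config scope and the rbd0 
device name are assumptions based on our setup):

  # Disable the librbd cache entirely
  ceph config set client rbd_cache false
  # Revert the cache policy to the 14.2 default mentioned above
  ceph config set client rbd_cache_policy writeback
  # Run the mapped rbd block device without an I/O scheduler
  echo none > /sys/block/rbd0/queue/scheduler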

The kernel is 5.4.38 … I don't know whether other system specs would be helpful 
beyond those already mentioned (since we are talking about a relative change in 
performance after upgrading Ceph, with no other changes) - if so, please let us 
know.