Hi,

We’ve been running an all-SSD Ceph cluster for a few months now and are 
generally very happy with it.

However, we’ve noticed that if we create a snapshot of an RBD device, then 
writing to the RBD goes massively slower than before we took the snapshot.  
Similarly, we get poor performance if we make a clone of that snapshot and 
write to it.

For example, using fio to run a 2-worker 4kb synchronous random write 
benchmark, we normally get about 5000 IOPS to RBD on our test-sized cluster 
(Intel 3710 SSDs, 10G networking, Ubuntu 14.04).  But as soon as I take a 
snapshot, this drops to about 100 IOPS, with high variability: at times 0, 
60, or 300 IOPS.
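For reference, the benchmark is roughly equivalent to a fio job file like the 
one below (a sketch only: the pool/image names and the choice of fio's rbd 
ioengine are my assumptions, and iodepth=1 stands in for "synchronous"):

```ini
; hypothetical fio job -- pool/image names are placeholders
[global]
ioengine=rbd
clientname=admin
pool=rbdbench
rbdname=testimg
rw=randwrite
bs=4k
direct=1
iodepth=1
numjobs=2
runtime=60
group_reporting

[rbd-4k-randwrite]
```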

I realise that after a snapshot, any write will trigger a copy of the whole 
object, which by default is 4 MB of data.  To minimise this effect I’ve 
reduced the RBD order to 18, i.e. 256 KB objects.
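As a back-of-envelope check on the order parameter (this is just arithmetic 
on the object size, not measured data):

```python
# RBD object size is 2**order bytes, so reducing the order shrinks the
# amount of data that must be copied on the first write to each object
# after a snapshot.

def object_size(order: int) -> int:
    """RBD object size in bytes for a given order."""
    return 1 << order

default = object_size(22)   # default order 22 -> 4 MiB objects
reduced = object_size(18)   # order 18 -> 256 KiB objects

# Rough copy-on-write amplification for a 4 KiB write that lands in a
# not-yet-copied object: the whole object is copied first.
print(default // 4096)  # 1024x with 4 MiB objects
print(reduced // 4096)  # 64x with 256 KiB objects
```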

But shouldn’t that effect only degrade it to the same performance as we get on 
a completely new RBD image that has no snapshots and no data?  For us that is 
more like 1000-1500 IOPS, i.e. still at least 10x better than the performance 
we get after a snapshot is taken.

Is there something particularly inefficient about the copy-on-write 
implementation that makes it much worse than writing to fresh objects?  Note 
that we get this performance drop even when the other data in the objects is 
cached in memory, and since we’re using fast SSDs, the time to read in the 
rest of the 256 KB should be negligible.

We’re currently using Hammer but we also tested with Infernalis and it didn’t 
seem any better.

Cheers,
Will
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com