Hello, 

Nothing new, I know. But some numbers to mull and ultimately weep over.

Ceph cluster based on Debian Jessie (thus ceph 0.72.x), 2 nodes with 2 OSDs
each.
InfiniBand 4x QDR, IPoIB interconnects, 1 GByte/s bandwidth end to end.
There was nothing else going on aside from the tests.
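
If anybody wants to verify their own IPoIB link the same way, something
along these lines will do (the hostname below is made up, substitute your
own IPoIB interface address):
---
# on the receiving node:
iperf -s
# on the sending node, 4 parallel streams for 30 seconds (hostname is an example):
iperf -c ceph-02-ib -P 4 -t 30
---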

I'm just going to use the bonnie++ throughput values to keep this simple
and short.
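
For anyone wanting to compare numbers, a bonnie++ run of this general form
(the target directory is just an example; -s should be at least twice the
RAM of the machine so the page cache doesn't flatter the results):
---
bonnie++ -d /mnt/test -s 256g -n 0 -u root
---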

On the OSD itself:
---
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ceph-01         64G           1309731  96 467763  51           1703299  79 784.0  32
---

On a compute node, host side:
---
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
comp-02        256G           296928  60 64216  16           145015  17 291.6  12
---
Ouch. Well, the write speed is probably down to the OSD journal SSDs being
hobbled by the SATA-2 links of the onboard AMD chipset. I had planned for
that shortcoming, alas the cheap and cheerful Marvell 88SE9230 based
PCIe x4 controller can't get a stable link under any Linux kernel I tried.
OTOH, I don't expect more than 30MB/s of average writes for all the VMs
combined.
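
(If you want to check what link speed your own journal SSDs actually
negotiated, the kernel log has it; device names are just examples:)
---
# SATA-2 links show up as 3.0 Gbps, SATA-3 as 6.0 Gbps
dmesg | grep -i "SATA link up"
# or per device, e.g. for sda:
smartctl -i /dev/sda | grep -i "SATA Version"
---
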
Despite being aware of the sequential read speed issues, I was still really
disappointed here: about 10% of a single OSD. The OSD processes and actual
disks were bored stiff during the read portion of that bonnie run.
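
(Watching the OSD nodes with iostat from sysstat during the read phase is
enough to see it; %util on the OSD disks tells the story:)
---
# extended per-device statistics, refreshed every 5 seconds
iostat -x 5
---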

OK, let's increase the read-ahead (doing the same on the OSDs had no or
negative effects, FYI, since I've seen that suggested a few times as well).
So after an "echo 4096 > /sys/block/vda/queue/read_ahead_kb" we get:
---
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
comp-02        256G           280277  44 158633  30           655827  46 577.9  17
---
Better, not great, but certainly around what I expected.
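
If host-side read-ahead turns out to be the way to go, a udev rule will
make it persistent instead of re-running that echo after every reboot.
Untested sketch, matching virtio disks as in the echo above; adjust the
KERNEL pattern to whatever devices apply:
---
# /etc/udev/rules.d/80-readahead.rules (file name is arbitrary)
SUBSYSTEM=="block", KERNEL=="vd[a-z]", ACTION=="add|change", ATTR{queue/read_ahead_kb}="4096"
---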

So let's see how this looks inside a VM (Wheezy). This is Ganeti on Jessie,
thus no qemu caching and kernelspace RBD (no qemu with userspace RBD
support outside sid/experimental yet):
---
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
fp-001           8G           170374  29 27599   7           34059   5 328.0  12
---
Le mega ouch. So writes are down to about 10% of the OSD, and the reads
are... deplorable.
Setting the read-ahead inside the VM to 4MB gives us about 380MB/s reads,
so in line with the writes, that is roughly half of the host speed.
I will test this again with userspace qemu when it becomes available.
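
For when userspace qemu does become usable here, the client-side RBD cache
settings in ceph.conf are the obvious knobs to try first. Roughly along
these lines, values purely illustrative, not recommendations:
---
[client]
    rbd cache = true
    # defaults are 32MB cache / 24MB max dirty, if memory serves
    rbd cache size = 67108864
    rbd cache max dirty = 50331648
    rbd cache writethrough until flush = true
---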

However, setting the read-ahead inside the VM may not be a feasible option,
be it due to lack of access to the VM, it being upgraded, etc.
Something more transparent that can be controlled by the people running
the host or the Ceph cluster is definitely needed:
https://wiki.ceph.com/Planning/Blueprints/Emperor/Kernel_client_read_ahead_optimization

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/