Hello,

Nothing new, I know. But some numbers to mull and ultimately weep over.

Ceph cluster based on Debian Jessie (thus ceph 0.72.x), 2 nodes with 2
OSDs each. Infiniband 4x QDR, IPoIB interconnects, 1 GByte/s bandwidth
end to end. There was nothing going on aside from the tests. I'm just
going to use the bonnie++ throughput values to keep it simple and
short.

On the OSD itself:
---
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ceph-01         64G           1309731  96 467763  51           1703299  79 784.0  32
---

On a compute node, host side:
---
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
comp-02        256G            296928  60  64216  16            145015  17 291.6  12
---

Ouch.

The write speed is probably down to the OSD journal SSDs being hobbled
by sitting on SATA-2 links of the onboard AMD chipset. I had planned
for that shortcoming, alas the cheap and cheerful Marvell 88SE9230
based PCIe x4 controller can't get a stable link under any Linux
kernel I tried. OTOH, I don't expect more than 30MB/s average writes
for all the VMs combined.

Despite having been aware of the sequential read speed issues, I
really was disappointed here: 10% of a single OSD. The OSD processes
and the actual disks were bored stiff during the read portion of that
bonnie run.

OK, let's increase the read-ahead (doing the same on the OSDs had no
or negative effects, FYI, since I've seen that suggested a few times
as well). So after a

  echo 4096 > /sys/block/vda/queue/read_ahead_kb

we get:
---
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
comp-02        256G            280277  44 158633  30            655827  46 577.9  17
---

Better. Not great, but certainly around what I expected.

So let's see how this looks inside a VM (Wheezy). This is ganeti on
jessie, thus no qemu caching and kernelspace RBD (there is no qemu
with userspace RBD support outside sid/experimental yet):
---
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
fp-001           8G            170374  29  27599   7             34059   5 328.0  12
---

Le mega ouch. So writes are down to 10% of the OSD and the reads
are... deplorable.

Setting the read-ahead inside the VM to 4MB as well gives us about
380MB/s reads, so in line with the writes, that is about half of the
host speed. I will test this with userspace qemu when it becomes
available.

However, setting the read-ahead inside the VM may not be a feasible
option, be it for lack of access to the VM, the VM being upgraded,
etc. Something more transparent that can be controlled by the people
running the host or the ceph cluster is definitely needed:
https://wiki.ceph.com/Planning/Blueprints/Emperor/Kernel_client_read_ahead_optimization

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
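
P.S.: Some command sketches for anyone poking at the same spots. These
are illustrative, not verbatim copies of what I ran, so adjust device
names, hostnames and paths to taste.

The bonnie++ tables above have the per-char columns blank, i.e. fast
mode. An invocation along these lines (with -s at least twice the RAM
of the machine in question, -n 0 to skip the file creation tests)
produces that output format:

  bonnie++ -u root -d /mnt/test -s 64g -n 0 -f -m ceph-01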
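
The 1 GByte/s end-to-end figure for the IPoIB interconnect can be
(re)checked with plain iperf; a few parallel streams help with
saturating IPoIB ("ceph-01-ib" standing in for the IPoIB hostname):

  # on ceph-01:
  iperf -s
  # on comp-02:
  iperf -c ceph-01-ib -P 4 -t 30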
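
To see which link speed a journal SSD actually negotiated on the AMD
chipset, smartctl will tell you (device name just an example):

  smartctl -i /dev/sdb
  # check the "SATA Version is:" line; "current: 3.0 Gb/s" means the
  # drive sits on a SATA-2 link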
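
The "bored stiff" observation comes from watching the OSD nodes with
iostat (sysstat package) while the bonnie read pass runs on the
client:

  iostat -x 2
  # r/s and %util on the OSD data disks stayed close to idle here,
  # so the bottleneck is on the client path, not the OSDs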
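
Lastly, for VMs one does control, the read-ahead echo can at least be
made persistent with a udev rule inside the guest (file name is
arbitrary) instead of an rc.local hack:

  # /etc/udev/rules.d/85-readahead.rules
  ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{queue/read_ahead_kb}="4096"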
Ceph cluster based on Debian Jessie (thus ceph 0.72.x), 2 nodes, 2 OSDs each. Infiniband 4xQDR, IPoIB interconnects, 1 GByte/s bandwidth end to end. There was nothing going on aside from the tests. Just going to use the bonnie++ values for throughput to keep it simple and short. On the OSD itself: --- Version 1.97 ------Sequential Output------ --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP ceph-01 64G 1309731 96 467763 51 1703299 79 784.0 32 ---- On a compute node, host side: --- Version 1.97 ------Sequential Output------ --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP comp-02 256G 296928 60 64216 16 145015 17 291.6 12 --- Ouch. Well the write speed is probably the OSD journal SSDs being hobbled by being on SATA-2 links of the onboard AMD chipset. I had planned for that shortcoming, alas the cheap and cheerful Marvell 88SE9230 based PCIex4 controller can't get a stable link under any linux kernel I tried. OTOH, I don't expect more than 30MB/s average writes for all the VMs combined. Despite having been aware of the sequential read speed issues, I really was disappointed here. 10% of a single OSD. The OSD processes and actual disks were bored stiff during the read portion of that bonnie run. OK, lets increase read-ahead (no or negative effects on the OSDs, FYI since I've seen that mentioned a few times as well. So after a "echo 4096 > /sys/block/vda/queue/read_ahead_kb" we get: --- Version 1.97 ------Sequential Output------ --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP comp-02 256G 280277 44 158633 30 655827 46 577.9 17 --- Better, not great, but certainly around what I expected. So lets see how this is inside a VM (Wheezy). This is ganeti on jessie, thus no qemu caching and kernelspace RBD (no qemu with userspace support outside sid/experimental yet): --- Version 1.96 ------Sequential Output------ --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP fp-001 8G 170374 29 27599 7 34059 5 328.0 12 --- Le mega ouch. So writes are down to 10% of the OSD and the reads are... deplorable. Setting the read-ahead inside the VM to 4MB gives us about 380MB/s reads, so in line with the writes, that is half of the host speed. I will test this with userspace qemu when available. However setting the read-ahead may not be a feasible option, be it access to the VM, it being upgraded, etc. Something more transparent that can be controlled by the people running the host or ceph cluster is definitely needed: https://wiki.ceph.com/Planning/Blueprints/Emperor/Kernel_client_read_ahead_optimization Regards, Christian -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com