Since I now have a qemu with RBD userspace support, let's add this to what is listed below.
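For reference, "standard cache enabled (in both ceph.conf and qemu)" means settings along these lines. This is only a sketch with default-ish values rather than a copy of my actual config, and the pool/image name in the qemu example is made up:
---
# /etc/ceph/ceph.conf on the compute node
[client]
    rbd cache = true
    # optional tuning knobs, otherwise the library defaults apply:
    # rbd cache size = 33554432
    # rbd cache max dirty = 25165824

# qemu needs a matching writeback cache mode on the drive, e.g.:
# -drive file=rbd:rbd/tvm-02-disk1,if=virtio,cache=writeback
---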
Inside the Wheezy VM, RBD userspace, standard cache enabled (in both
ceph.conf and qemu):
---
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
tvm-02           8G           224015  34 57507  15           84728  13  2643  74
---
So basically twice as fast as the kernelspace variant below.

Regards,

Christian

On Mon, 7 Apr 2014 19:01:30 +0900 Christian Balzer wrote:

>
> Hello,
>
> Nothing new, I know. But some numbers to mull and ultimately weep over.
>
> Ceph cluster based on Debian Jessie (thus ceph 0.72.x), 2 nodes, 2 OSDs
> each.
> Infiniband 4xQDR, IPoIB interconnects, 1 GByte/s bandwidth end to end.
> There was nothing going on aside from the tests.
>
> Just going to use the bonnie++ values for throughput to keep it simple
> and short.
>
> On the OSD itself:
> ---
> Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> ceph-01         64G           1309731  96 467763  51           1703299  79 784.0  32
> ---
>
> On a compute node, host side:
> ---
> Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> comp-02        256G           296928  60 64216  16           145015  17 291.6  12
> ---
> Ouch. Well, the write speed is probably the OSD journal SSDs being hobbled
> by being on SATA-2 links of the onboard AMD chipset. I had planned for
> that shortcoming, alas the cheap and cheerful Marvell 88SE9230 based
> PCIe x4 controller can't get a stable link under any Linux kernel I tried.
> OTOH, I don't expect more than 30MB/s average writes for all the VMs
> combined.
> Despite having been aware of the sequential read speed issues, I really
> was disappointed here. 10% of a single OSD. The OSD processes and actual
> disks were bored stiff during the read portion of that bonnie run.
>
> OK, let's increase the read-ahead (no or negative effects on the OSDs, FYI,
> since I've seen that mentioned a few times as well).
> So after a "echo 4096 > /sys/block/vda/queue/read_ahead_kb" we get:
> ---
> Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> comp-02        256G           280277  44 158633  30           655827  46 577.9  17
> ---
> Better, not great, but certainly around what I expected.
>
> So let's see how this looks inside a VM (Wheezy). This is Ganeti on Jessie,
> thus no qemu caching and kernelspace RBD (no qemu with userspace support
> outside sid/experimental yet):
> ---
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> fp-001           8G           170374  29 27599   7           34059   5 328.0  12
> ---
> Le mega ouch. So writes are down to 10% of the OSD and the reads are...
> deplorable.
> Setting the read-ahead inside the VM to 4MB gives us about 380MB/s reads,
> so in line with the writes, that is half of the host speed.
> I will test this with userspace qemu when available.
>
> However setting the read-ahead may not be a feasible option, be it access
> to the VM, it being upgraded, etc.
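For what it's worth, if one does have access to the guest, the read-ahead bump can at least be made persistent with a udev rule along these lines (just a sketch, assuming a udev-based distro and virtio disks), but it still means touching every single VM:
---
# /etc/udev/rules.d/99-virtio-readahead.rules inside the guest (sketch)
SUBSYSTEM=="block", KERNEL=="vd[a-z]", ACTION=="add|change", ATTR{queue/read_ahead_kb}="4096"
---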
> Something more transparent that can be controlled by the people running
> the host or ceph cluster is definitely needed:
> https://wiki.ceph.com/Planning/Blueprints/Emperor/Kernel_client_read_ahead_optimization
>
> Regards,
>
> Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com