Hi Michal,

Yeah, it looks like there is something wrong with FIO and POSIX AIO, as the reads 
don’t seem to be really asynchronous. I don’t know why this is happening (it might 
be related to the parameters I’m using), but it is really bothering me.

What is bothering me even more is that on Amazon EBS volumes I get considerably 
better results than with our Ceph cluster. We compiled Hammer with jemalloc 
support and now we are getting considerably better results, but we are still not there. 
Obviously I don’t have any idea how EBS works internally, but IMO Ceph should 
get close to EBS, and I would love to identify the bottleneck (if there is one).

At least we really improved the latency with those changes, but I still 
have to invest more time in this. Right now I’m busy with other projects (and the 
improvements we managed to get are quite substantial), but I definitely want 
to spend more time debugging this.


From: Michal Kozanecki [mailto:michal.kozane...@live.ca]
Sent: Saturday, March 11, 2017 1:36
To: dilla...@redhat.com; Xavier Trilla <xavier.tri...@silicontower.net>
CC: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Posix AIO vs libaio read performance

Hi Xavier,

Are you sure this is due to Ceph? I get similar results on my bare-metal hosts 
(no Ceph anywhere in sight) with posix-aio vs libaio:

POSIX-AIO on baremetal (E3-1240v2, Debian Jessie 8.7, Linux 4.9.13, S3500 80GB):
randread-posix: (groupid=0, jobs=1): err= 0: pid=4644: Fri Mar 10 19:26:23 2017
  read : io=1024.0MB, bw=21243KB/s, iops=5310, runt= 49361msec

LIBAIO on baremetal (E3-1240v2, Debian Jessie 8.7, Linux 4.9.13, S3500 80GB):
randread-libaio: (groupid=0, jobs=1): err= 0: pid=32712: Fri Mar 10 19:24:33 
  read : io=1024.0MB, bw=272570KB/s, iops=68142, runt=  3847msec
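
As a quick sanity check on the two summaries above, the reported bandwidth and IOPS can be cross-checked against the 4k block size. This is a minimal sketch: the figures are copied from the fio output, while the helper name `iops_from_bandwidth` and the use of integer division are illustrative assumptions.

```python
# Figures copied from the two fio summaries above (bare metal, 4k random reads).
BLOCK_SIZE_KB = 4  # --bs=4k

runs = {
    "posixaio": {"bw_kb_s": 21243, "iops": 5310},
    "libaio": {"bw_kb_s": 272570, "iops": 68142},
}


def iops_from_bandwidth(bw_kb_s: int, block_size_kb: int = BLOCK_SIZE_KB) -> int:
    """IOPS implied by the reported bandwidth (integer division, an assumption)."""
    return bw_kb_s // block_size_kb


for engine, r in runs.items():
    implied = iops_from_bandwidth(r["bw_kb_s"])
    print(f"{engine}: reported {r['iops']} IOPS, implied by bandwidth {implied}")

# How much faster libaio is than posixaio on this host, in IOPS terms.
speedup = runs["libaio"]["iops"] / runs["posixaio"]["iops"]
print(f"libaio / posixaio IOPS ratio: {speedup:.1f}x")
```

Both summaries are internally consistent, and the gap between the engines on this host is a bit under 13x with no Ceph involved.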

Michal Kozanecki

On March 10, 2017 at 2:28:23 PM, Xavier Trilla 
(xavier.tri...@silicontower.net) wrote:
Hi Jason,

Just to add more information:

- The issue doesn't seem to be fio or glibc (guest) related, as it is working 
properly in other environments using the same software versions. I've also 
tried Ubuntu 14.04 and 16.04 and I'm getting very similar results, but 
I'll run more tests just to be 100% sure.
- If I increase the number of concurrent jobs in fio (e.g. to 16), results are much 
better (they get above 10k IOPS).
- I'm seeing similarly bad results when using KRBD, but I still need to run more 
tests on this front. (I'm using KRBD from inside a VM, because in our 
infrastructure getting your hands on a physical test machine is quite 
difficult, but I'll manage. The VM has a 10G connection, and I'm mounting the RBD 
volume from inside the VM using the kernel module (4.4), so the result should 
give an idea of how KRBD will perform.)
- I'm not seeing improvements with librbd compiled with jemalloc support.
- No difference between QEMU 2.0, 2.5 or 2.7.
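
One way to read the job-count observation is through Little's law (requests in flight = throughput x latency). The sketch below uses the single-job IOPS figures from the original post in this thread. The idea that a posixaio job may end up with roughly one read in flight is only a hypothesis to check (for example with blktrace), not something established here, and the helper name `latency_ms` is illustrative.

```python
def latency_ms(iops: float, in_flight: int) -> float:
    """Little's law rearranged: mean latency = requests in flight / throughput."""
    return in_flight / iops * 1000.0


# Single-job figures from the original post (4k random reads, --iodepth=32).
POSIX_IOPS = 632
LIBAIO_IOPS = 36967

print(f"libaio,   32 in flight: {latency_ms(LIBAIO_IOPS, 32):.2f} ms per read")
print(f"posixaio, 32 in flight: {latency_ms(POSIX_IOPS, 32):.2f} ms per read")
print(f"posixaio,  1 in flight: {latency_ms(POSIX_IOPS, 1):.2f} ms per read")

# 16 jobs reaching "above 10k IOPS" means at least this much per job.
per_job_iops = 10_000 / 16
print(f"per-job IOPS with 16 jobs: at least {per_job_iops:.0f}")
```

If the posixaio job really kept 32 reads in flight, each read would be taking about 50 ms, which seems unlikely on an all-SSD cluster. About 1.6 ms with a single read in flight looks more plausible, and it would also fit throughput growing with the number of jobs.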

It looks like it's related to an interaction between how POSIX AIO handles direct 
reads and how Ceph works, but it could also be KVM related. I could argue it's 
related to this being networked storage, but in other environments such as 
Amazon EBS I'm not seeing this issue. Obviously I don't have any idea 
about EBS internals (but I guess that's what we are trying to match... if it 
works properly on EBS it should work properly on Ceph too ;) Also, I'm still 
trying to verify if this is just related to my setup or affects all ceph 

One of the things I find strangest is the performance difference on the 
read side. Libaio performance is way better in both read and write, but 
the biggest difference is between posix aio read and librbd read.

BTW: Do you have a test environment where you could test fio using posix aio? 
I've been running tests in our production and test clusters, but they run almost 
the same version (hammer) of everything :/ Maybe I'll try to deploy a new 
cluster using jewel (if I can get my hands on enough hardware). Here are the 
command lines for FIO:

fio --name=randread-posix --runtime 60 --ioengine=posixaio --buffered=0 \
    --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32


fio --name=randread-libaio --runtime 60 --ioengine=libaio --buffered=0 \
    --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32
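
To make the two runs easier to compare side by side, the same command lines can be driven from a small script that asks fio for JSON output. This is a rough sketch, not a tested harness: the function names (`fio_command`, `read_summary`, `compare`) are hypothetical, and it assumes fio's `--output-format=json` report exposes `jobs[0].read.iops` and `jobs[0].read.bw`, which may vary between fio versions.

```python
import json
import subprocess


def fio_command(name: str, engine: str) -> list:
    """Build the fio invocation from this message for a given ioengine."""
    return [
        "fio",
        f"--name={name}",
        "--runtime=60",
        f"--ioengine={engine}",
        "--buffered=0",
        "--direct=1",
        "--rw=randread",
        "--bs=4k",
        "--size=1024m",
        "--iodepth=32",
        "--output-format=json",  # machine-readable report instead of the text summary
    ]


def read_summary(fio_json: str) -> dict:
    """Pull the read IOPS and bandwidth of the first job out of fio's JSON report."""
    job = json.loads(fio_json)["jobs"][0]
    return {
        "job": job["jobname"],
        "iops": job["read"]["iops"],
        "bw": job["read"]["bw"],
    }


def compare() -> None:
    """Run both engines back to back and print one summary line per engine."""
    for name, engine in (("randread-posix", "posixaio"), ("randread-libaio", "libaio")):
        result = subprocess.run(
            fio_command(name, engine), check=True, capture_output=True, text=True
        )
        print(read_summary(result.stdout))
```

Calling `compare()` from inside the VM (with fio installed) would run the POSIX AIO job and then the libaio job, using the same parameters as the command lines above.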

Also, thanks for the blktrace tip. On Monday I'll start playing with it and I'll 
post my findings.


-----Original Message-----
From: Jason Dillaman [mailto:jdill...@redhat.com]
Sent: Friday, March 10, 2017 19:18
To: Xavier Trilla 
CC: Alexandre DERUMIER <aderum...@odiso.com>; 
ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Posix AIO vs libaio read performance

librbd doesn't know that you are using libaio vs POSIX AIO. Therefore, the best 
bet is that the issue is in fio or glibc. As a first step, I would recommend 
using blktrace (or similar) within your VM to determine if there is a delta 
between libaio and POSIX AIO at the block level.

On Fri, Mar 10, 2017 at 12:28 PM, Xavier Trilla 
<xavier.tri...@silicontower.net> wrote:
> I disabled rbd cache but saw no improvement, just a huge performance drop
> in writes (which proves the cache was properly disabled).
> Now I’m working on three other fronts:
> - Using librbd with jemalloc in the hypervisors (Hammer .10)
> - Compiling QEMU with jemalloc (QEMU 2.6)
> - Running some tests from a bare-metal server using the FIO tool, but it
> will use librbd directly, so there is no way to simulate POSIX AIO (maybe
> I’ll try via KRBD)
> I’m quite sure it is something on the client side, but I don’t know
> enough about the Ceph internals to completely rule out the issue being
> related to the OSDs.
> But so far the performance of the OSDs is really good using other test
> engines, so I’m working more on the client side.
> Any help or information would be really welcome :)
> Thanks.
> Xavier.
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of
> Xavier Trilla
> Sent: Friday, March 10, 2017 14:13
> To: Alexandre DERUMIER <aderum...@odiso.com>
> CC: ceph-users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Posix AIO vs libaio read performance
> Hi Alexandre,
> Debugging is disabled in client and OSDs.
> Regarding rbd cache, it is something I will try (I was thinking about it
> today), but I have not tried it yet because I don't want to reduce write speed.
> I also tried iothreads, but saw no benefit.
> I also tried both virtio-blk and virtio-scsi; there is a small
> improvement with virtio-blk, but it's around 10%.
> This is becoming quite a strange issue, as it only affects POSIX AIO
> read performance. Nothing else seems to be affected, although POSIX
> AIO write is nowhere near libaio performance.
> Thanks for your help; if you have any other ideas they will be really
> appreciated.
> Also, if somebody could run the following command in their cluster from
> inside a VM:
> fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio
> --buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32
> It would be really helpful to know if I'm the only one affected or
> this is happening in all qemu + ceph setups.
> Thanks!
> Xavier
> On Mar 10, 2017, at 8:07, Alexandre DERUMIER <aderum...@odiso.com> wrote:
> But it still looks like there is some bottleneck in QEMU or librbd I
> cannot manage to find.
> You can improve latency on the client by disabling debug logging.
> On your client, create a /etc/ceph/ceph.conf with:
> [global]
> debug asok = 0/0
> debug auth = 0/0
> debug buffer = 0/0
> debug client = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug filer = 0/0
> debug filestore = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug journal = 0/0
> debug journaler = 0/0
> debug lockdep = 0/0
> debug mds = 0/0
> debug mds balancer = 0/0
> debug mds locker = 0/0
> debug mds log = 0/0
> debug mds log expire = 0/0
> debug mds migrator = 0/0
> debug mon = 0/0
> debug monc = 0/0
> debug ms = 0/0
> debug objclass = 0/0
> debug objectcacher = 0/0
> debug objecter = 0/0
> debug optracker = 0/0
> debug osd = 0/0
> debug paxos = 0/0
> debug perfcounter = 0/0
> debug rados = 0/0
> debug rbd = 0/0
> debug rgw = 0/0
> debug throttle = 0/0
> debug timer = 0/0
> debug tp = 0/0
> You can also disable the RBD cache (rbd_cache=false) or set cache=none in QEMU.
> Using an iothread on the QEMU drive should help a little bit too.
> ----- Original Message -----
> From: "Xavier Trilla" <xavier.tri...@silicontower.net>
> To: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Friday, March 10, 2017 05:37:01
> Subject: Re: [ceph-users] Posix AIO vs libaio read performance
> Hi,
> We compiled Hammer .10 to use jemalloc and now the cluster performance
> has improved a lot, but POSIX AIO operations are still considerably slower
> than libaio.
> Now, with a single thread, read operations are about 1000 per second and
> write operations about 5000 per second.
> Using the same FIO configuration but with libaio, read operations are about
> 15K per second and writes 12K per second.
> I’m compiling QEMU with jemalloc support as well, and I’m planning to
> replace librbd on the QEMU hosts with the new one using jemalloc.
> But it still looks like there is some bottleneck in QEMU or librbd I
> cannot manage to find.
> Any help will be much appreciated.
> Thanks.
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of
> Xavier Trilla
> Sent: Thursday, March 9, 2017 6:56
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Posix AIO vs libaio read performance
> Hi,
> I’m trying to debug why there is a big difference between POSIX AIO and
> libaio when performing read tests from inside a VM using librbd.
> The results I’m getting using FIO are:
> POSIX AIO Read:
> Type: Random Read - IO Engine: POSIX AIO - Buffered: No - Direct: Yes
> - Block Size: 4KB - Disk Target: /:
> Average: 2.54 MB/s
> Average: 632 IOPS
> Libaio Read:
> Type: Random Read - IO Engine: Libaio - Buffered: No - Direct: Yes
> - Block Size: 4KB - Disk Target: /:
> Average: 147.88 MB/s
> Average: 36967 IOPS
> When performing writes the differences aren’t so big, because the
> cluster (which is in production right now) is CPU bound:
> POSIX AIO Write:
> Type: Random Write - IO Engine: POSIX AIO - Buffered: No - Direct: Yes
> - Block Size: 4KB - Disk Target: /:
> Average: 14.87 MB/s
> Average: 3713 IOPS
> Libaio Write:
> Type: Random Write - IO Engine: Libaio - Buffered: No - Direct: Yes
> - Block Size: 4KB - Disk Target: /:
> Average: 14.51 MB/s
> Average: 3622 IOPS
> Even if the write results are CPU bound, as the machines containing
> the OSDs don’t have enough CPU to handle all the IOPS (CPU upgrades
> are on their way), I cannot really understand why I’m seeing so much
> difference in the read tests.
> Some configuration background:
> - Cluster and clients are using Hammer 0.94.90
> - It’s a full SSD cluster running over Samsung Enterprise SATA SSDs,
> with all the typical tweaks (Customized ceph.conf, optimized sysctl,
> etc…)
> - Tried QEMU 2.0 and 2.7 – Similar results
> - Tried virtio-blk and virtio-scsi – Similar results
> I’ve been reading about POSIX AIO and Libaio, and I can see there are
> several differences in how they work (like one being user space and
> the other one being kernel), but I don’t really get why Ceph has such
> problems handling POSIX AIO read operations, but not write operations,
> and how to avoid them.
> Right now I’m trying to identify if it’s something wrong with our Ceph
> cluster setup, with Ceph in general or with QEMU (virtio-scsi or
> virtio-blk, as both have the same behavior).
> If you would like to try to reproduce the issue, here are the two
> command lines I’m using:
> fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio
> --buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32
> fio --name=randread-libaio --output ./test --runtime 60 --ioengine=libaio
> --buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32
> If you could shed any light on this it would be really helpful, as
> right now, although I still have some ideas left to try, I don’t
> have much idea why this is happening…
> Thanks!
> Xavier
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com