On 07/09/2014 09:53 AM, Christian Balzer wrote:
On Wed, 09 Jul 2014 07:07:50 -0500 Mark Nelson wrote:

On 07/09/2014 06:52 AM, Xabier Elkano wrote:
On 09/07/14 13:10, Mark Nelson wrote:
On 07/09/2014 05:57 AM, Xabier Elkano wrote:


Hi,

I was doing some tests in my cluster with the fio tool: one fio
instance with 70 jobs, each job writing 1GB of random data with a 4K
block size. I ran this test with 3 variations (a sketch of the job
file these parameters suggest follows the list):

1- Creating 70 images, 60GB each, in the pool. Using the rbd kernel
module, format and mount each image as ext4. Each fio job writes to
a separate image/directory. (ioengine=libaio, queue_depth=4,
direct=1)

      IOPS: 6542
      AVG LAT: 41ms

2- Creating 1 large image (4.2TB) in the pool. Using the rbd kernel
module, format and mount the image as ext4. Each fio job writes to a
separate file in the same directory. (ioengine=libaio,
queue_depth=4, direct=1)

     IOPS: 5899
     AVG LAT:  47ms

3- Creating 1 large image (4.2TB) in the pool. Using the rbd ioengine
in fio to access the image through librados. (ioengine=rbd,
queue_depth=4, direct=1)

     IOPS: 2638
     AVG LAT: 96ms
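
For concreteness, here is a minimal sketch of the kind of job file
these parameters suggest. It is a sketch only: the file and directory
names are hypothetical, fio's "iodepth" option is assumed to be what
"queue_depth=4" refers to, and for variation 1 directory= would
instead carry a colon-separated list of the 70 mount points:

   # Sketch only -- names are hypothetical, parameters from the tests above.
   cat > rand-write-4k.fio <<'EOF'
   ; variation 2: all 70 jobs write separate files under one mount point
   [global]
   ioengine=libaio
   iodepth=4
   direct=1
   rw=randwrite
   bs=4k
   size=1g
   numjobs=70

   [rand-write-4k]
   directory=/mnt/fiotest/vtest0
   EOF
   fio rand-write-4k.fio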

Do these results make sense? From a Ceph perspective, is it better to
have many small images than one large one? What is the best approach
to simulating the workload of 70 VMs?

I'm not sure the difference between the first two cases is enough to
say much yet.  You may need to repeat the test a couple of times to
ensure that the difference is more than noise.  Having said that, if
we are seeing an effect, it would be interesting to know what the
latency distribution is like.  Is it consistently worse in the 2nd
case, or do we see higher spikes at specific times?

I've repeated the tests with similar results. Each test is done with a
clean new rbd image, first removing any existing images in the pool and
then creating the new image. Between tests I am running:

   echo 3 > /proc/sys/vm/drop_caches

- In the first test I created 70 images (60G each) and mounted them (a
sketch of the implied setup follows the listing):

/dev/rbd1 on /mnt/fiotest/vtest0
/dev/rbd2 on /mnt/fiotest/vtest1
..
/dev/rbd70 on /mnt/fiotest/vtest69
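
(For reference, a sketch of the per-image setup this listing implies.
The pool name is the default "rbd" and is an assumption here; older
rbd CLIs do not print the mapped device, in which case "rbd
showmapped" is needed instead.)

   # Sketch only: create, map, format and mount the 70 images.
   for i in $(seq 0 69); do
       rbd create "vtest$i" --size 61440       # 60 GB; --size is in MB
       dev=$(rbd map "vtest$i")                # newer rbd prints the device
       mkfs.ext4 -q "$dev"
       mkdir -p "/mnt/fiotest/vtest$i"
       mount "$dev" "/mnt/fiotest/vtest$i"
   done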

fio output:

rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul  8 14:52:56 2014
    write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
      slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
      clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
       lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
      clat percentiles (msec):
       |  1.00th=[    5],  5.00th=[   10], 10.00th=[   13], 20.00th=[   18],
       | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31], 60.00th=[   34],
       | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48], 95.00th=[   61],
       | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494], 99.95th=[  515],
       | 99.99th=[  553]
      bw (KB  /s): min=    0, max=  694, per=1.46%, avg=383.29, stdev=148.01
      lat (usec) : 1000=0.01%
      lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
      lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
    cpu          : usr=0.69%, sys=2.57%, ctx=1525021, majf=0, minf=2405
    IO depths    : 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       issued    : total=r=0/w=655015/d=0, short=r=0/w=0/d=0
       latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
    WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s, maxb=26178KB/s, mint=100116msec, maxt=100116msec

Disk stats (read/write):
    rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432, in_queue=39459720, util=99.68%

- In the second test I only created one large image (4.2T):

/dev/rbd1 on /mnt/fiotest/vtest0 type ext4 (rw,noatime,nodiratime,data=ordered)

fio output:

rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul  9 13:38:14 2014
    write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
      slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
      clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
       lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
      clat percentiles (msec):
       |  1.00th=[    5],  5.00th=[   11], 10.00th=[   14], 20.00th=[   19],
       | 30.00th=[   24], 40.00th=[   29], 50.00th=[   33], 60.00th=[   36],
       | 70.00th=[   39], 80.00th=[   43], 90.00th=[   51], 95.00th=[   68],
       | 99.00th=[  506], 99.50th=[  553], 99.90th=[  717], 99.95th=[  783],
       | 99.99th=[ 3130]
      bw (KB  /s): min=    0, max=  680, per=1.54%, avg=355.39, stdev=156.10
      lat (usec) : 1000=0.01%
      lat (msec) : 2=0.12%, 4=0.66%, 10=4.21%, 20=17.82%, 50=66.95%
      lat (msec) : 100=7.34%, 250=0.78%, 500=1.10%, 750=0.99%, 1000=0.02%
      lat (msec) : >=2000=0.04%
    cpu          : usr=0.65%, sys=2.45%, ctx=1434322, majf=0, minf=2399
    IO depths    : 1=0.2%, 2=0.1%, 4=365.4%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       issued    : total=r=0/w=579510/d=0, short=r=0/w=0/d=0
       latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
    WRITE: io=2264.6MB, aggrb=23142KB/s, minb=23142KB/s, maxb=23142KB/s, mint=100198msec, maxt=100198msec

Disk stats (read/write):
    rbd1: ios=0/2295106, merge=0/926648, ticks=0/39660664, in_queue=39706288, util=99.80%



It seems that latency is more stable in the first case.

So I guess what comes to mind is that when you have all of the fio
processes writing to files on a single file system, there's now a whole
additional layer of locks and contention.  Not sure how likely this is,
though.  Josh might be able to chime in on whether there's something on
the RBD side that could slow this kind of use case down.



In case 3, do you have multiple fio jobs going or just 1?
In all three cases, I am using one fio process with NUMJOBS=70
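
(For concreteness, a sketch of what a case-3 job file would look like
with fio's rbd engine. The pool, image, and client names here are
assumptions, not taken from the thread:)

   # Sketch only for case 3: all 70 jobs hit one image via librbd.
   cat > rand-write-4k-rbd.fio <<'EOF'
   ; clientname/pool/rbdname below are assumed values
   [global]
   ioengine=rbd
   clientname=admin
   pool=rbd
   rbdname=vtest0
   iodepth=4
   direct=1
   rw=randwrite
   bs=4k
   size=1g
   numjobs=70

   [rand-write-4k-rbd]
   EOF
   fio rand-write-4k-rbd.fio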

Is RBD cache enabled?  It's interesting that librbd is so much slower in
this case than kernel RBD for you.  If anything I would have expected
the opposite.
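
(For reference, librbd's cache is switched on from the client side in
ceph.conf; a minimal sketch, with sizes that are illustrative rather
than taken from this cluster:)

   # Sketch only: enable librbd caching for clients on this host.
   cat >> /etc/ceph/ceph.conf <<'EOF'
   [client]
   rbd cache = true
   # 32 MB cache, 24 MB dirty limit -- illustrative values only
   rbd cache size = 33554432
   rbd cache max dirty = 25165824
   rbd cache writethrough until flush = true
   EOF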

Come again?
User-space RBD with the default values will have little to no impact in
this scenario.

Whereas kernel-space RBD will be able to use every last byte of memory
for page cache, totally ousting user-space RBD.

I was under the impression this was direct IO?  Did I misread something?


Regards,

Christian




thanks in advance for any help,
Xabier
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
