The write is ACKed back to the client as soon as it is in the journal. I
suspect that the primary OSD dispatches the write to all the secondary OSDs
at the same time so that the replica writes happen in parallel, but I am
not an authority on that.
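
Roughly, this is how I picture it (just an illustrative Python sketch of my
understanding, not Ceph's actual code; the OSD/journal objects and names
here are made up):

import concurrent.futures

class Journal:
    def append(self, data):
        pass  # stand-in for a durable journal write

class OSD:
    def __init__(self, name):
        self.name = name
        self.journal = Journal()

def journal_write(osd, data):
    # Returns once this OSD's journal write is durable.
    osd.journal.append(data)
    return osd.name

def replicate_write(primary, secondaries, data):
    # The primary sends the write to every replica at once, so the
    # per-replica journal writes overlap instead of running back to back.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(journal_write, osd, data)
                   for osd in [primary] + secondaries]
        # The client is only ACKed after every replica has journaled the
        # write, so the latency is roughly the max of the journal writes,
        # not their sum.
        concurrent.futures.wait(futures)
    return "ack to client"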

The journal writes data serially even if it comes in randomly. Data is
allowed to sit in the journal for some time before it has to be flushed to
the data disk. When it is flushed, the writes can be reordered and
consolidated into batches so that the flush is as efficient as possible.
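
Conceptually something like this (again only a sketch of the idea; the real
FileStore journal and its flush/sync intervals are more involved, and the
5-second figure below is just a placeholder):

import time

class FileStoreJournal:
    # Not Ceph's real implementation, just the idea.
    def __init__(self, flush_interval=5.0):
        self.pending = []                  # writes parked in the journal
        self.flush_interval = flush_interval
        self.last_flush = time.time()

    def append(self, offset, data):
        # Journal appends are sequential on disk no matter how random the
        # logical offsets of the incoming writes are.
        self.pending.append((offset, data))

    def maybe_flush(self, data_disk_write):
        # After sitting in the journal for a while, writes are flushed to
        # the data disk; sorting by offset lets them be merged into fewer,
        # larger, more sequential IOs.
        if time.time() - self.last_flush >= self.flush_interval:
            for offset, data in sorted(self.pending, key=lambda w: w[0]):
                data_disk_write(offset, data)
            self.pending.clear()
            self.last_flush = time.time()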

That is why SSD journals can offer large performance improvements for Ceph:
the client gets its ACK as soon as the journal write is done, and the SSDs
absorb most of the random access. Remember that with an on-disk journal,
journal writes keep getting interrupted by reads and the head has to travel
all over the disk; an SSD journal buffers a lot of that, so the spindles
can spend more time servicing reads.
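
Back-of-the-envelope, with made-up but plausible numbers rather than
anything measured on your cluster:

# Illustrative numbers only, not measurements.
spindle_journal_write_ms = 13.0   # journal write competing with reads on a 7.2k drive
ssd_journal_write_ms     = 0.5    # typical SSD write latency
network_and_osd_overhead_ms = 1.0

# Client-visible latency is roughly the slowest replica's journal write
# plus overhead; the flush to the data disk happens later, off the
# critical path.
print(spindle_journal_write_ms + network_and_osd_overhead_ms)  # ~14 ms, ~70 writes/s per thread
print(ssd_journal_write_ms + network_and_osd_overhead_ms)      # ~1.5 ms, several hundred writes/s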

On Fri, Apr 24, 2015 at 11:40 AM, J David <j.david.li...@gmail.com> wrote:

> On Fri, Apr 24, 2015 at 10:58 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> > 7.2k drives tend to do about 80 iops at 4kb IO sizes; as the IO size
> > increases, the number of iops will start to fall. You will probably get
> > around 70 iops for 128kb. But please benchmark your raw disks to get some
> > accurate numbers if needed.
> >
> > Next when you use on-disk journals you write 1st to the journal and then
> > write the actual data. There is also a small levelDB write which stores
> ceph
> > metadata so depending on IO size you will get slightly less than half the
> > native disk performance.
> >
> > You then have 2 copies; as Ceph won't ACK until both copies have been
> > written, the average latency will tend to stray upwards.
>
> What is the purpose of the journal if Ceph waits for the actual write
> to complete anyway?
>
> I.e. with a hardware raid card with a BBU, the raid card tells the
> host that the data is guaranteed safe as soon as it has been written
> to the BBU.
>
> Does this also mean that all the writing internal to ceph happens
> synchronously?
>
> I.e. all these operations are serialized:
>
> copy1-journal-write -> copy1-data-write -> copy2-journal-write ->
> copy2-data-write -> OK, client, you're done.
>
> Since copy1 and copy2 are on completely different physical hardware,
> shouldn't those operations be able to proceed more or less
> independently?  And shouldn't the client be done as soon as the
> journal is written?  I.e.:
>
> copy1-journal-write -v- copy1-data-write
> copy2-journal-write -|- copy2-data-write
>                              +-> OK, client, you're done
>
> If so, shouldn't the effective latency be that of one operation, not
> four?  Plus all the non-trivial overhead for scheduling, LevelDB,
> network latency, etc.
>
> For the "getting jackhammered by zillions of clients" case, your
> estimate probably holds more true, because even if writes aren't in
> the critical path they still happen and sooner or later the drive runs
> out of IOPs and things start getting in each other's way.  But for a
> single client, single thread case where the cluster is *not* 100%
> utilized, shouldn't the effective latency be much less?
>
> The other thing about this that I don't quite understand, and the
> thing that initially had me questioning whether there was something
> wrong on the Ceph side, is that your estimate is based primarily on the
> mechanical capabilities of the drives.  Yet, in practice, when the
> Ceph cluster is tapped out for I/O in this situation, iostat says none
> of the physical drives are more than 10-20% busy and doing 10-20 IOPs
> to write a couple of MB/sec.  And those are the "loaded" ones at any
> given time.  Many are <10%.  In fact, *none* of the hardware on the
> Ceph side is anywhere close to fully utilized.  If the performance of
> this cluster is limited by its hardware, shouldn't there be some
> evidence of that somewhere?
>
> To illustrate, I marked a physical drive out and waited for things to
> settle down, then ran fio on the physical drive (128KB randwrite
> numjobs=1 iodepth=1).  It yields a very different picture of the
> drive's physical limits.
>
> The drive during "maxxed out" client writes:
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdl               0.00     0.20    4.80   13.40    23.60  2505.65   277.94     0.26   14.07   16.08   13.34   6.68  12.16
>
> The same drive under fio:
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdl               0.00     0.00    0.00  377.50     0.00 48320.00   256.00     0.99    2.62    0.00    2.62   2.62  98.72
>
> You could make the argument that we are seeing half the throughput
> on the same test because Ceph is write-doubling (journal+data) and the
> reason no drive is highly utilized is because the load is being spread
> out.  So each of 28 drives actually is being maxed out, but only 3.5%
> of the time, leading to low apparent utilization because the
> measurement interval is too long.  And maybe that is exactly what is
> happening.  For that to be true, the two OSD writes would indeed have
> to happen in parallel, not sequentially.  (Which is what it's supposed
> to do, I believe?)
>
> But why does a client have to wait for both writes?  Isn't the journal
> enough?  If it isn't, shouldn't it be?  And if it isn't, wouldn't
> moving to even an infinitely fast SSD journal only double the
> performance, since the second write still has to happen?
>
> In case they are of interest, the native drive fio results are below.
>
> testfile: (groupid=0, jobs=1): err= 0: pid=20562
>   write: io=30720MB, bw=47568KB/s, iops=371 , runt=661312msec
>     slat (usec): min=13 , max=4087 , avg=34.08, stdev=25.36
>     clat (usec): min=2 , max=736605 , avg=2650.22, stdev=6368.02
>      lat (usec): min=379 , max=736640 , avg=2684.80, stdev=6368.00
>     clat percentiles (usec):
>      |  1.00th=[  466],  5.00th=[ 1576], 10.00th=[ 1800], 20.00th=[ 1992],
>      | 30.00th=[ 2128], 40.00th=[ 2224], 50.00th=[ 2320], 60.00th=[ 2416],
>      | 70.00th=[ 2512], 80.00th=[ 2640], 90.00th=[ 2864], 95.00th=[ 3152],
>      | 99.00th=[10688], 99.50th=[20352], 99.90th=[29056], 99.95th=[29568],
>      | 99.99th=[452608]
>     bw (KB/s)  : min= 1022, max=88910, per=100.00%, avg=47982.41, stdev=7115.74
>     lat (usec) : 4=0.01%, 500=1.52%, 750=1.23%, 1000=0.14%
>     lat (msec) : 2=17.32%, 4=76.47%, 10=1.41%, 20=1.40%, 50=0.49%
>     lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
>   cpu          : usr=0.56%, sys=1.21%, ctx=252044, majf=0, minf=21
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=245760/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
>   WRITE: io=30720MB, aggrb=47567KB/s, minb=47567KB/s, maxb=47567KB/s, mint=661312msec, maxt=661312msec
>
> Disk stats (read/write):
>   sdl: ios=0/245789, merge=0/0, ticks=0/666944, in_queue=666556, util=98.28%
>
> Thanks!
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
