O_DIRECT is _not_ a flag for synchronous blocking IO. O_DIRECT only hints to the kernel that it need not cache/buffer the data. The kernel is still free to buffer and cache it, and in practice it does buffer it. It also does _not_ flush O_DIRECT writes to disk; it merely makes a best effort to hand them to the drives as soon as possible (where they can sit in the drive's write cache). Completion of an O_DIRECT request is no guarantee that the data is on disk at all.
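To make that concrete, here is a minimal C sketch (illustrative only; the file name and block size are arbitrary): the O_DIRECT write can complete long before the data is durable, and it is the explicit fdatasync() (or opening with O_SYNC/O_DSYNC) that actually asks for durability.

/* O_DIRECT bypasses the page cache; it does not promise durability. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT needs an aligned buffer, offset and length. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 'x', 4096);

    /* Completes once the request is handed down, but the data may still
     * sit in a volatile cache somewhere below us. */
    if (pwrite(fd, buf, 4096, 0) != 4096) { perror("pwrite"); return 1; }

    /* Only this asks the kernel (and the device) to make it durable. */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    free(buf);
    close(fd);
    return 0;
}

This is also the distinction fio exposes as direct=1 versus sync=1 / fsync=1.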
In effect, you can issue parallel O_DIRECT requests and they will scale with queue depth, but ordering is not guaranteed and neither is crash safety.

Btw, "innodb_flush_log_at_trx_commit = 5" does not do what you think it does. Its only valid values are:

0 - flush only periodically; not crash consistent (most data should be there somewhere, but it requires a lengthy manual recovery)
1 - flush after every transaction (not after every write, as you illustrated); ACID compliant
2 - flush periodically; the database *should* be crash consistent, but you can lose some transactions

No other value does anything:

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 2     |
+--------------------------------+-------+
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1     |
+--------------------------------+-------+
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 5;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 2     |
+--------------------------------+-------+
1 row in set (0.00 sec)

On Ceph, you either need to live with a maximum of roughly 200 (serializable) transactions/sec (one synchronous log flush per commit, at a few milliseconds each), settle for innodb_flush_log_at_trx_commit = 2 and lose the tail of transactions on a crash, or put the InnoDB log files on a separate device which will survive a crash (DRBD across several nodes, a physical SSD, ...).
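If you want to see where that ceiling comes from on a given volume, a rough probe along these lines (just a sketch; the file name and iteration count are arbitrary) times the write-then-flush pattern that innodb_flush_log_at_trx_commit = 1 produces; commits/sec cannot be better than the inverse of the per-flush latency.

/* Time N write+fdatasync pairs, i.e. the per-commit redo log flush cost. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int iters = 200;
    int fd = open("latency_probe.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[512];
    memset(buf, 'x', sizeof buf);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        /* Small "log record" written in place, then forced out. */
        if (pwrite(fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf) {
            perror("pwrite"); return 1;
        }
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f ms per flush, ~%.0f commits/sec\n",
           secs * 1000.0 / iters, iters / secs);

    close(fd);
    return 0;
}

Run it on the filesystem holding the InnoDB log files and compare against a local SSD; the gap should roughly mirror the tps gap you are seeing.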
Jan

> On 26 Feb 2016, at 10:49, Huan Zhang <huan.zhang...@gmail.com> wrote:
>
> fio /dev/rbd0 sync=1 has no problem.
> Doesn't find 'sync cache code' in linux rbd block driver and radosgw api.
> Seems sync cache is just the concept of librbd (for rbd cache).
> Just my concerns.
>
> 2016-02-26 17:30 GMT+08:00 Huan Zhang <huan.zhang...@gmail.com <mailto:huan.zhang...@gmail.com>>:
> Hi Nick,
> DB's IO pattern depends on config, mysql for example.
> innodb_flush_log_at_trx_commit =1, mysql will sync after one transcation.
> like:
> write
> sync
> wirte
> sync
> ...
>
> innodb_flush_log_at_trx_commit = 5,
> write
> write
> write
> write
> write
> sync
>
> innodb_flush_log_at_trx_commit = 0,
> write
> write
> ...
> one second later.
> sync.
>
> may not very accurate, but more or less.
>
> We test mysql tps, with nnodb_flush_log_at_trx_commit =1, get very poor
> performance even if we can reach very high O_DIRECT randwrite iops with fio.
>
> 2016-02-26 16:59 GMT+08:00 Nick Fisk <n...@fisk.me.uk <mailto:n...@fisk.me.uk>>:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> > <mailto:ceph-users-boun...@lists.ceph.com>] On Behalf Of
> > Huan Zhang
> > Sent: 26 February 2016 06:50
> > To: Jason Dillaman <dilla...@redhat.com <mailto:dilla...@redhat.com>>
> > Cc: josh durgin <josh.dur...@inktank.com <mailto:josh.dur...@inktank.com>>;
> > Nick Fisk <n...@fisk.me.uk <mailto:n...@fisk.me.uk>>;
> > ceph-users <ceph-us...@ceph.com <mailto:ceph-us...@ceph.com>>
> > Subject: Re: [ceph-users] Guest sync write iops so poor.
> >
> > rbd engine with fsync=1 seems stuck.
> > Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> > 1244d:10h:39m:18s]
> >
> > But fio using /dev/rbd0 sync=1 direct=1 ioengine=libaio iodepth=64, get very
> > high iops ~35K, similar to direct wirte.
> >
> > I'm confused with that result, IMHO, ceph could just ignore the sync cache
> > command since it always use sync write to journal, right?
>
> Even if the data is not sync'd to the data storage part of the OSD, the data
> still has to be written to the journal and this is where the performance
> limit lies.
>
> The very nature of SDS means that you are never going to achieve the same
> latency as you do to a local disk as even if the software side introduced no
> extra latency, just the network latency will severely limit your sync
> performance.
>
> Do you know the IO pattern the DB's generate? I know you can switch most DB's
> to flush with O_DIRECT instead of sync, it might be this helps in your case.
>
> Also check out the tech talk from last month about high performance databases
> on Ceph. The presenter gave the impression that, at least in their case, not
> every write was a sync IO. So your results could possibly matter less than
> you think.
>
> Also please search the lists and past presentations about reducing write
> latency. There are a few things you can do like disabling logging and some
> kernel parameters to stop the CPU's entering sleep states/reducing frequency.
> One thing I witnessed that if the Ceph cluster is only running at low queue
> depths, so it's only generating low cpu load, all the cores on the CPU's
> throttle themselves down to their lowest speeds, which really hurts latency.
>
> > Why we get so bad sync iops, how ceph handle it?
> > Very appreciated to get your reply!
> >
> > 2016-02-25 22:44 GMT+08:00 Jason Dillaman <dilla...@redhat.com
> > <mailto:dilla...@redhat.com>>:
> > > 35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't
> > actually
> > > work. Or it's not touching the same object (but I wonder whether write
> > > ordering is preserved at that rate?).
> >
> > The fio rbd engine does not support "sync=1"; however, it should support
> > "fsync=1" to accomplish roughly the same effect.
> >
> > Jason
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com