O_DIRECT is _not_ a flag for synchronous blocking IO.
O_DIRECT only hints to the kernel that it need not cache/buffer the data.
The kernel is still free to buffer and cache it, and it does buffer it.
It also does _not_ flush O_DIRECT writes to disk; it only makes a best effort 
to send them to the drives ASAP (where they can sit in the drive's cache).
A completed O_DIRECT request is not guaranteed to be on disk at all.

In effect, you can issue parallel O_DIRECT requests and they will scale with 
queue depth, but ordering is not guaranteed and they are not crash safe.
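
A minimal sketch of that distinction, assuming Linux and a filesystem that 
accepts O_DIRECT (the code falls back to buffered IO if it does not). The 
names write_block and BLOCK, and the 4096-byte block size, are made up for 
illustration. The point is only that the write() call returning says nothing 
about durability; the separate fsync() is what asks for the flush.

```python
import mmap
import os

BLOCK = 4096  # assumed block size; O_DIRECT needs aligned buffers and lengths


def write_block(path, payload, durable=True):
    """Write one zero-padded block. Returns True if O_DIRECT was actually used."""
    if len(payload) > BLOCK:
        raise ValueError("payload must fit in one block")
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    with mmap.mmap(-1, BLOCK) as buf:    # anonymous mapping: page-aligned, zero-filled
        buf.write(payload)
        for extra in (direct, 0):        # try O_DIRECT first, then plain buffered IO
            try:
                fd = os.open(path, flags | extra, 0o600)
                try:
                    os.write(fd, buf)    # completion does not mean "on disk"
                    if durable:
                        os.fsync(fd)     # explicit flush request to stable storage
                finally:
                    os.close(fd)
                return bool(extra)
            except OSError:
                if extra == 0:
                    raise                # buffered IO failed too: a real error
```

Calling write_block(path, data, durable=False) is the "fast but not crash 
safe" case described above; durable=True is the case a database relies on 
when it syncs its log.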

btw "innodb_flush_log_at_trx_commit = 5" does not do what you think it does. 
Its only valid values are:
0 - flush only periodically, not crash consistent (most data should be there 
somewhere but it does require a lengthy manual recovery)
1 - flush after every transaction (not every write as you illustrated), ACID 
2 - flush periodically, database *should* be crash consistent but you can lose 
some transactions

No other value does anything; an out-of-range value such as 5 is simply 
stored as 2:

mysql> show global variables like "innodb_flush_log_at_trx_commit";
| Variable_name                  | Value |
| innodb_flush_log_at_trx_commit | 2     |
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
| Variable_name                  | Value |
| innodb_flush_log_at_trx_commit | 1     |
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 5;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
| Variable_name                  | Value |
| innodb_flush_log_at_trx_commit | 2     |
1 row in set (0.00 sec)
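
As a rough illustration, here is a toy model of the behaviour described 
above. It is not InnoDB's code. The function names are invented, the "once 
per second" flush interval for settings 0 and 2 is an assumption based on 
the default flush timeout, and clamping of values below 0 is assumed by 
symmetry with the 5 -> 2 case shown in the transcript.

```python
VALID_SETTINGS = (0, 1, 2)


def effective_setting(requested):
    """Clamp a requested value into the valid range (5 ends up as 2)."""
    return max(VALID_SETTINGS[0], min(requested, VALID_SETTINGS[-1]))


def log_flushes(setting, commits, seconds):
    """Approximate number of log flushes for `commits` transactions over `seconds`."""
    if effective_setting(setting) == 1:
        return commits   # one flush per committed transaction
    return seconds       # settings 0 and 2: roughly one flush per second (assumed)


# 1000 commits in 10 seconds:
print(log_flushes(1, 1000, 10))  # flushes with setting 1
print(log_flushes(5, 1000, 10))  # "5" behaves like 2
```

So "5" does not mean "sync every five writes"; in this model it costs the 
same number of flushes as 2.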

On Ceph you have three options: live with a maximum of ~200 (serializable) 
transactions/sec, settle for innodb_flush_log_at_trx_commit = 2 and lose the 
tail of transactions in a crash, or put the innodb log files on a separate 
device (drbd across several nodes, physical SSD...) which will survive a crash.
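
A back-of-the-envelope sketch of where a ceiling like ~200/sec can come from, 
assuming every commit waits for one synced write and commits are issued one 
after another. The 5 ms latency below is an assumed figure chosen to match 
the ~200 above, not a measurement, and the function name is made up.

```python
def max_serial_commits_per_sec(sync_latency_ms, commits_per_flush=1):
    """Upper bound on commits/sec when each flush must finish before the next starts."""
    return commits_per_flush * 1000.0 / sync_latency_ms


print(max_serial_commits_per_sec(5))   # one commit per 5 ms synced write
print(max_serial_commits_per_sec(5, commits_per_flush=10))  # if 10 commits share a flush
```

Under this model the only ways up are lower sync latency (a faster log 
device) or fewer flushes per transaction (setting 2), which are the options 
listed above.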


> On 26 Feb 2016, at 10:49, Huan Zhang <huan.zhang...@gmail.com> wrote:
> fio on /dev/rbd0 with sync=1 has no problem.
> I can't find any 'sync cache' code in the linux rbd block driver or the 
> radosgw api. It seems sync cache is just a librbd concept (for rbd cache). 
> Just my concerns.
> 2016-02-26 17:30 GMT+08:00 Huan Zhang <huan.zhang...@gmail.com>:
> Hi Nick,
> A DB's IO pattern depends on its config; take mysql for example.
> With innodb_flush_log_at_trx_commit = 1, mysql will sync after each 
> transaction, like:
> write
> sync
> write
> sync
> ...
> innodb_flush_log_at_trx_commit = 5,
> write
> write
> write
> write
> write
> sync
> innodb_flush_log_at_trx_commit = 0,
> write
> write
> ...
> one second later.
> sync.
> This may not be very accurate, but more or less.
> We tested mysql tps with innodb_flush_log_at_trx_commit = 1 and got very poor 
> performance, even though we can reach very high O_DIRECT randwrite iops with fio.
> 2016-02-26 16:59 GMT+08:00 Nick Fisk <n...@fisk.me.uk>:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Huan Zhang
> > Sent: 26 February 2016 06:50
> > To: Jason Dillaman <dilla...@redhat.com>
> > Cc: josh durgin <josh.dur...@inktank.com>; Nick Fisk <n...@fisk.me.uk>;
> > ceph-users <ceph-us...@ceph.com>
> > Subject: Re: [ceph-users] Guest sync write iops so poor.
> >
> > rbd engine with fsync=1 seems stuck.
> > Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> > 1244d:10h:39m:18s]
> >
> > But fio using /dev/rbd0 with sync=1 direct=1 ioengine=libaio iodepth=64 gets
> > very high iops, ~35K, similar to direct write.
> >
> > I'm confused by that result. IMHO, ceph could just ignore the sync cache
> > command since it always uses sync writes to the journal, right?
> Even if the data is not sync'd to the data storage part of the OSD, the data 
> still has to be written to the journal and this is where the performance 
> limit lies.
> The very nature of SDS means that you are never going to achieve the same 
> latency as you do to a local disk as even if the software side introduced no 
> extra latency, just the network latency will severely limit your sync 
> performance.
> Do you know the IO pattern the DBs generate? I know you can switch most DBs 
> to flush with O_DIRECT instead of sync; this might help in your case.
> Also check out the tech talk from last month about high performance databases 
> on Ceph. The presenter gave the impression that, at least in their case, not 
> every write was a sync IO. So your results could possibly matter less than 
> you think.
> Also please search the lists and past presentations about reducing write 
> latency. There are a few things you can do like disabling logging and some 
> kernel parameters to stop the CPU's entering sleep states/reducing frequency. 
> One thing I witnessed is that if the Ceph cluster is only running at low queue 
> depths, and so only generating low cpu load, all the cores on the CPUs 
> throttle themselves down to their lowest speeds, which really hurts latency.
> >
> > Why do we get such bad sync iops, and how does ceph handle it?
> > I would very much appreciate your reply!
> >
> > 2016-02-25 22:44 GMT+08:00 Jason Dillaman <dilla...@redhat.com>:
> > > 35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't actually
> > > work. Or it's not touching the same object (but I wonder whether write
> > > ordering is preserved at that rate?).
> >
> > The fio rbd engine does not support "sync=1"; however, it should support
> > "fsync=1" to accomplish roughly the same effect.
> >
> > Jason
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
