A userspace application should issue fsync or fdatasync calls where appropriate.
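
For illustration, here is a minimal sketch in C of that advice for a
buffered file write. The path is just a placeholder and error handling is
abbreviated:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "some payload";

        /* Placeholder path; the file is opened without O_DIRECT/O_SYNC,
         * so write() only reaches the page cache. */
        int fd = open("/mnt/rbd0/data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, strlen(buf)) < 0) { perror("write"); return 1; }

        /* Force the dirty data to stable storage; only after fdatasync()
         * returns is the write durable across a crash. */
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

        close(fd);
        return 0;
    }

Without the fdatasync (or fsync, O_SYNC, O_DIRECT), the data is only
guaranteed to be in the page cache when write() returns.
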
On Wed, May 24, 2017 at 10:15 PM, 许雪寒 <xuxue...@360.cn> wrote:
> Thanks for your reply :-)
>
> I've got your point. By the way, suppose an application opens a file
> WITHOUT O_DIRECT or O_SYNC and then sequentially issues two overlapping
> glibc write calls against the underlying file system. As far as I
> understand the Linux file system, those writes might not yet have been
> written to the disk when the "write" call returns, so how does the file
> system ensure that the result of those two writes is as expected? Does it
> merge the two operations, or synchronously issue them to the disk? If the
> latter, does the file system insert some other operation, such as an I/O
> barrier, between those two writes so that the underlying storage system is
> aware of the situation?
>
> -----Original Message-----
> From: Jason Dillaman [mailto:jdill...@redhat.com]
> Sent: May 24, 2017 23:05
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How does rbd preserve the consistency of WRITE requests that span across multiple objects?
>
> Just like a regular block device, re-orders are permitted between write
> barriers/flushes. For example, if I had an HDD with 512-byte sectors and I
> attempted to write 4K, there is no guarantee what the disk will look like
> if you had a crash mid-write or if you concurrently issued an overlapping
> write. The correct way your application should behave (regardless of using
> RBD or HDDs or SSDs) would be to wait for the first write to complete
> before issuing the overlapping write.
>
> On Tue, May 23, 2017 at 11:29 PM, 许雪寒 <xuxue...@360.cn> wrote:
>> Hi, thanks for the explanation :-)
>>
>> On the other hand, I wonder if the following scenario could happen:
>>
>> A program in a virtual machine that uses "libaio" to access a file
>> continuously submits "write" requests to the underlying file system,
>> which translates them into rbd requests. Say an rbd "aio_write" X wants
>> to write to an area that spans objects A and B. According to my
>> understanding of the rbd source code, librbd would split this write
>> request into two rados ops, each corresponding to a single object. After
>> these two rados ops have been sent to the OSDs and before they are
>> finished, another rbd "aio_write" request Y, which wants to write to the
>> same area as the previous one, arrives and is sent to the OSDs in the
>> same way as X. Due to the possible reorder, it is possible that Y.B is
>> done before X.B while Y.A is done after X.A, which could lead to an
>> unexpected result.
>>
>> Is this possible?
>>
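
As a rough illustration of the "wait for the first write to complete before
issuing the overlapping write" rule above, here is a minimal sketch against
the librbd C API. It assumes an image that was already opened elsewhere
(rados_connect()/rbd_open()); the function name, offset and buffers are made
up for the example, and error handling is kept short:

    #include <rbd/librbd.h>
    #include <stdint.h>

    /* Serialize two overlapping writes X and Y to an already-open image.
     * Waiting for X's completion before submitting Y prevents the two
     * object-level halves from being reordered against each other. */
    static int write_x_then_y(rbd_image_t image, uint64_t off, size_t len,
                              const char *x_buf, const char *y_buf)
    {
        rbd_completion_t comp;
        ssize_t rv;
        int r;

        /* Write X covering [off, off + len), possibly spanning two objects. */
        r = rbd_aio_create_completion(NULL, NULL, &comp);
        if (r < 0)
            return r;
        r = rbd_aio_write(image, off, len, x_buf, comp);
        if (r < 0) {
            rbd_aio_release(comp);
            return r;
        }
        rbd_aio_wait_for_complete(comp);   /* X is fully acknowledged */
        rv = rbd_aio_get_return_value(comp);
        rbd_aio_release(comp);
        if (rv < 0)
            return (int)rv;

        /* Only now submit the overlapping write Y. */
        r = rbd_aio_create_completion(NULL, NULL, &comp);
        if (r < 0)
            return r;
        r = rbd_aio_write(image, off, len, y_buf, comp);
        if (r < 0) {
            rbd_aio_release(comp);
            return r;
        }
        rbd_aio_wait_for_complete(comp);
        rv = rbd_aio_get_return_value(comp);
        rbd_aio_release(comp);
        return rv < 0 ? (int)rv : 0;
    }

Waiting on X's completion is what rules out the "Y.B before X.B while Y.A
after X.A" interleaving described above.
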
>> Date: Fri, 10 Mar 2017 19:27:00 +0000
>> From: Gregory Farnum <gfar...@redhat.com>
>> To: Wei Jin <wjin...@gmail.com>, "ceph-users@lists.ceph.com"
>> <ceph-users@lists.ceph.com>, 许雪寒 <xuxue...@360.cn>
>> Subject: Re: [ceph-users] How does ceph preserve read/write consistency?
>>
>> On Thu, Mar 9, 2017 at 7:20 PM 许雪寒 <xuxue...@360.cn> wrote:
>>
>>> Thanks for your reply.
>>>
>>> As the log shows, in our test, a READ that came after a WRITE finished
>>> before that WRITE did.
>>
>> This is where you've gone astray. Any storage system is perfectly free
>> to reorder simultaneous requests -- defined as those whose submit-reply
>> time overlaps. So you submitted write W, then submitted read R, then got
>> a response to R before W. That's allowed, and preventing it is actually
>> impossible in general. In the specific case you've outlined, we *could*
>> try to prevent it, but doing so is pretty ludicrously expensive and,
>> since the "reorder" can happen anyway, doesn't provide any benefit. So
>> we don't try. :)
>>
>> That said, obviously we *do* provide strict ordering across write
>> boundaries: a read submitted after a write completed will always see
>> the results of that write.
>> -Greg
>>
>>> And I read the source code, and it seems that, for writes, the thread
>>> in OSD_op_tp, in the ReplicatedPG::do_op method, calls
>>> ReplicatedPG::get_rw_lock, which tries to acquire RWState::RWWRITE. If
>>> that fails, the op is put into the obc->rwstate.waiters queue and
>>> requeued when the repop finishes; however, the OSD_op_tp thread doesn't
>>> wait for the repop and moves on to the next op. Can this be the cause?
>>>
>>> -----Original Message-----
>>> From: Wei Jin [mailto:wjin...@gmail.com]
>>> Sent: March 9, 2017 21:52
>>> To: 许雪寒
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] How does ceph preserve read/write consistency?
>>>
>>> On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒 <xuxue...@360.cn> wrote:
>>> > Hi, everyone.
>>> >
>>> > As shown above, the WRITE req with tid 1312595 arrived at
>>> > 18:58:27.439107 and the READ req with tid 6476 arrived at
>>> > 18:59:55.030936; however, the latter finished at 19:00:20.333389
>>> > while the former finished its commit at 19:00:20.335061 and its
>>> > filestore write at 19:00:25.202321. And in these logs, we found that
>>> > between the start and finish of each req there were a lot of
>>> > "dequeue_op" events for that req. We read the source code, and it
>>> > seems that this is due to "RWState"; is that correct?
>>> >
>>> > Also, it seems that the OSD doesn't distinguish reqs from different
>>> > clients, so is it possible that io reqs from the same client also
>>> > finish in a different order than the one they were created in? Could
>>> > this affect read/write consistency? For instance, could a read fail
>>> > to see the data that was written by the same client just before it?
>>> >
>>>
>>> IMO, it doesn't make sense for rados to distinguish reqs from different
>>> clients. Clients or users should do that themselves.
>>>
>>> However, as for one specific client, ceph can and must guarantee the
>>> request order:
>>>
>>> 1) The ceph messenger (network layer) has in_seq and out_seq when
>>> receiving and sending messages.
>>>
>>> 2) Messages will be dispatched or fast-dispatched and then queued in
>>> ShardedOpWq in order.
>>>
>>> If requests belong to different pgs, they may be processed
>>> concurrently; that's ok.
>>>
>>> If requests belong to the same pg, they will be queued in the same
>>> shard and processed in order due to the pg lock (both read and write).
>>> For continuous writes, ops will be queued in the ObjectStore in order
>>> due to the pg lock, and the ObjectStore has an OpSequencer to guarantee
>>> the order when applying ops to the page cache; that's ok.
>>>
>>> With regard to 'read after write' on the same object, ceph must
>>> guarantee that the read sees the written content. That's done by
>>> ondisk_read/write_lock in ObjectContext.
>>>
>>> > We are testing hammer version, 0.94.5. Please help us, thank you :-)
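
To make the client-visible guarantee concrete ("a read submitted after a
write completed will always see the results of that write"), here is a small
sketch with the librados C API. The ioctx is assumed to have been set up
elsewhere, and the function and object names are only examples:

    #include <rados/librados.h>
    #include <assert.h>
    #include <string.h>

    /* Write an object, wait for the write to be acknowledged, then read it
     * back. Because the read is submitted only after the write completed,
     * it is ordered after the write and observes the new contents. */
    static int write_then_read(rados_ioctx_t ioctx)
    {
        const char *oid = "example-object";   /* illustrative object name */
        const char data[] = "hello, consistency";
        char readback[sizeof(data)] = {0};
        rados_completion_t comp;
        int r;

        r = rados_aio_create_completion(NULL, NULL, NULL, &comp);
        if (r < 0)
            return r;
        r = rados_aio_write(ioctx, oid, comp, data, sizeof(data), 0);
        if (r < 0) {
            rados_aio_release(comp);
            return r;
        }
        rados_aio_wait_for_complete(comp);    /* the write boundary */
        r = rados_aio_get_return_value(comp);
        rados_aio_release(comp);
        if (r < 0)
            return r;

        /* This read starts after the write completed, so it must see it. */
        r = rados_read(ioctx, oid, readback, sizeof(readback), 0);
        if (r < 0)
            return r;
        assert(memcmp(readback, data, sizeof(data)) == 0);
        return 0;
    }

This is the write boundary Greg refers to above; overlapping (in-flight)
requests, by contrast, may still complete in either order.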

> --
> Jason

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com