A userspace application should issue fsync or fdatasync calls where appropriate.
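
For illustration, here is a minimal sketch in C of that advice for a
buffered file write. The path is just a placeholder and error handling is
abbreviated:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "some payload";

        /* Placeholder path; the file is opened without O_DIRECT/O_SYNC,
         * so write() only reaches the page cache. */
        int fd = open("/mnt/rbd0/data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, strlen(buf)) < 0) { perror("write"); return 1; }

        /* Force the dirty data to stable storage; only after fdatasync()
         * returns is the write durable across a crash. */
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

        close(fd);
        return 0;
    }

Without the fdatasync (or fsync, O_SYNC, O_DIRECT), the data is only
guaranteed to be in the page cache when write() returns.
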
On Wed, May 24, 2017 at 10:15 PM, 许雪寒 <xuxue...@360.cn> wrote:
> Thanks for your reply :-)
>
> I've got your point. By the way, suppose an application opens a file
> WITHOUT O_DIRECT or O_SYNC and then sequentially issues two overlapping
> glibc write calls against the underlying file system. As far as I
> understand the Linux file system, those writes might not yet have been
> written to the disk when the "write" call returns, so how does the file
> system ensure that the result of those two writes is as expected? Does it
> merge the two operations, or synchronously issue them to the disk? If the
> latter, does the file system insert some other operation, such as an I/O
> barrier, between those two writes so that the underlying storage system is
> aware of the situation?
>
> -----Original Message-----
> From: Jason Dillaman [mailto:jdill...@redhat.com]
> Sent: May 24, 2017 23:05
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How does rbd preserve the consistency of WRITE requests that span across multiple objects?
>
> Just like a regular block device, re-orders are permitted between write
> barriers/flushes. For example, if I had an HDD with 512-byte sectors and I
> attempted to write 4K, there is no guarantee what the disk will look like
> if you had a crash mid-write or if you concurrently issued an overlapping
> write. The correct way your application should behave (regardless of using
> RBD or HDDs or SSDs) would be to wait for the first write to complete
> before issuing the overlapping write.
>
> On Tue, May 23, 2017 at 11:29 PM, 许雪寒 <xuxue...@360.cn> wrote:
>> Hi, thanks for the explanation :-)
>>
>> On the other hand, I wonder if the following scenario could happen:
>>
>> A program in a virtual machine that uses "libaio" to access a file
>> continuously submits "write" requests to the underlying file system,
>> which translates them into rbd requests. Say an rbd "aio_write" X wants
>> to write to an area that spans objects A and B. According to my
>> understanding of the rbd source code, librbd would split this write
>> request into two rados ops, each corresponding to a single object. After
>> these two rados ops have been sent to the OSDs and before they are
>> finished, another rbd "aio_write" request Y, which wants to write to the
>> same area as the previous one, arrives and is sent to the OSDs in the
>> same way as X. Due to the possible reorder, it is possible that Y.B is
>> done before X.B while Y.A is done after X.A, which could lead to an
>> unexpected result.
>>
>> Is this possible?
>>
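
As a rough illustration of the "wait for the first write to complete before
issuing the overlapping write" rule above, here is a minimal sketch against
the librbd C API. It assumes an image that was already opened elsewhere
(rados_connect()/rbd_open()); the function name, offset and buffers are made
up for the example, and error handling is kept short:

    #include <rbd/librbd.h>
    #include <stdint.h>

    /* Serialize two overlapping writes X and Y to an already-open image.
     * Waiting for X's completion before submitting Y prevents the two
     * object-level halves from being reordered against each other. */
    static int write_x_then_y(rbd_image_t image, uint64_t off, size_t len,
                              const char *x_buf, const char *y_buf)
    {
        rbd_completion_t comp;
        ssize_t rv;
        int r;

        /* Write X covering [off, off + len), possibly spanning two objects. */
        r = rbd_aio_create_completion(NULL, NULL, &comp);
        if (r < 0)
            return r;
        r = rbd_aio_write(image, off, len, x_buf, comp);
        if (r < 0) {
            rbd_aio_release(comp);
            return r;
        }
        rbd_aio_wait_for_complete(comp);   /* X is fully acknowledged */
        rv = rbd_aio_get_return_value(comp);
        rbd_aio_release(comp);
        if (rv < 0)
            return (int)rv;

        /* Only now submit the overlapping write Y. */
        r = rbd_aio_create_completion(NULL, NULL, &comp);
        if (r < 0)
            return r;
        r = rbd_aio_write(image, off, len, y_buf, comp);
        if (r < 0) {
            rbd_aio_release(comp);
            return r;
        }
        rbd_aio_wait_for_complete(comp);
        rv = rbd_aio_get_return_value(comp);
        rbd_aio_release(comp);
        return rv < 0 ? (int)rv : 0;
    }

Waiting on X's completion is what rules out the "Y.B before X.B while Y.A
after X.A" interleaving described above.
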
>> Date: Fri, 10 Mar 2017 19:27:00 +0000
>> From: Gregory Farnum <gfar...@redhat.com>
>> To: Wei Jin <wjin...@gmail.com>, "ceph-users@lists.ceph.com"
>> <ceph-users@lists.ceph.com>, 许雪寒 <xuxue...@360.cn>
>> Subject: Re: [ceph-users] How does ceph preserve read/write consistency?
>>
>> On Thu, Mar 9, 2017 at 7:20 PM 许雪寒 <xuxue...@360.cn> wrote:
>>
>>> Thanks for your reply.
>>>
>>> As the log shows, in our test, a READ that came after a WRITE finished
>>> before that WRITE did.
>>
>> This is where you've gone astray. Any storage system is perfectly free
>> to reorder simultaneous requests -- defined as those whose submit-reply
>> time overlaps. So you submitted write W, then submitted read R, then got
>> a response to R before W. That's allowed, and preventing it is actually
>> impossible in general. In the specific case you've outlined, we *could*
>> try to prevent it, but doing so is pretty ludicrously expensive and,
>> since the "reorder" can happen anyway, doesn't provide any benefit. So
>> we don't try. :)
>>
>> That said, obviously we *do* provide strict ordering across write
>> boundaries: a read submitted after a write completed will always see
>> the results of that write.
>> -Greg
>>
>>> And I read the source code, and it seems that, for writes, the thread
>>> in OSD_op_tp, in the ReplicatedPG::do_op method, calls
>>> ReplicatedPG::get_rw_lock, which tries to acquire RWState::RWWRITE. If
>>> that fails, the op is put into the obc->rwstate.waiters queue and
>>> requeued when the repop finishes; however, the OSD_op_tp thread doesn't
>>> wait for the repop and moves on to the next op. Can this be the cause?
>>>
>>> -----Original Message-----
>>> From: Wei Jin [mailto:wjin...@gmail.com]
>>> Sent: March 9, 2017 21:52
>>> To: 许雪寒
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] How does ceph preserve read/write consistency?
>>>
>>> On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒 <xuxue...@360.cn> wrote:
>>> > Hi, everyone.
>>> >
>>> > As shown above, the WRITE req with tid 1312595 arrived at
>>> > 18:58:27.439107 and the READ req with tid 6476 arrived at
>>> > 18:59:55.030936; however, the latter finished at 19:00:20.333389
>>> > while the former finished its commit at 19:00:20.335061 and its
>>> > filestore write at 19:00:25.202321. And in these logs, we found that
>>> > between the start and finish of each req there were a lot of
>>> > "dequeue_op" events for that req. We read the source code, and it
>>> > seems that this is due to "RWState"; is that correct?
>>> >
>>> > Also, it seems that the OSD doesn't distinguish reqs from different
>>> > clients, so is it possible that io reqs from the same client also
>>> > finish in a different order than the one they were created in? Could
>>> > this affect read/write consistency? For instance, could a read fail
>>> > to see the data that was written by the same client just before it?
>>> >
>>>
>>> IMO, it doesn't make sense for rados to distinguish reqs from different
>>> clients. Clients or users should do that themselves.
>>>
>>> However, as for one specific client, ceph can and must guarantee the
>>> request order:
>>>
>>> 1) The ceph messenger (network layer) has in_seq and out_seq when
>>> receiving and sending messages.
>>>
>>> 2) Messages will be dispatched or fast-dispatched and then queued in
>>> ShardedOpWq in order.
>>>
>>> If requests belong to different pgs, they may be processed
>>> concurrently; that's ok.
>>>
>>> If requests belong to the same pg, they will be queued in the same
>>> shard and processed in order due to the pg lock (both read and write).
>>> For continuous writes, ops will be queued in the ObjectStore in order
>>> due to the pg lock, and the ObjectStore has an OpSequencer to guarantee
>>> the order when applying ops to the page cache; that's ok.
>>>
>>> With regard to 'read after write' on the same object, ceph must
>>> guarantee that the read sees the written content. That's done by
>>> ondisk_read/write_lock in ObjectContext.
>>>
>>> > We are testing hammer version, 0.94.5. Please help us, thank you :-)
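
To make the client-visible guarantee concrete ("a read submitted after a
write completed will always see the results of that write"), here is a small
sketch with the librados C API. The ioctx is assumed to have been set up
elsewhere, and the function and object names are only examples:

    #include <rados/librados.h>
    #include <assert.h>
    #include <string.h>

    /* Write an object, wait for the write to be acknowledged, then read it
     * back. Because the read is submitted only after the write completed,
     * it is ordered after the write and observes the new contents. */
    static int write_then_read(rados_ioctx_t ioctx)
    {
        const char *oid = "example-object";   /* illustrative object name */
        const char data[] = "hello, consistency";
        char readback[sizeof(data)] = {0};
        rados_completion_t comp;
        int r;

        r = rados_aio_create_completion(NULL, NULL, NULL, &comp);
        if (r < 0)
            return r;
        r = rados_aio_write(ioctx, oid, comp, data, sizeof(data), 0);
        if (r < 0) {
            rados_aio_release(comp);
            return r;
        }
        rados_aio_wait_for_complete(comp);    /* the write boundary */
        r = rados_aio_get_return_value(comp);
        rados_aio_release(comp);
        if (r < 0)
            return r;

        /* This read starts after the write completed, so it must see it. */
        r = rados_read(ioctx, oid, readback, sizeof(readback), 0);
        if (r < 0)
            return r;
        assert(memcmp(readback, data, sizeof(data)) == 0);
        return 0;
    }

This is the write boundary Greg refers to above; overlapping (in-flight)
requests, by contrast, may still complete in either order.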

> --
> Jason

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com