On 24/04/2025 13:57, 段世博 wrote:
We are currently using the Ceph Octopus version and have some questions
regarding a specific commit in librbd (
https://github.com/ceph/ceph/commit/081d28ae7ca46fd1f40034cc558def77a95a9294
).

Could you please clarify why Ceph implements ordering for overlapping IO?
Our understanding is that overlapping IO in block storage can typically be
parallelized. We are curious about which components of Ceph depend on this
feature. Additionally, we noticed that this feature was removed in the
subsequent Pacific version. Could you provide some insight into the reasons
behind this decision?

Thank you for your assistance.

Best regards,

shibo




1) Generally, and not specific to Ceph:
i) If a client issues two overlapping write I/Os, A then B, and both return, the client knows both were written to storage, but it cannot assume A was written before B: since the two were in flight concurrently, no ordering can be assumed between them. ii) If, after A returns, the client issues I/O C and it returns, the client can assume C was written after A.
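The two rules can be illustrated with a toy sketch (plain Python, not Ceph code): a stand-in "device" applies each write atomically, two overlapping in-flight writes may land in either order, and a write issued only after another has completed is ordered after it.

```python
# Toy illustration of rules (i) and (ii): not Ceph code.
import threading

disk = bytearray(8)        # stand-in for a block device
lock = threading.Lock()    # the "device" applies each write atomically

def write(offset, data):
    with lock:
        disk[offset:offset + len(data)] = data

# (i) A and B overlap and are in flight together: no order guarantee.
a = threading.Thread(target=write, args=(0, b"AAAA"))
b = threading.Thread(target=write, args=(0, b"BBBB"))
a.start(); b.start(); a.join(); b.join()
assert bytes(disk[:4]) in (b"AAAA", b"BBBB")   # either outcome is legal

# (ii) C is issued only after A and B completed, so it lands after them.
write(0, b"CCCC")
assert bytes(disk[:4]) == b"CCCC"
```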

2) The above holds only if no caching/buffering is involved in the middle. Caches can sit on the client (OS/page cache, the librbd cache) or on the storage server (controller/hardware cache, drive cache, ...). Caching gives better performance, but in case of a crash the data in the cache is lost, and a cache does not guarantee the same write order. Client applications can ask to bypass the OS cache (direct flag) or the server cache (sync flag). Clients that do enable caching can issue a "cache flush" at specific points to guarantee when data is persisted and to keep control over persistence order.
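As a concrete sketch of those flags against a local file (O_DIRECT is omitted because it additionally requires aligned buffers, which plain Python does not provide easily): O_SYNC makes each write return only once it reaches stable storage, while a buffered write is only guaranteed durable at the explicit fsync() barrier.

```python
# Toy illustration of sync writes vs. buffered writes + explicit flush.
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "blk")

# sync-style write: returns only after data reaches stable storage
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
os.write(fd, b"payload")
os.close(fd)

# buffered write plus explicit flush: durability and order guaranteed
# only at the fsync() point, not at each individual write()
fd = os.open(path, os.O_WRONLY)
os.write(fd, b"buffered")
os.fsync(fd)               # the "cache flush" barrier
os.close(fd)
```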

3) Ceph OSDs do not cache writes: there is no server-side cache, so all I/O is effectively treated as sync (sync flag).
If you write a client application that uses librados, the rules in 1) apply.
If you use kernel-mapped rbd devices and write with the direct flag to bypass the page cache, the rules in 1) apply.
If you use librbd with the librbd cache disabled, the rules in 1) apply.

4) Regarding parallel behavior: concurrent I/O executes in parallel across the many OSDs as well as within a single OSD. Within an OSD, however, writes to the same pg are serialized; this is an implementation detail, mainly to ensure the integrity of the pg metadata structures, and is not related to overlapped-I/O ordering.
So yes, overlapped I/O to block storage will execute in parallel.
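A much-simplified sketch of that mapping (the real thing uses the rjenkins string hash, stable_mod, and CRUSH for placement; the hash and pg_num below are placeholders): each object name hashes to a pg, writes within one pg serialize, and writes to different pgs can run in parallel.

```python
# Simplified object -> pg mapping sketch; NOT Ceph's real hashing.
import zlib

PG_NUM = 8  # hypothetical pool pg_num

def pg_of(object_name: str) -> int:
    # placeholder hash; Ceph actually uses ceph_str_hash_rjenkins
    return zlib.crc32(object_name.encode()) % PG_NUM

# Writes to objects that map to the same pg serialize inside the OSD;
# writes to objects in different pgs may proceed concurrently.
pg_a = pg_of("rbd_data.1.0000000000000000")
pg_b = pg_of("rbd_data.1.0000000000000001")
```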

5) I had a quick look at the changes/patches you mention; they do not alter the above. From what I understood, it relates to using librbd with the librbd cache enabled: it enhances/fixes the behavior of cache flushes, perhaps by delaying execution of a flush while a previous I/O or a previous flush is still in flight and not yet completed, so that after a crash the image cannot contain the newer writes without the older ones.
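My reading of that behavior, as a minimal sketch (this is my guess at the semantics, not librbd's actual code): a flush callback is held back until every write that was in flight when the flush arrived has completed, so the flush cannot "overtake" earlier I/O.

```python
# Sketch of "delay a flush while earlier i/o is in flight" (hypothetical
# FlushGate class, not taken from librbd).
import threading

class FlushGate:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = 0
        self._pending_flushes = []

    def write_start(self):
        with self._lock:
            self._inflight += 1

    def write_finish(self):
        with self._lock:
            self._inflight -= 1
            ready = [] if self._inflight else self._pending_flushes
            if not self._inflight:
                self._pending_flushes = []
        for cb in ready:       # run delayed flushes outside the lock
            cb()

    def flush(self, callback):
        with self._lock:
            if self._inflight:
                self._pending_flushes.append(callback)  # delay the flush
                return
        callback()                                      # nothing in flight

done = []
g = FlushGate()
g.write_start()
g.flush(lambda: done.append("flushed"))  # delayed: a write is in flight
assert done == []
g.write_finish()                         # last write completes -> flush runs
assert done == ["flushed"]
```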

/maged

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io