Hi all,

On 01.08.19 at 08:45, Janne Johansson wrote:
On Thu, 1 Aug 2019 at 07:31, Muhammad Junaid <junaid.fsd...@gmail.com> wrote:

    Your email has clarified many things for me. Let me repeat my understanding: all critical writes (from Oracle or any other DB, for example) are done with sync/fsync flags, meaning they are only confirmed to the DB/application after they have actually been written to the hard drives/OSDs. Any other application can do the same.
    All other writes, such as OS logs, are confirmed to the application/user immediately but written out later, passing through the kernel, the RBD cache and the physical drive cache (if any) before reaching the disks. These are susceptible to loss on power failure, but overall things are recoverable/non-critical.

That last part is probably a bit simplified. Between a program in a guest sending its data to the virtualised device, running under KVM on top of an OS that has remote storage over the network, on to a storage server with its own OS and drive controller chip, and finally the physical drive(s) that store the write, I suspect there are something like ~10 layers where write caching is possible, of which the RBD cache you were asking about is just one.

It is just located very conveniently before the I/O has to leave the KVM host and go back and forth over the network, so it is the last place where you can see huge gains in the guest's I/O response time. At the same time it can be shared between lots of guests on the KVM host, which should have tons of RAM available compared to any single guest, so it is a nice way to get a large cache for outgoing writes.
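To make that concrete, here is roughly what such a writeback cache setup looks like. The option names are the standard librbd ones, but the sizes are illustrative assumptions, not recommendations:

    # ceph.conf on the KVM host - [client] section, read by librbd
    [client]
    # enable the librbd writeback cache
    rbd cache = true
    # cache size per image, 64 MiB here (illustrative)
    rbd cache size = 67108864
    # start flushing once ~48 MiB is dirty (illustrative)
    rbd cache max dirty = 50331648
    # stay in writethrough mode until the guest sends its first flush
    rbd cache writethrough until flush = true

On the QEMU/libvirt side the guest disk then uses cache='writeback' in its <driver> element. The "writethrough until flush" option keeps the cache safe until the guest has proven it actually sends flushes, which protects guests with very old kernels that never do.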

Also, to answer your first part: yes, all critical software that depends heavily on write ordering and integrity is hopefully already doing its write operations that way, using sync(), fsync(), fdatasync() and similar calls, but I can't produce a list of all programs that do. Since there are already many layers of delayed, cached writes even without virtualisation and/or Ceph, applications that matter have mostly learned their lessons by now, so chances are very high that all your important databases and similar programs are doing the right thing.
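As a minimal sketch of what "doing the right thing" looks like at the syscall level (plain POSIX C, with error handling kept short):

    /* durable_write.c - only report success once the data has been
     * pushed through the caching layers to stable storage. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "critical record\n";
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        if (write(fd, buf, sizeof buf - 1) != (ssize_t)(sizeof buf - 1)) {
            perror("write");
            return EXIT_FAILURE;
        }

        /* fsync() asks every layer below (page cache, RBD cache, drive
         * cache) to make the data durable; fdatasync() is the cheaper
         * variant that may skip some metadata. Only after this returns
         * may the application acknowledge the write. */
        if (fsync(fd) != 0) { perror("fsync"); return EXIT_FAILURE; }

        close(fd);
        return EXIT_SUCCESS;
    }

A database does essentially this for its journal/WAL on every commit, which is why those writes survive all the caching layers above.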

Just to add to this: one such piece of software that people have cared about a lot is of course the file system itself. BTRFS is a notably sensitive candidate: due to its rather complicated metadata structure, it suffers badly from broken flush / FUA (https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA)) implementations at any layer of the I/O path.
While for in-kernel and other open source software (such as librbd) there are usually a lot of people checking the code for a correct implementation and testing things, there is also broken hardware (or rather, firmware) in the wild.
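If you suspect a drive in that category, a common (if heavy-handed) mitigation is to turn off its volatile write cache altogether; on Linux that is typically done with hdparm (the device name below is just a placeholder):

    # disable the drive's volatile write cache (placeholder device)
    hdparm -W 0 /dev/sdX
    # and re-enable it again later:
    hdparm -W 1 /dev/sdX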

But there are also pure software issues around, if you think more generally and strive for data correctness (since corruption can happen at any layer): I was hit by an in-kernel issue in the past (a network driver writing network statistics via DMA to the wrong memory location - "sometimes") which corrupted two BTRFS partitions of mine and caused random crashes in browsers and mail clients. BTRFS was hardened only in kernel 5.2 to check the metadata tree before flushing it to disk.

If you are curious about known hardware issues, check out this lengthy but very insightful mail on the linux-btrfs list:
https://lore.kernel.org/linux-btrfs/20190623204523.gc11...@hungrycats.org/
As you can learn there, there are many drive and firmware combinations out there which do not implement flush / FUA correctly, and your BTRFS may end up corrupted after a power failure. The very same thing can happen to Ceph, but replication across several OSDs and the lower probability of having broken disks in all hosts make the issue less likely.
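As a rough back-of-the-envelope illustration (the failure rate is a made-up assumption): if a fraction p of disks in the wild mishandle flush/FUA and the replicas land on independent hosts, a size-3 replicated write is only fully at risk when all three replicas sit on such disks, i.e. with probability p^3 - for p = 5% that is 0.05^3 ≈ 0.0125%, compared to 5% on a single disk.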

For what it is worth, we also use writeback caching for our virtualisation cluster and are very happy with it - we even tried pulling power plugs on hypervisors, MONs and OSDs at random times during writes, and ext4 could always recover easily with an fsck making use of the journal.
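For completeness, that recovery step amounts to something like the following (placeholder device name); e2fsck replays the ext4 journal automatically before checking:

    # replay the journal and fix everything that is safe to fix automatically
    e2fsck -p /dev/vdb1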

Cheers and HTH,
        Oliver


But if the guest is instead running a mail filter that does antivirus checks, spam checks 
and so on, operating on files that live on the machine for something like one second, and 
then either get dropped or sent to the destination mailbox somewhere else, then having 
aggressive write caches would be very useful, since the effects of a crash would still 
mostly mean "the emails that were in the queue were lost, not acked by the final 
mailserver and will probably be resent by the previous smtp server". For such a 
guest VM, forcing sync writes would only be a net loss, it would gain much by having 
large ram write caches.

--
May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

