Hi Dan,

I can try to find the thread and the link again. I should mention that my inbox is a mess and the search function in the Outlook 365 app is, well, don't mention the war. Is there a "list by thread" option on lists.ceph.io? I can go through two years of threads, but not through every single message.

> ceph could disable the write cache itself

I thought newer versions were doing that already, but it looks like only a udev rule is recommended: https://github.com/ceph/ceph/pull/43848/files. I think the write cache issue is mostly relevant for consumer-grade or low-end datacenter hardware, which needs the volatile cache to simulate performance with cheap components. I have never seen an enterprise SAS drive with write cache enabled.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
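
P.S. From memory, the kind of udev rule and commands I mean look roughly like the sketch below. This is untested as written here, sdX is a placeholder and the exact rule is in the PR, so please check the PR and the man pages rather than copying anything verbatim:

# udev rule, e.g. /etc/udev/rules.d/99-write-through.rules:
# put SCSI/SATA disks into write-through mode as they appear
ACTION=="add|change", SUBSYSTEM=="scsi_disk", ATTR{cache_type}="write through"

# or per drive, by hand:
hdparm -W 0 /dev/sdX                   # SATA: switch off the volatile write cache
sdparm --clear WCE --save /dev/sdX     # SAS/SCSI: clear WCE in the caching mode page
smartctl -s wcache,off /dev/sdX        # alternative, if your smartmontools is recent enough
cat /sys/block/sdX/queue/write_cache   # what the kernel thinks the cache mode is

P.P.S. Regarding the power-loss testing I recommend in my quoted mail from Tuesday below: a crude way to generate a verifiable load is to stream sequence-numbered 4 KiB records to a scratch drive with O_DIRECT+O_DSYNC and cut power at random moments. Each record is only issued after the previous one was acknowledged as durable, so any acknowledged record that is missing after reboot means the drive lied about the flush. Again only a sketch, under the assumption that /dev/sdX is a drive you can wipe and that GNU dd is available:

# writer: run this over ssh from another box and leave the output visible there,
# so the last acknowledged sequence number survives the power cut
i=0
while :; do
  printf '%-4096d' "$i" |
    dd of=/dev/sdX bs=4096 count=1 seek="$i" iflag=fullblock oflag=direct,dsync status=none
  echo "$i"    # printed only after the drive acknowledged a durable write
  i=$((i+1))
done

# after power returns: find the last record that actually made it to the media;
# it must not be lower than the last number shown in the remote ssh session
i=0
while dd if=/dev/sdX bs=4096 count=1 skip="$i" status=none | grep -q "^$i "; do
  i=$((i+1))
done
echo "last intact record: $((i-1))"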

________________________________________
From: Dan van der Ster <d...@vanderster.com>
Sent: 01 December 2021 11:28:03
To: Frank Schilder
Cc: huxia...@horebdata.cn; YiteGu; ceph-users
Subject: Re: [ceph-users] Re: Rocksdb: Corruption: missing start of fragmented record(1)

Hi Frank,

I'd be interested to read that paper, if you can find it again.

I don't understand why the volatile cache + fsync might be dangerous due to a buggy firmware, yet we should trust that the firmware respects FUA when the volatile cache is disabled.

In https://github.com/ceph/ceph/pull/43848 we're documenting the implications of WCE -- but in the context of performance, not safety. If write through / volatile cache off is required for safety too, then we should take a different approach (e.g. ceph could disable the write cache itself).

Cheers, dan

On Tue, Nov 30, 2021 at 9:36 AM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan.
>
> > ...however it is not unsafe to leave the cache enabled -- ceph uses
> > fsync appropriately to make the writes durable.
>
> Actually, it is. You rely on the drive's firmware to implement this correctly,
> and that is, unfortunately, less than a given. Within the last one to two years
> somebody posted a link to a very interesting research paper on this list, where
> drives were tested under real conditions. It turns out that "fsync to make
> writes persistent" is very vulnerable to power loss if the volatile write cache
> is enabled. If I remember correctly, about 1-2% of drives ended up with data
> loss every time. In other words, for every drive with the volatile write cache
> enabled, every 100 power loss events will give you 1-2 data loss events (in
> certain situations the drive replies with an ack before the volatile cache is
> actually flushed). I think even PLP did not prevent data loss in all cases.
>
> It's all down to bugs in firmware that fail to catch all corner cases and
> internal race conditions in ops scheduling. Vendors will very often prioritise
> performance over fixing a rare race condition, and I will not take, nor
> recommend taking, chances.
>
> I think this kind of advice should really not be given in a ceph context
> without also referring to the prerequisite: perfect firmware. Ceph is a
> scale-out system and any large cluster will have enough drives to see
> low-probability events on a regular basis. At the very least, recommend
> testing this thoroughly, that is, performing power-loss tests under load,
> and I mean many power loss events per drive with randomised intervals under
> different load patterns.
>
> The same applies to disk controllers with cache. Nobody recommends using the
> controller cache, because of firmware bugs that seem to be present in all
> models. We have seen enough cases on this list of data loss after power loss
> where the controller cache was the issue. The recommendation is to enable HBA
> mode and write-through. Do the same with your disk firmware and get better
> sleep and better performance in one go.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <d...@vanderster.com>
> Sent: 29 November 2021 09:24:29
> To: Frank Schilder
> Cc: huxia...@horebdata.cn; YiteGu; ceph-users
> Subject: Re: [ceph-users] Re: Rocksdb: Corruption: missing start of fragmented record(1)
>
> Hi Frank,
>
> That's true from the performance perspective, however it is not unsafe
> to leave the cache enabled -- ceph uses fsync appropriately to make
> the writes durable.
>
> This issue looks rather to be related to concurrent hardware failure.
>
> Cheers, Dan
>
> On Mon, Nov 29, 2021 at 9:21 AM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > This may sound counter-intuitive, but you need to disable the write cache
> > to enable the PLP cache only. SSDs with PLP usually have two types of cache,
> > volatile and non-volatile. The volatile cache will experience data loss on
> > power loss. It is the volatile cache that gets disabled when issuing the
> > hd-/sdparm/smartctl command to switch it off. In many cases this can
> > increase the non-volatile cache and also performance.
> >
> > It is the non-volatile cache you want your writes to go to directly.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: huxia...@horebdata.cn <huxia...@horebdata.cn>
> > Sent: 26 November 2021 22:41:10
> > To: YiteGu; ceph-users
> > Subject: [ceph-users] Re: Rocksdb: Corruption: missing start of fragmented record(1)
> >
> > wal/db are on Intel S4610 960GB SSDs, with PLP and write back on
> >
> > huxia...@horebdata.cn
> >
> > From: YiteGu
> > Date: 2021-11-26 11:32
> > To: huxia...@horebdata.cn; ceph-users
> > Subject: Re:[ceph-users] Rocksdb: Corruption: missing start of fragmented record(1)
> >
> > It looks like your wal/db device lost data. Please check whether your wal/db
> > device has a write-back cache; a power loss can then cause data loss, and
> > replaying the log fails when rocksdb restarts.
> >
> > YiteGu
> > ess_...@qq.com
> >
> > ------------------ Original ------------------
> > From: "huxia...@horebdata.cn" <huxia...@horebdata.cn>
> > Date: Fri, Nov 26, 2021 06:02 PM
> > To: "ceph-users" <ceph-users@ceph.io>
> > Subject: [ceph-users] Rocksdb: Corruption: missing start of fragmented record(1)
> >
> > Dear Cephers,
> >
> > I just had one Ceph OSD node (Luminous 12.2.13) lose power unexpectedly, and
> > after restarting that node, two OSDs out of 10 cannot be started, issuing the
> > following errors (see below image); in particular, I see
> >
> > Rocksdb: Corruption: missing start of fragmented record(1)
> > Bluestore(/var/lib/ceph/osd/osd-21) _open_db erroring opening db:
> > ...
> > **ERROR: OSD init failed: (5) Input/output error
> >
> > I checked the db/wal SSDs, and they are working fine. So I am wondering the
> > following:
> > 1) Is there a method to restore the OSDs?
> > 2) What could be the potential causes of the corrupted db/wal? The db/wal
> > SSDs have PLP and were not damaged during the power loss.
> >
> > Your help would be highly appreciated.
> >
> > best regards,
> >
> > samuel
> >
> > huxia...@horebdata.cn

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io