Hi Igor,

I let Ceph rebuild OSD.7. Then I added
```
[osd]
debug bluefs = 20
debug bdev = 20
debug bluestore = 20
```
to the ceph.conf of all 3 nodes and shut down all 3 nodes without writing anything to the pools on the HDDs (the Debian VM was not even running). Immediately on the first boot, OSD.5 and OSD.6 crashed with the same “Bad table magic number” error. OSDs 5 and 6 are on the same node, but not on the node of OSD.7, which crashed the last two times.
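In case it is useful: to rule out that the conf change was simply not picked up, I checked the active values on a running OSD via the admin socket (osd.5 here is only an example, the same works for any OSD id); each should come back as 20 (or 20/20) after the restart:
```
$ ceph daemon osd.5 config get debug_bluefs
$ ceph daemon osd.5 config get debug_bdev
$ ceph daemon osd.5 config get debug_bluestore
```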
Logs and corrupted RocksDB files: https://we.tl/t-ZBXYp8r4Hq

I have saved the entire /var/log directory of every node and the result of
```
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-5 --out-dir /tmp/osd.5-data
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-6 --out-dir /tmp/osd.6-data
```
Let me know if you need anything else. I hope you can now track it down. I'm really looking forward to your interpretation of the logs.

Best Regards,
Sebastian


> On 22.02.2022, at 11:44, Igor Fedotov <igor.fedo...@croit.io> wrote:
> 
> Hi Sebastian,
> 
> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
>> Hey Igor!
>> 
>>> thanks a lot for the new logs - looks like they provide some insight.
>> I'm glad the logs are helpful.
>> 
>>> At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
>> 
>> Do I understand that correctly: the corruption of the RocksDB (table overwritten by zeros) happens at the first start of the OSD after “*** Immediate shutdown (osd_fast_shutdown=true) ***”? Before the system launches the OSD service, the RocksDB is still fine?
> Looks like that. From the logs I can see an unexpected write to a specific extent (LBA 0x630000) which shouldn't occur and at which RocksDB subsequently fails.
>> 
>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like a disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect an OSD log for such an operation (followed by an OSD restart) with debug-bluestore set to 20?
>> To the best of my knowledge the OSD was hardly doing anything, and I do not see any pattern that would fit your explanation. However, you certainly understand a lot more about it than I do, so I try to explain everything that could be relevant.
>> 
>> The cluster has 3 nodes. Each has a 240GB NVMe M.2 SSD as boot drive, which should not be relevant. Each node has 3 OSDs: one is on a 2TB U.2 NVMe SSD and the other two are on 12TB HDDs.
>> 
>> I have configured two crush rules, ‘c3nvme’ and ‘ec4x2hdd’. The ‘c3nvme’ rule is a replicated rule that uses only OSDs with class ‘nvme’. The second rule is a tricky erasure rule: it selects exactly 2 OSDs on exactly 4 hosts with class ‘hdd’, so it only works for a size of exactly 8. That means a pool that uses this rule always has only “undersized” placement groups, since the cluster has only 3 nodes. (I did not add the fourth server after the first crash in December, since we want to reproduce the problem.)
>> 
>> The pools device_health_metrics, test-pool, fs.metadata-root-pool, fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use the crush rule ‘c3nvme’ with a size of 3 and a min size of 2. The pools fs.data-hdd.ec-pool and block-hdd.ec-pool use the crush rule ‘ec4x2hdd’ with k=5, m=3 and a min size of 6.
>> 
>> The pool fs.data-nvme.c-pool is not used and the pool test-pool was used for rados bench a few months ago.
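(A note on the exported directories above, in case it saves you time: a quick look at the .sst footers should show the problem directly. This is just a sketch — it assumes the export keeps the usual db/ layout, and that, if I read the RocksDB format correctly, an intact block-based table ends with the table magic number, 0x88e241b785f4cff7 stored little-endian, while the corrupted 2KB file should print only zeros:)
```
$ for f in /tmp/osd.5-data/db/*.sst; do printf '%s: ' "$f"; tail -c 8 "$f" | xxd -p; done
$ for f in /tmp/osd.6-data/db/*.sst; do printf '%s: ' "$f"; tail -c 8 "$f" | xxd -p; done
```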
>> 
>> The pool fs.metadata-root-pool is used as the metadata pool for CephFS and fs.data-root-pool as the root data pool for the CephFS. The pool fs.data-hdd.ec-pool is an additional data pool for the CephFS and is specified as ceph.dir.layout for some folders of the CephFS. The whole CephFS is mounted by each of the 3 nodes.
>> 
>> The pool block-nvme.c-pool hosts two RBD images that are used as boot drives for two VMs. The first VM runs Ubuntu Desktop and the second Debian as OS. The pool block-hdd.ec-pool hosts one RBD image (the data part, metadata on block-nvme.c-pool) that is attached to the Debian VM as a second drive formatted with BTRFS. Furthermore, the Debian VM mounts a subdirectory of the CephFS that has the fs.data-hdd.ec-pool set as layout. Both VMs were doing nothing, apart from being booted, in the last couple of days.
>> 
>> I try to illustrate the pool usage as a tree:
>> * c3nvme (replicated, size=3, min_size=2)
>>   + device_health_metrics
>>   + test-pool
>>     - rados bench
>>   + fs.metadata-root-pool
>>     - CephFS (metadata)
>>   + fs.data-root-pool
>>     - CephFS (data root)
>>   + fs.data-nvme.c-pool
>>   + block-nvme.c-pool
>>     - RBD (Ubuntu VM, boot disk with ext4)
>>     - RBD (Debian VM, boot disk with ext4)
>> * ec4x2hdd (ec, k=5, m=3, min_size=6)
>>   + fs.data-hdd.ec-pool
>>     - CephFS (layout data)
>>   + block-hdd.ec-pool
>>     - RBD (Debian VM, disk 2 with BTRFS)
>> 
>> Last week I wrote about 30TB to the CephFS inside the fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the block-hdd.ec-pool. The 50GB written to BTRFS contains hundreds of thousands of hard links. There was already nearly the same data on the storage and I used rsync to update it. I think something between 1 and 4 TB had changed and was updated by rsync.
>> 
>> I think the cluster was totally unused on Friday, but up and running and idling around. Then on Saturday I did a graceful shutdown of all cluster nodes. Around 5 minutes later, when I booted the servers again, OSD.7 crashed. I copied the logs and exported the RocksDB. Then I deleted everything from the HDD and deployed OSD.7 again. When I checked for the first time today at around 12:00, Ceph had already finished backfilling to OSD.7 and the cluster was idle again.
>> 
>> I then spent 70 minutes writing 3 small files (one of about 500 bytes and two of about 2KB) to the CephFS on the fs.data-hdd.ec-pool and shutting down the whole cluster. After I booted all 3 nodes again I overwrote the 3 small files (some bytes had changed) and shut down all 3 nodes. I repeated that procedure 4 times within 70 minutes. Then OSD.7 crashed again.
>> 
>> There may have been some more IO caused by mounting the RBD image with the BTRFS volume inside the Debian VM, but I do not know. I can only say that no user did use it. Only I was writing the 3 small files 4 times to the CephFS on the fs.data-hdd.ec-pool.
>> 
>> Does this make any sense to you?
> Well, that's actually not a write pattern I would expect to cause the issue. But thanks a lot anyway. So could you please repeat that procedure with overwriting 3 small files and OSD restart and collect the logs with debug-bluestore set to 20.
>> 
>>> with debug-bluestore set to 20
>> Do you mean
>> ```
>> [osd]
>> debug bluestore = 20
>> ```
>> in ceph.conf?
> yes.
> Or you might use 'ceph config set osd.7 debug-bluestore 20' instead.
> 
>> Thanks,
>> Sebastian
>> 
>>> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>> 
>>> Hey Sebastian,
>>> 
>>> thanks a lot for the new logs - looks like they provide some insight. At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
>>> 
>>> And now I'd like to determine what's wrong with this deferred write replay.
>>> 
>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like a disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect an OSD log for such an operation (followed by an OSD restart) with debug-bluestore set to 20?
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
>>>> Hi Igor,
>>>> 
>>>> today (21-02-2022) at 13:49:28.452+0100, I crashed OSD.7 again. And this time I have logs with “debug bluefs = 20” and “debug bdev = 20” for every OSD in the cluster! It was the OSD with the ID 7 again. So the HDD has now failed for the third time! Coincidence? Probably not…
>>>> The important thing seems to be that a shutdown, and not only a restart, of the entire cluster is performed, since this time the OSD failed after just 4 shutdowns of all nodes in the cluster within 70 minutes.
>>>> 
>>>> I redeployed OSD.7 after the crash from 2 days ago, and I started this new shutdown and boot series shortly after Ceph had finished writing everything back to OSD.7, earlier today.
>>>> 
>>>> The corrupted RocksDB file (crash) is again only 2KB in size.
>>>> You can download the RocksDB file with the bad table magic number and the log of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
>>>> What else do you need?
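(Regarding the 'ceph config set osd.7 debug-bluestore 20' variant you mention above: I assume the same pattern applies to the other two subsystems, so for the next run I would set and verify all three like this — the underscore spelling of the option names is my guess at the CLI form:)
```
$ ceph config set osd.7 debug_bluestore 20
$ ceph config set osd.7 debug_bluefs 20
$ ceph config set osd.7 debug_bdev 20
$ ceph config get osd.7 debug_bluestore
```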
>>>> 
>>>> From the log of the OSD.7:
>>>> —————
>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 set uid:gid to 64045:64045 (ceph:ceph)
>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 pidfile_write: ignore empty --pid-file
>>>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00 1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>>>> —————
>>>> 
>>>> To me this looks like the OSD did nothing for nearly 2 minutes before it received the termination request. Shouldn't this be enough time for flushing every imaginable write cache?
>>>> 
>>>> I hope this helps you.
>>>> 
>>>> Best wishes,
>>>> Sebastian
>>>> 
>>> -- 
>>> Igor Fedotov
>>> Ceph Lead Developer
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>> 
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> 
> -- 
> Igor Fedotov
> Ceph Lead Developer
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io