Hi Igor,

I let ceph rebuild the OSD.7. Then I added 
```
[osd]
        debug bluefs = 20
        debug bdev = 20
        debug bluestore = 20
```
to the ceph.conf of all 3 nodes and shut down all 3 nodes without writing 
anything to the pools on the HDDs (the Debian VM was not even running).
Immediately at the first boot, OSD.5 and OSD.6 crashed with the same “Bad table 
magic number” error. OSD.5 and OSD.6 are on the same node, but not on the node 
of OSD.7, which crashed the last two times.
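
(For the next run, I assume the same debug levels could also be raised and verified 
at runtime instead of editing ceph.conf, along the lines of the 'ceph config set' 
command you mentioned; please correct me if that is not equivalent:)
```
# assumed runtime equivalent of the ceph.conf entries above
$ ceph config set osd debug_bluefs 20
$ ceph config set osd debug_bdev 20
$ ceph config set osd debug_bluestore 20
# spot-check that one daemon picked up the new level
$ ceph config get osd.5 debug_bluefs
```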

Logs and corrupted RocksDB files: https://we.tl/t-ZBXYp8r4Hq
I have saved the entire /var/log directory of every node and the output of 
```
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-5 --out-dir 
/tmp/osd.5-data
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-6 --out-dir 
/tmp/osd.6-data
```
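
In case it helps, this is roughly how I would look for the zeroed content in the 
exported files (assuming the export recreates a db/ directory; the .sst file name 
below is only a placeholder for whichever table the crashed OSD reports as corrupted):
```
# placeholder file name - use the table named in the "Bad table magic number" error
$ ls -l /tmp/osd.5-data/db/
$ hexdump -C /tmp/osd.5-data/db/012345.sst | tail -n 5
# an intact SST ends with the 8-byte RocksDB table magic in its footer;
# I would expect the corrupted files here to contain nothing but zeros,
# matching the all-zeros content you described
```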
Let me know if you need anything else. 


I hope you can now track it down. I'm really looking forward to your 
interpretation of the logs.


Best Regards,
Sebastian


> On 22.02.2022, at 11:44, Igor Fedotov <igor.fedo...@croit.io> wrote:
> 
> Hi Sebastian,
> 
> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
>> Hey Igor!
>> 
>> 
>>> thanks a lot for the new logs - looks like they provide some insight.
>> I'm glad the logs are helpful.
>> 
>> 
>>> At this point I think the root cause is apparently a race between deferred 
>>> writes replay and some DB maintenance task happening on OSD startup. It 
>>> seems that deferred write replay updates a block extent which 
>>> RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros 
>>> content. Evidently that's just a matter of chance whether they use 
>>> conflicting physical extent or not hence the occasional nature of the 
>>> issue...
>> 
>> Do I understand that correctly: the corruption of the RocksDB (table 
>> overwritten by zeros) happens at the first start of the OSD after “*** 
>> Immediate shutdown (osd_fast_shutdown=true) ***”? Before the system launches 
>> the OSD service, the RocksDB is still fine?
> Looks like that. From the logs I can see an unexpected write to a specific extent 
> (LBA 0x630000) which shouldn't occur and at which RocksDB subsequently fails.
>> 
>> 
>>> So first of all I'm curious if you have any particular write patterns that 
>>> can be culprits? E.g. something like disk wiping procedure which writes 
>>> all-zeros to an object followed by object truncate or removal comes to my 
>>> mind. If you can identify something like that - could you please collect 
>>> OSD log for such an operation (followed by OSD restart) with 
>>> debug-bluestore set to 20?
>> To the best of my knowledge the OSD was hardly doing anything, and I do not see 
>> any pattern that would fit your explanation.
>> However, you certainly understand a lot more about this than I do, so I will try 
>> to explain everything that could be relevant.
>> 
>> The cluster has 3 nodes. Each has a 240GB NVMe M.2 SSD as boot drive, which 
>> should not be relevant. Each node has 3 OSDs: one is on a 2TB U.2 NVMe SSD 
>> and the other two are on 12TB HDDs.
>> 
>> I have configured two CRUSH rules, ‘c3nvme’ and ‘ec4x2hdd’. The ‘c3nvme’ rule 
>> is a replicated rule that uses only OSDs with class ‘nvme’. The second rule is 
>> a tricky erasure rule: it selects exactly 2 OSDs on exactly 4 hosts with class 
>> ‘hdd’, so it only works for a size of exactly 8. That means a pool that 
>> uses this rule always has only “undersized” placement groups, since the 
>> cluster has only 3 nodes. (I did not add the fourth server after the first 
>> crash in December, since we want to reproduce the problem.)
>> 
>> The pools device_health_metrics, test-pool, fs.metadata-root-pool, 
>> fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use the CRUSH 
>> rule ‘c3nvme’ with a size of 3 and a min size of 2. The pools 
>> fs.data-hdd.ec-pool and block-hdd.ec-pool use the CRUSH rule ‘ec4x2hdd’ with 
>> k=5, m=3 and a min size of 6.
>> 
>> The pool fs.data-nvme.c-pool is not used, and the pool test-pool was used for 
>> rados bench a few months ago.
>> 
>> The pool fs.metadata-root-pool is used as metadata pool for cephFS and 
>> fs.data-root-pool as the root data pool for the cephFS. The pool 
>> fs.data-hdd.ec-pool is an additional data pool for the cephFS and is 
>> specified as ceph.dir.layout for some folders of the cephFS. The whole 
>> cephFS is mounted by each of the 3 nodes.
>> 
>> The pool block-nvme.c-pool hosts two RBD images that are used as boot drives 
>> for two VMs. The first VM runs Ubuntu Desktop and the second runs Debian. 
>> The pool block-hdd.ec-pool hosts one RBD image (the data part; the metadata 
>> is on block-nvme.c-pool) that is attached to the Debian VM as a second 
>> drive formatted with BTRFS. Furthermore, the Debian VM mounts a subdirectory 
>> of the CephFS that has the fs.data-hdd.ec-pool set as layout. Both VMs were 
>> doing nothing, apart from being booted, in the last couple of days.
>> 
>> I try to illustrate the pool usage as a tree:
>> * c3nvme (replicated, size=3, min_size=2)
>>     + device_health_metrics
>>     + test-pool
>>         - rados bench
>>     + fs.metadata-root-pool
>>         - CephFS (metadata)
>>     + fs.data-root-pool
>>         - CephFS (data root)
>>     + fs.data-nvme.c-pool
>>     + block-nvme.c-pool
>>         - RBD (Ubuntu VM, boot disk with ext4)
>>         - RBD (Debian VM, boot disk with ext4)
>> * ec4x2hdd (ec,  k=5, m=3, min_size=6)
>>     + fs.data-hdd.ec-pool
>>         - CephFS (layout data)
>>     + block-hdd.ec-pool
>>         - RBD (Debian VM, disk 2 with BTRFS)
>> 
>> 
>> Last week I wrote about 30TB to the CephFS inside the 
>> fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the 
>> block-hdd.ec-pool. The 50GB written to BTRFS contains hundreds of thousands of 
>> hard links. Nearly the same data was already on the storage and I used 
>> rsync to update it. I think something between 1 and 4 TB had changed and was 
>> updated by rsync.
>> 
>> I think the cluster was totally unused on Friday, but up and running and 
>> idling around. Then on Saturday I did a graceful shutdown of all cluster 
>> nodes. Around 5 minutes later, when I booted the servers again, OSD.7 
>> crashed. I copied the logs and exported the RocksDB. Then I deleted 
>> everything from the HDD and deployed OSD.7 again. When I checked for the 
>> first time today at around 12:00, Ceph had already finished backfilling 
>> to OSD.7 and the cluster was idle again.
>> 
>> I then spent 70 minutes writing 3 small files (one of about 500 bytes 
>> and two of about 2KB) to the CephFS on the fs.data-hdd.ec-pool and 
>> shutting down the whole cluster. After I booted all 3 nodes again, I 
>> overwrote the 3 small files (a few bytes changed) and shut down all 3 
>> nodes. I repeated that procedure 4 times within the 70 minutes. Then OSD.7 
>> crashed again.
>> 
>> There may have been some more IO caused by mounting the RBD image with the 
>> BTRFS volume inside the Debian VM, but I do not know. I can only say that no 
>> user used it. Only I wrote the 3 small files 4 times to the CephFS on 
>> the fs.data-hdd.ec-pool.
>> 
>> Does this make any sense to you?
> Well, that's actually not a write pattern I would expect to cause the issue. 
> But thanks a lot anyway. So could you please repeat that procedure of 
> overwriting the 3 small files and restarting the OSD, and collect the logs with 
> debug-bluestore set to 20?
>> 
>>> with debug-bluestore set to 20
>> Do you mean
>> ```
>> [osd]
>>      debug bluestore = 20
>> ```
>> in ceph.conf?
> Yes. Or you might use 'ceph config set osd.7 debug-bluestore 20' instead.
> 
>> Thanks,
>> Sebastian
>> 
>> 
>> 
>> 
>>> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>> 
>>> Hey Sebastian,
>>> 
>>> thanks a lot for the new logs - looks like they provide some insight. At 
>>> this point I think the root cause is apparently a race between deferred 
>>> writes replay and some DB maintenance task happening on OSD startup. It 
>>> seems that deferred write replay updates a block extent which 
>>> RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros 
>>> content. Evidently that's just a matter of chance whether they use 
>>> conflicting physical extent or not hence the occasional nature of the 
>>> issue...
>>> 
>>> And now I'd like to determine what's wrong with this deferred write replay.
>>> 
>>> So first of all I'm curious if you have any particular write patterns that 
>>> can be culprits? E.g. something like disk wiping procedure which writes 
>>> all-zeros to an object followed by object truncate or removal comes to my 
>>> mind. If you can identify something like that - could you please collect 
>>> OSD log for such an operation (followed by OSD restart) with 
>>> debug-bluestore set to 20?
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
>>>> Hi Igor,
>>>> 
>>>> today (21-02-2022) at 13:49:28.452+0100, I crashed the OSD 7 again. And 
>>>> this time I have logs with “debug bluefs = 20” and "debug bdev = 20” for 
>>>> every OSD in the cluster! It was the OSD with the ID 7 again. So the HDD 
>>>> has now failed for the third time! Coincidence? Probably not…
>>>> The important thing seems to be that a shutdown, and not only a restart, of 
>>>> the entire cluster is performed. This time the OSD failed after 
>>>> just 4 shutdowns of all nodes in the cluster within 70 minutes.
>>>> 
>>>> I redeployed the OSD.7 after the crash from 2 days ago. And I started this 
>>>> new shutdown and boot series shortly after ceph had finished writing 
>>>> everything back to OSD.7, earlier today.
>>>> 
>>>> The corrupted RocksDB file (from the crash) is again only 2KB in size.
>>>> You can download the RocksDB file with the bad table magic number and the 
>>>> log of OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
>>>> What else do you need?
>>>> 
>>>> From the log of the OSD.7:
>>>> —————
>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 
>>>> /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 
>>>> /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 
>>>> 4096 ioc 0x55f0b8dbdd18 with 0 aios left
>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated 
>>>> from /sbin/init  (PID: 1) UID: 0
>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal 
>>>> Terminated ***
>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate 
>>>> shutdown (osd_fast_shutdown=true) ***
>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 
>>>> (ceph:ceph)
>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 
>>>> (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process 
>>>> ceph-osd, pid 1967
>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty 
>>>> --pid-file
>>>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 
>>>> /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>>>> —————
>>>> 
>>>> To me this looks like the OSD did nothing for nearly 2 minutes 
>>>> before it received the termination request. Shouldn't that be enough time 
>>>> to flush every imaginable write cache?
>>>> 
>>>> 
>>>> I hope this helps you.
>>>> 
>>>> 
>>>> Best wishes,
>>>> Sebastian
>>>> 
>>> -- 
>>> Igor Fedotov
>>> Ceph Lead Developer
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>> 
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> 
> -- 
> Igor Fedotov
> Ceph Lead Developer
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
