Hi Igor,

today (21-02-2022) at 13:49:28.452+0100 I crashed OSD 7 again, and this time I have logs with "debug bluefs = 20" and "debug bdev = 20" for every OSD in the cluster! It was again the OSD with the ID 7, so this HDD has now failed for the third time. Coincidence? Probably not… The important point seems to be that a shutdown, and not just a restart, of the entire cluster is performed: this time the OSD failed after just 4 shutdowns of all nodes in the cluster within 70 minutes.
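In case it is useful: to sanity-check the 2 KB file I link below, a small Python sketch like the following can read the last 8 bytes of the SST footer and compare them against the usual RocksDB block-based table magic numbers. The magic constants and the assumption that the damaged file is a block-based .sst file are guesses on my part, not something taken from the OSD log.
—————
# Rough sketch: read the 8-byte magic number at the end of an SST file
# footer and compare it with the known RocksDB block-based table magics.
# The file path is passed on the command line; the constants below are
# assumed values for RocksDB's block-based table format.
import os
import struct
import sys

BLOCK_BASED_MAGIC = 0x88E241B785F4CFF7         # current block-based table format
LEGACY_BLOCK_BASED_MAGIC = 0xDB4775248B80FB57  # legacy block-based table format

def footer_magic(path):
    # The footer stores the magic number little-endian in the last 8 bytes.
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        (magic,) = struct.unpack("<Q", f.read(8))
    return magic

if __name__ == "__main__":
    magic = footer_magic(sys.argv[1])
    if magic in (BLOCK_BASED_MAGIC, LEGACY_BLOCK_BASED_MAGIC):
        print(f"footer magic 0x{magic:016x} looks like a valid SST table magic")
    else:
        print(f"footer magic 0x{magic:016x} does not match the expected values")
—————
If the footer comes back as all zeroes rather than one of these values, that would at least hint whether the tail of the file was ever written at all.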
I redeployed OSD.7 after the crash from 2 days ago, and I started this new shutdown-and-boot series earlier today, shortly after Ceph had finished writing everything back to OSD.7. The corrupted RocksDB file (crash) is again only 2 KB in size. You can download the RocksDB file with the bad table magic number and the log of OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ Is there anything else you need?

From the log of OSD.7:
—————
2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 set uid:gid to 64045:64045 (ceph:ceph)
2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 pidfile_write: ignore empty --pid-file
2022-02-21T13:53:40.459+0100 7fc9645f4f00 1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
—————

To me this looks like the OSD did nothing for nearly 2 minutes before it received the termination request. Shouldn't that be enough time to flush every imaginable write cache?

I hope this helps you.

Best wishes,
Sebastian
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io