Hi Igor,

Today (21-02-2022) at 13:49:28.452+0100 I crashed the OSD 7 again. And this
time I have logs with "debug bluefs = 20" and "debug bdev = 20" for every OSD
in the cluster! It was the OSD with the ID 7 again, so this HDD has now failed
for the third time! Coincidence? Probably not…
The important point seems to be that a shutdown, and not just a restart, of the
entire cluster is performed: this time the OSD failed after only 4 shutdowns of
all nodes in the cluster within 70 minutes.
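For reference, the debug levels were set on every OSD roughly like this in
ceph.conf (just a sketch of my setup, the exact placement may differ for you):
—————
[osd]
        debug bluefs = 20
        debug bdev = 20
—————
The same can also be injected at runtime with "ceph config set osd
debug_bluefs 20/20" and "ceph config set osd debug_bdev 20/20".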

I redeployed the OSD.7 after the crash from 2 days ago, and I started this new
shutdown and boot series earlier today, shortly after Ceph had finished writing
everything back to OSD.7.
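(Just for completeness: that recovery/backfill had really finished can be
verified with something like the following, all PGs being active+clean again
before the first shutdown:)
—————
ceph -s        # overall health and recovery/backfill progress
ceph pg stat   # all PGs should be active+clean
—————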

The corrupted RocksDB file (from the crash) is again only 2 KB in size.
You can download the RocksDB file with the bad table magic number and the log
of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
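(In case it matters how such a file can be obtained: the corrupted .sst can be
pulled out of BlueFS with ceph-bluestore-tool while the OSD is stopped,
roughly like this; the out-dir is of course arbitrary:)
—————
systemctl stop ceph-osd@7
ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-7 --out-dir /tmp/osd.7-bluefs
# the RocksDB files then show up under /tmp/osd.7-bluefs/db/
—————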
What else do you need?

From the log of the OSD.7:
—————
2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty --pid-file
2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
—————

To me it looks like the OSD did nothing for nearly 2 minutes before it
received the termination request. Shouldn't that be enough time to flush every
imaginable write cache?
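(If the volatile write cache of the HDD is the suspect here: I can check or
disable it with something like the following, where /dev/sdX stands for the
disk behind OSD.7:)
—————
smartctl -g wcache /dev/sdX    # show whether the volatile write cache is enabled
hdparm -W /dev/sdX             # same information via hdparm
hdparm -W0 /dev/sdX            # disable the volatile write cache
—————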


I hope this helps you.


Best wishes,
Sebastian
