Hello,

I've been facing some issues with a single-node Ceph cluster (Mimic). I know an environment like this shouldn't be in production, but the server ended up handling operational workloads for the last 2 years.
Some users detected issues in CephFS: some files were not accessible, and listing the contents of the affected folders hung the node. I also noticed a heavy memory load on the server; main memory was mostly consumed by cache, and a considerable amount of swap was in use. The command "ceph health detail" reported some inactive PGs, but those PGs didn't actually exist.

After rebooting the node, an fsck was run on the 3 affected OSDs:

ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-1/

Unfortunately, all of them crashed with a core dump and now they don't start anymore. The logs report messages like:

2019-08-28 03:00:12.999 7f21d787c240  4 rocksdb: [/build/ceph-13.2.1/src/rocksdb/db/version_set.cc:3088] Recovering from manifest file: MANIFEST-004059
2019-08-28 03:00:12.999 7f21d787c240  4 rocksdb: [/build/ceph-13.2.1/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all background work
2019-08-28 03:00:12.999 7f21d787c240  4 rocksdb: [/build/ceph-13.2.1/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2019-08-28 03:00:12.999 7f21d787c240 -1 rocksdb: NotFound:
2019-08-28 03:00:12.999 7f21d787c240 -1 bluestore(/var/lib/ceph/osd/ceph-0) _open_db erroring opening db:
2019-08-28 03:00:12.999 7f21d787c240  1 bluefs umount
2019-08-28 03:00:12.999 7f21d787c240  1 stupidalloc 0x0x5650c5255800 shutdown
2019-08-28 03:00:12.999 7f21d787c240  1 bdev(0x5650c5604a80 /var/lib/ceph/osd/ceph-0/block) close
2019-08-28 03:00:13.247 7f21d787c240  1 bdev(0x5650c5604700 /var/lib/ceph/osd/ceph-0/block) close
2019-08-28 03:00:13.479 7f21d787c240 -1 osd.0 0 OSD:init: unable to mount object store
2019-08-28 03:00:13.479 7f21d787c240 -1 ** ERROR: osd init failed: (5) Input/output error

I'm not sure whether the fsck has introduced additional damage. After that, I tried to mark the unfound objects as lost with the following commands:

ceph pg 4.1e mark_unfound_lost revert
ceph pg 9.1d mark_unfound_lost revert
ceph pg 13.3 mark_unfound_lost revert
ceph pg 13.e mark_unfound_lost revert

Currently, since 3 OSDs are down, there are:

316 unclean PGs
76 inactive PGs

root@ceph-s01:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME             STATUS REWEIGHT PRI-AFF
-2       0.43599 root ssd
-4       0.43599     disktype ssd_disk
12   ssd 0.43599         osd.12            up  1.00000 1.00000
-1      60.03792 root default
-5      60.03792     disktype hdd_disk
 0   hdd       0         osd.0           down  1.00000 1.00000
 1   hdd 5.45799         osd.1           down        0 1.00000
 2   hdd 5.45799         osd.2             up  1.00000 1.00000
 3   hdd 5.45799         osd.3             up  1.00000 1.00000
 4   hdd 5.45799         osd.4             up  1.00000 1.00000
 5   hdd 5.45799         osd.5             up  1.00000 1.00000
 6   hdd 5.45799         osd.6             up  1.00000 1.00000
 7   hdd 5.45799         osd.7           down        0 1.00000
 8   hdd 5.45799         osd.8             up  1.00000 1.00000
 9   hdd 5.45799         osd.9             up  1.00000 1.00000
10   hdd 5.45799         osd.10            up  1.00000 1.00000
11   hdd 5.45799         osd.11            up  1.00000 1.00000

Running the following command, a MANIFEST file appeared in the db/lost folder; I guess the repair moved it there:

# ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-7 --out-dir osd7/
...
db/LOCK
db/MANIFEST-000001
db/OPTIONS-018543
db/OPTIONS-018581
db/lost/
db/lost/MANIFEST-018578

Any ideas? Suggestions?

Thank you.

Regards,
Jordi