Hi,

the RocksDB inside BlueStore should be opened with ceph-kvstore-tool's bluestore-kv mode:

ceph-kvstore-tool bluestore-kv

instead of just "rocksdb", which is for a RocksDB that lives directly on a file system.
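For the osd.3 from the logs below, that would presumably look something like this (assuming the usual bluestore-kv <osd path> <command> form, with "list" as an example command):

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 list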
Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sun, Jun 30, 2019 at 2:49 PM Christian Wahl <w...@teco.edu> wrote:
> Hi all,
>
> we are running a pretty small Ceph instance (v13.2.6) with 1 host and 8 OSDs
> and are planning to expand to a more standard setup with 3 hosts and more OSDs.
>
> However, tonight one of our redundant PSUs died. The failover worked, but it
> looks like this has corrupted 3 out of 8 OSDs.
> The pools all have a replication level of 2.
>
> All OSDs are BlueStore with RocksDB, no external journal or WAL.
>
> 2 of them report a missing rocksdb:
>
> Jun 30 01:33:32 tecoceph systemd[1]: Starting Ceph object storage daemon osd.3...
> Jun 30 01:33:32 tecoceph systemd[1]: Started Ceph object storage daemon osd.3.
> Jun 30 01:33:32 tecoceph ceph-osd[11431]: 2019-06-30 01:33:32.242 7f2666a75d80 -1 Public network was set, but cluster network was not set
> Jun 30 01:33:32 tecoceph ceph-osd[11431]: 2019-06-30 01:33:32.242 7f2666a75d80 -1 Using public network also for cluster network
> Jun 30 01:33:32 tecoceph ceph-osd[11431]: starting osd.3 at - osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
> Jun 30 01:33:32 tecoceph ceph-osd[11431]: 2019-06-30 01:33:32.898 7f2666a75d80 -1 rocksdb: NotFound:
> Jun 30 01:33:32 tecoceph ceph-osd[11431]: 2019-06-30 01:33:32.898 7f2666a75d80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db:
> Jun 30 01:33:33 tecoceph ceph-osd[11431]: 2019-06-30 01:33:33.267 7f2666a75d80 -1 osd.3 0 OSD:init: unable to mount object store
> Jun 30 01:33:33 tecoceph ceph-osd[11431]: 2019-06-30 01:33:33.267 7f2666a75d80 -1 ** ERROR: osd init failed: (5) Input/output error
> Jun 30 01:33:33 tecoceph systemd[1]: ceph-osd@3.service: main process exited, code=exited, status=1/FAILURE
>
> So I tried working with the bluestore tool:
>
> [root@tecoceph osd]# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-3/
> inferring bluefs devices from bluestore path
> {
>     "/var/lib/ceph/osd/ceph-3//block": {
>         "osd_uuid": "c28c092c-00aa-4db0-9925-642bf99f0662",
>         "size": 8001561821184,
>         "btime": "2018-05-28 22:44:58.712336",
>         "description": "main",
>         "bluefs": "1",
>         "ceph_fsid": "a9493143-3e4e-450e-b3b8-28508d48d412",
>         "kv_backend": "rocksdb",
>         "magic": "ceph osd volume v026",
>         "mkfs_done": "yes",
>         "osd_key": "AQBG************************",
>         "ready": "ready",
>         "whoami": "3"
>     }
> }
>
> [root@tecoceph osd]# ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-3/
> 2019-06-30 14:38:35.998 7f9947432940 -1 rocksdb: NotFound:
> 2019-06-30 14:38:35.998 7f9947432940 -1 bluestore(/var/lib/ceph/osd/ceph-3/) _open_db erroring opening db:
> error from fsck: (5) Input/output error
>
> Trying to access the RocksDB with the kvstore tool fails as well:
> [root@tecoceph osd]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-3 list
> 2019-06-30 14:39:36.021 7faa747e8a80  1 rocksdb: do_open column families: []
> failed to open type 2019-06-30 14:39:36.022 7faa747e8a80 -1 rocksdb: Invalid argument: /var/lib/ceph/osd/ceph-3: does not exist (create_if_missing is false)
> rocksdb path /var/lib/ceph/osd/ceph-3: (22) Invalid argument
>
> Repairing it with the kvstore tool results in a segmentation fault…
> [root@tecoceph osd]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-3 repair
> *** Caught signal (Segmentation fault) **
>  in thread 7ff8fde03a80 thread_name:ceph-kvstore-to
>  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
>  1: (()+0xf5d0) [0x7ff8f23925d0]
>  2: (main()+0x2c4) [0x55ae6dadb4e4]
>  3: (__libc_start_main()+0xf5) [0x7ff8f0d673d5]
>  4: (()+0x21dde0) [0x55ae6dbafde0]
> 2019-06-30 14:39:15.785 7ff8fde03a80 -1 *** Caught signal (Segmentation fault) **
>  in thread 7ff8fde03a80 thread_name:ceph-kvstore-to
>
>  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
>  1: (()+0xf5d0) [0x7ff8f23925d0]
>  2: (main()+0x2c4) [0x55ae6dadb4e4]
>  3: (__libc_start_main()+0xf5) [0x7ff8f0d673d5]
>  4: (()+0x21dde0) [0x55ae6dbafde0]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> The other one crashes with a segfault (and so does any tool run against it) because of a wrong magic number:
> Jun 30 01:32:29 tecoceph ceph-osd[8661]:  -324> 2019-06-30 01:32:27.805 7fa9bd453d80 -1 Public network was set, but cluster network was not set
> Jun 30 01:32:29 tecoceph ceph-osd[8661]:  -324> 2019-06-30 01:32:27.805 7fa9bd453d80 -1 Using public network also for cluster network
> Jun 30 01:32:29 tecoceph ceph-osd[8661]:  -324> 2019-06-30 01:32:29.771 7fa9bd453d80 -1 abort: Corruption: Bad table magic number: expected 9863518390377041911, found 15656361161312523986 in db/002923.sst
> Jun 30 01:32:29 tecoceph ceph-osd[8661]:  -324> 2019-06-30 01:32:29.831 7fa9bd453d80 -1 *** Caught signal (Aborted) **
>
> Is there any way to recover any of these OSDs?
>
> Karlsruhe Institute of Technology (KIT)
> Pervasive Computing Systems – TECO
> Prof. Dr. Michael Beigl
> IT
> Christian Wahl
>
> Vincenz-Prießnitz-Str. 1
> Building 07.07., 2nd floor
> 76131 Karlsruhe, Germany
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com