Hi all,

We've got a containerized test cluster with 3 OSDs and ~220 GiB of data. 
Shortly after upgrading from Nautilus -> Octopus, 2 of the 3 OSDs started 
flapping. I've also got alarms about the MDS being damaged, which we've seen 
elsewhere and have a recovery process for, but I'm unable to run it (I 
suspect because I've only got 1 functioning OSD). My RGWs are also failing to 
start, again I suspect because of the bad state of the OSDs. I've tried 
restarting all OSDs, rebooting all servers, and checking auth (all looks 
fine), but I'm still in the same state.
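
For reference, the checks I mention above were roughly these (all of the 
output looked sane to me; osd.1 is just an example):

    # overall cluster state and which OSDs are marked down
    ceph -s
    ceph health detail
    ceph osd tree

    # confirm the OSD keys and caps still look right
    ceph auth ls
    ceph auth get osd.1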

My OSDs seem to be failing at the "_open_alloc opening allocation metadata" 
step; looking at the logs for each OSD restart, the OSD writes this line, 
goes quiet for a few minutes, and then logs:

    bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 132 GiB in 2930776 
extents available 113 GiB
    rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work

After that the OSD seems to retry startup in a slightly different state and 
hits a different set of errors:

    bluefs _allocate failed to allocate 0x100716 on bdev 1, free 0xd0000; 
fallback to bdev 2
    bluefs _allocate unable to allocate 0x100716 on bdev 2, free 
0xffffffffffffffff; fallback to slow device expander

and eventually crashes and logs a heap of stack dumps.
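
I haven't captured the stack dumps with higher verbosity yet; my next thought 
is to raise the BlueStore/BlueFS debug levels before restarting the affected 
OSD, roughly like this (these are the standard Ceph debug options - how the 
settings actually reach our containerized daemons depends on how we launch 
them, so treat this as a sketch):

    # temporarily turn up allocator-related logging for osd.1
    ceph config set osd.1 debug_bluestore 20
    ceph config set osd.1 debug_bluefs 20
    ceph config set osd.1 debug_rocksdb 5

    # ...restart the OSD, reproduce the crash, then turn it back down
    ceph config set osd.1 debug_bluestore 1/5
    ceph config set osd.1 debug_bluefs 1/5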

I don't know what extents are, but I seem to have a lot of them, and more 
than I've got capacity for? Maybe I'm running out of RAM or disk space 
somewhere, but I've got 21 GB of free RAM on the server, and each OSD has a 
350 GiB device attached to it.
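
The only way I've found to sanity-check the BlueFS/BlueStore space situation 
is ceph-bluestore-tool against the stopped OSD. I haven't run these yet, and 
I'm assuming the tool is available inside the OSD container, so this is just 
the incantation I'm planning to try:

    # with osd.1 stopped: per-device sizes and how much of each BlueFS device is used
    ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1

    # device labels (sizes of the block/db/wal devices)
    ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1

    # consistency check of the object store itself
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1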



I'm wondering if anyone has seen anything like this before or can suggest next 
debug steps to take?

Cheers,

Dave



Full OSD logs surrounding the "_open_alloc opening allocation metadata" step:


Jul 23 00:07:13 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:13.818+0000 7f3de111bf40  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1626998833819439, "job": 1, "event": "recovery_started", 
"log_files": [392088, 392132]}

Jul 23 00:07:13 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:13.818+0000 7f3de111bf40  4 rocksdb: [db/db_impl_open.cc:583] 
Recovering log #392088 mode 0

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:17.240+0000 7f3de111bf40  4 rocksdb: [db/db_impl_open.cc:583] 
Recovering log #392132 mode 0

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:17.486+0000 7f3de111bf40  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1626998837486404, "job": 1, "event": "recovery_finished"}

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:17.486+0000 7f3de111bf40  1 
bluestore(/var/lib/ceph/osd/ceph-1) _open_db opened rocksdb path db options 
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:17.524+0000 7f3de111bf40  1 freelist init

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:17.524+0000 7f3de111bf40  1 freelist _init_from_label

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:17.529+0000 7f3de111bf40  1 
bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc opening allocation metadata

Jul 23 00:07:18 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:18.238+0000 7f3de111bf40  1 HybridAllocator _spillover_range 
constructing fallback allocator

Jul 23 00:07:20 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:20.563+0000 7f3de111bf40  1 
bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 132 GiB in 2930776 
extents available 113 GiB

Jul 23 00:07:20 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:20.563+0000 7f3de111bf40  4 rocksdb: [db/db_impl.cc:390] 
Shutdown: canceling all background work

Jul 23 00:07:20 condor_sc0 container_name/ceph-osd-1[1709]: 
2021-07-23T00:07:20.565+0000 7f3de111bf40  4 rocksdb: [db/db_impl.cc:563] 
Shutdown complete
