Hi Theo,

Sorry, I can't tell for sure what marking an OSD lost would do to its encryption keys. Likely yes - they'll be lost.

But instead of going that way, I'd rather suggest adding another two OSDs and letting Ceph recover more PG replicas onto them.

Just in case - you didn't overwrite existing PG replicas at the target OSDs when exporting PGs back to OSDs 1 & 3, did you? The same PG can't have two replicas/shards on a single OSD, and your OSD count is pretty limited. Just curious for now - it still shouldn't be an issue given that you have at least 2 replicas/shards for all the pools anyway.

Thanks,

Igor

On 2/21/2026 11:00 PM, Theo Cabrerizo Diem via ceph-users wrote:
Hello Igor,

First of all, sorry about the late reply. It took me a while to export all
the shards that weren't available from osd.2 (osd.1 and osd.3 were fine;
osd.2 didn't start, but I could still use `ceph-objectstore-tool ... --op
list-pgs` on it, while on osd.0 I couldn't even list the PGs - it threw an
error right away; more about that later in the email).

For two of the unavailable shards, ceph-objectstore-tool core dumped during
export with the same RocksDB issue, but I should have enough chunks to not
need them - just mentioning it in case it's useful:

sh-5.1# ceph-objectstore-tool --data-path /var/lib/ceph/osd --pgid 11.19s2
--op export --file pg.11.19s2.dat
/ceph/rpmbuild/BUILD/ceph-20.2.0/src/kv/RocksDBStore.cc: In function
'virtual int RocksDBStore::get(const std::string&, const std::string&,
ceph::bufferlist*)' thread 7ff3be4ca800 time 2026-02-04T09:42:00.743877+0000
/ceph/rpmbuild/BUILD/ceph-20.2.0/src/kv/RocksDBStore.cc: 1961:
ceph_abort_msg("block checksum mismatch: stored = 246217859, computed =
2155741315, type = 4  in db/170027.sst offset 28264757 size 1417")
  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
(stable - RelWithDebInfo)
  1: (ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0xc9) [0x7ff3bf5391fd]
  2: (RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3bc)
[0x555667b340bc]
  3:
(BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
ghobject_t const&,
std::set<std::__cxx11::basic_string<char,std::char_traits<char>,
std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > >,
std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > > > const&,
std::map<std::__cxx11::basic_string<char,std::char_traits<char>,
std::allocator<char> >, ceph::buffer::v15_2_0::list,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const,
ceph::buffer::v15_2_0::list> > >*)+0x401) [0x555667a25fe1]
  4: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0x361)
[0x5556675e0101]
  5: main()
  6: /lib64/libc.so.6(+0x2a610) [0x7ff3be930610]
  7: __libc_start_main()
  8: _start()
*** Caught signal (Aborted) **
  in thread 7ff3be4ca800 thread_name:ceph-objectstor
  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
(stable - RelWithDebInfo)
  1: /lib64/libc.so.6(+0x3fc30) [0x7ff3be945c30]
  2: /lib64/libc.so.6(+0x8d03c) [0x7ff3be99303c]
  3: raise()
  4: abort()
  5: (ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0x186) [0x7ff3bf5392ba]
  6: (RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3bc)
[0x555667b340bc]
  7:
(BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
ghobject_t const&,
std::set<std::__cxx11::basic_string<char,std::char_traits<char>,
std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > >,
std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > > > const&,
std::map<std::__cxx11::basic_string<char,std::char_traits<char>,
std::allocator<char> >, ceph::buffer::v15_2_0::list,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const,
ceph::buffer::v15_2_0::list> > >*)+0x401) [0x555667a25fe1]
  8: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0x361)
[0x5556675e0101]
  9: main()
  10: /lib64/libc.so.6(+0x2a610) [0x7ff3be930610]
  11: __libc_start_main()
  12: _start()
Aborted (core dumped)
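
For anyone skimming the trace: the abort is RocksDB detecting that the
checksum stored alongside an SST block no longer matches a checksum
recomputed over the bytes read back from disk - i.e. on-disk corruption
rather than a logic bug. A minimal Python sketch of that stored-vs-computed
pattern (CRC-32 stands in for RocksDB's actual per-block checksum
functions, and the helper name is mine, not RocksDB's):

```python
import zlib

def block_checksum_ok(block: bytes, stored: int) -> bool:
    """Recompute a checksum over a block read back from disk and compare
    it with the value stored next to the block when it was written.
    CRC-32 is illustrative only; RocksDB supports several checksum types."""
    return zlib.crc32(block) == stored

payload = b"sst data block"
stored = zlib.crc32(payload)

print(block_checksum_ok(payload, stored))            # True: block intact
print(block_checksum_ok(payload + b"\x01", stored))  # False: bit rot detected
```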



After importing all the shards I could recover that weren't available, I no
longer have any "unknown" PGs. I still have lots of PGs in the "down"
state, which I assume means I need to flag both "dead" OSDs as lost to
unstick them. Since that is an operation I cannot undo, I would like to
confirm that it is indeed the correct next step to take.

I have a few questions to understand what happens in the next step (marking
the OSDs as lost):

Shall I assume that once I flag an OSD as lost, I won't be able to
"activate" it again, since I use encryption when initializing the BlueStore
OSDs? Or does flagging them as lost leave their unlocking keys intact? (If
the keys are destroyed, any hope of further extracting data is gone, mostly
for osd.0, on which I haven't been able to use ceph-objectstore-tool at all
since the power loss.)

I think I should have all the shards from the PGs, but just in case, I've
managed to make a clone of osd.0 onto a different physical disk (the other
reason I took so long to answer). But ceph-objectstore-tool still refuses
to run:

# ceph-objectstore-tool --data-path /var/lib/ceph/osd --op list-pgs
Mount failed with '(5) Input/output error'

# ls -l /var/lib/ceph/osd
total 28
lrwxrwxrwx 1 ceph ceph  50 Feb  4 08:26 block ->
/dev/mapper/zNPZJR-i0TZ-6NtK-URto-tjfs-iJRb-GCAYEm
-rw------- 1 ceph ceph  37 Feb  4 08:26 ceph_fsid
-rw------- 1 ceph ceph  37 Feb  4 08:26 fsid
-rw------- 1 ceph ceph  55 Feb  4 08:26 keyring
-rw------- 1 ceph ceph 106 Jan 24 00:44 lockbox.keyring
-rw------- 1 ceph ceph   6 Feb  4 08:26 ready
-rw------- 1 ceph ceph  10 Feb  4 08:26 type
-rw------- 1 ceph ceph   2 Feb  4 08:26 whoami

For context, all but two of the pools in my cluster are replicated. Pools
11 and 16 are erasure coded (2+1). If I understood correctly, as long as I
have two acting shards (and at most one "NONE"), the data should be
available (at least read-only) once I mark the down OSDs as lost. Is that
understanding correct?
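
To sanity-check that reasoning in code, here is a small Python sketch of
the k+m availability rule (my own helper, not a Ceph API): a PG in a 2+1
erasure-coded pool needs any k = 2 of its 3 shards, so it stays readable
while at most m = 1 acting slot is NONE.

```python
def ec_pg_readable(acting, k=2, m=1):
    """A PG in a k+m erasure-coded pool needs any k of its k+m shards to
    reconstruct the data, so it stays readable while at most m acting
    slots are missing (shown as NONE in `ceph health detail`)."""
    assert len(acting) == k + m
    missing = sum(1 for osd in acting if osd is None)
    return missing <= m

# Acting sets as printed by `ceph health detail`, with NONE mapped to None:
print(ec_pg_readable([3, 1, None]))     # e.g. pg 11.10: 2 shards live -> True
print(ec_pg_readable([None, 1, None]))  # two shards gone -> False
```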

Also for context, pools 10 and 15 are the "replicated root pools" that
existed before the erasure-coded pools were created.

Ignoring osd.0 for now, here is the current state of my cluster (the mds is
intentionally not started while I try to fix the PGs):
### ceph osd lspools
3 .rgw.root
4 default.rgw.log
5 default.rgw.control
6 default.rgw.meta
10 ark.data
11 ark.data_ec
12 ark.metadata
14 .mgr
15 limbo
16 limbo.data_ec
18 default.rgw.buckets.index
19 default.rgw.buckets.data
###

### ceph -s
# ceph -s
   cluster:
     id:     021f058f-dbf3-4a23-adb5-21d83f3f1bb6
     health: HEALTH_ERR
             1 filesystem is degraded
             1 filesystem has a failed mds daemon
             1 filesystem is offline
             insufficient standby MDS daemons available
             Reduced data availability: 143 pgs inactive, 143 pgs down
             Degraded data redundancy: 1303896/7149898 objects degraded
(18.237%), 218 pgs degraded, 316 pgs undersized
             144 pgs not deep-scrubbed in time
             459 pgs not scrubbed in time
             256 slow ops, oldest one blocked for 1507794 sec, osd.1 has
slow ops
             too many PGs per OSD (657 > max 500)

   services:
     mon: 2 daemons, quorum ceph-ymir-mon2,ceph-ymir-mon1 (age 2w)
     mgr: ceph-ymir-mgr1(active, since 2w)
     mds: 0/1 daemons up (1 failed)
     osd: 4 osds: 2 up (since 29m), 2 in (since 4w); 24 remapped pgs

   data:
     volumes: 0/1 healthy, 1 failed
     pools:   12 pools, 529 pgs
     objects: 2.46M objects, 7.4 TiB
     usage:   8.3 TiB used, 13 TiB / 22 TiB avail
     pgs:     27.032% pgs not active
              1303896/7149898 objects degraded (18.237%)
              306628/7149898 objects misplaced (4.289%)
              218 active+undersized+degraded
              143 down
              98  active+undersized
              45  active+clean
              19  active+clean+remapped
              4   active+clean+remapped+scrubbing+deep
              1   active+clean+remapped+scrubbing
              1   active+clean+scrubbing+deep
###

### ceph health detail
# ceph health detail
HEALTH_ERR 1 filesystem is degraded; 1 filesystem has a failed mds daemon;
1 filesystem is offline; insufficient standby MDS daemons available;
Reduced data availability: 143 pgs inactive, 143 pgs down;
Degraded data redundancy: 1303896/7149898 objects degraded (18.237%),
218 pgs degraded, 316 pgs undersized; 144 pgs not deep-scrubbed in time;
459 pgs not scrubbed in time; 256 slow ops, oldest one blocked for
1508207 sec, osd.1 has slow ops; too many PGs per OSD (657 > max 500)
[WRN] FS_DEGRADED: 1 filesystem is degraded
     fs ark is degraded
[WRN] FS_WITH_FAILED_MDS: 1 filesystem has a failed mds daemon
     fs ark has 1 failed mds
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
     fs ark is offline because no MDS is active for it.
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
     have 0; want 1 more
[WRN] PG_AVAILABILITY: Reduced data availability: 143 pgs inactive, 143 pgs
down
     pg 10.11 is down, acting [1,3]
     pg 10.18 is down, acting [3,1]
     pg 10.1d is down, acting [1,3]
     pg 10.1f is down, acting [1,3]
     pg 11.10 is down, acting [3,1,NONE]
     pg 11.12 is down, acting [1,NONE,3]
     pg 11.18 is stuck inactive for 4w, current state down, last acting
[1,3,NONE]
     pg 11.19 is down, acting [3,1,NONE]
     pg 11.1b is down, acting [1,NONE,3]
     pg 11.62 is down, acting [NONE,3,1]
     pg 11.63 is down, acting [3,NONE,1]
     pg 11.64 is down, acting [NONE,1,3]
     pg 11.66 is down, acting [NONE,3,1]
     pg 11.67 is down, acting [1,NONE,3]
     pg 11.68 is down, acting [3,NONE,1]
     pg 11.69 is down, acting [NONE,1,3]
     pg 11.6a is down, acting [1,NONE,3]
     pg 11.6b is down, acting [NONE,1,3]
     pg 11.6f is down, acting [NONE,3,1]
     pg 11.71 is down, acting [1,3,NONE]
     pg 11.72 is down, acting [1,3,NONE]
     pg 11.74 is down, acting [NONE,3,1]
     pg 11.76 is down, acting [1,NONE,3]
     pg 11.78 is down, acting [3,1,NONE]
     pg 11.7d is down, acting [NONE,3,1]
     pg 11.7e is down, acting [NONE,1,3]
     pg 15.15 is down, acting [1,3]
     pg 15.16 is down, acting [3,1]
     pg 15.17 is down, acting [1,3]
     pg 15.1a is down, acting [3,1]
     pg 16.1 is down, acting [1,3,NONE]
     pg 16.4 is down, acting [1,3,NONE]
     pg 16.b is down, acting [3,NONE,1]
     pg 16.60 is down, acting [3,1,NONE]
     pg 16.61 is down, acting [3,1,NONE]
     pg 16.62 is down, acting [3,NONE,1]
     pg 16.63 is down, acting [3,NONE,1]
     pg 16.65 is down, acting [NONE,3,1]
     pg 16.67 is down, acting [1,NONE,3]
     pg 16.68 is down, acting [1,NONE,3]
     pg 16.69 is down, acting [3,1,NONE]
     pg 16.6a is down, acting [1,3,NONE]
     pg 16.6c is down, acting [1,3,NONE]
     pg 16.70 is down, acting [3,NONE,1]
     pg 16.73 is down, acting [3,NONE,1]
     pg 16.74 is down, acting [1,3,NONE]
     pg 16.75 is down, acting [3,1,NONE]
     pg 16.79 is down, acting [3,NONE,1]
     pg 16.7a is down, acting [1,3,NONE]
     pg 16.7e is down, acting [1,3,NONE]
     pg 16.7f is down, acting [3,NONE,1]
[WRN] PG_DEGRADED: Degraded data redundancy: 1303896/7149898 objects
degraded (18.237%), 218 pgs degraded, 316 pgs undersized
     pg 3.18 is stuck undersized for 36m, current state active+undersized,
last acting [1,3]
...<snipped for brevity>
###
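
As a quick sanity check of the figures above, the degraded and misplaced
percentages in the status output are just ratios over total object copies
(a back-of-the-envelope reproduction, not how the mgr computes them
internally):

```python
total = 7149898        # total object copies across all pools
degraded = 1303896     # copies currently short of their replication target
misplaced = 306628     # copies sitting on an unintended OSD

print(f"degraded:  {degraded / total:.3%}")   # -> degraded:  18.237%
print(f"misplaced: {misplaced / total:.3%}")  # -> misplaced: 4.289%
```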

Once again, I cannot thank you enough for looking into my issue.
I have the impression that recovering the data I need is just around the
corner. Although the croit.io blog did mention flagging the OSDs as lost, I
would like to double-check it to avoid losing any possibility of recovering
the data.

If there's anything further I could check, or if you need the full output
of any of the commands, let me know.

Thanks in advance.

On Tue, 3 Feb 2026 at 10:26, Igor Fedotov <[email protected]> wrote:

Hi Theo,

you might want to try PG export/import using ceph-objectstore-tool.

Please find more details here
https://www.croit.io/blog/how-to-recover-inactive-pgs-using-ceph-objectstore-tool-on-ceph-clusters


Thanks,

Igor
On 03/02/2026 02:38, Theo Cabrerizo Diem via ceph-users wrote:

:12:18.895+0000 7f0c543eac00 -1 bluestore(/var/lib/ceph/osd)
fsck error: free extent 0x1714c521000~978b26df000 intersects allocated blocks
fsck status: remaining 1 error(s) and warning(s)


_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
