That turned out to be exactly the issue (and boy was it fun clearing PGs
out on 71 OSDs). I think it's caused by a combination of two factors.
1. This cluster has way too many placement groups per OSD (just north of
800). It was fine when we first created all the pools, but upgrades (most
recently t
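(For anyone wanting to check their own cluster: the per-OSD placement group
count is the PGS column in the output of ceph osd df. The sort below assumes
PGS is the last column, as it is on Luminous/Mimic; adjust if your release
prints extra columns.)

# Per-OSD PG counts are in the PGS column
ceph osd df

# Quick view of the worst offenders: print OSD id and last column, sort numerically
ceph osd df | awk 'NR>1 {print $1, $NF}' | sort -k2 -n | tail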
Yeah, don't run these commands blind. They change the local metadata of the
PG in ways that can make it inconsistent with the rest of the cluster and
result in lost data.
Brett, it seems this issue has come up several times in the field but we
haven't been able to reproduce it locally or get eno
Can you file a tracker issue for your
problem (http://tracker.ceph.com/projects/ceph/issues/new)? Email, once it
gets lengthy, is not a great way to track the issue. Ideally, full details of
the environment (OS/Ceph versions, before/after, workload info, tool used
for the upgrade) are important if one has to recreate it. There a
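(A rough sketch of the sort of commands whose output is worth attaching to a
tracker issue; nothing here is specific to this problem, and the exact set
you need will vary.)

# Ceph daemon versions across the cluster (Luminous and later)
ceph versions
# Cluster state at the time of the report
ceph -s
ceph health detail
# OS release on the affected nodes
cat /etc/os-release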
Hi,
Sorry to hear that. I’ve been battling with mine for 2 weeks :/
I’ve corrected my OSDs with the following commands. My OSD logs
(/var/log/ceph/ceph-OSDx.log) have a line including log(ERR) with the PG number
next to it, just before the crash dump.
ceph-objectstore-tool --data-path /var/lib/ceph/os
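(The command above is cut off; for reference, the general shape of a PG
export-and-remove with ceph-objectstore-tool is roughly the following. osd.12
and PG 2.1a are placeholders, the OSD must be stopped first, and, per the
warning upthread, don't run this blind; always keep the export.)

# Stop the OSD before touching its store
systemctl stop ceph-osd@12

# See which PGs live in this OSD's store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list-pgs

# Keep a copy of the PG before removing it
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --op export --pgid 2.1a --file /root/2.1a.export

# Remove the PG from this OSD's store (FileStore OSDs may also need --journal-path)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --op remove --pgid 2.1a --force

systemctl start ceph-osd@12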
Help. I have a 60-node cluster and most of the OSDs decided to crash
themselves at the same time. They won't restart; the messages look like...
--- begin dump of recent events ---
0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal
(Aborted) **
in thread 7f57ab5b7d80 thread_name:c
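(The dump above is truncated; the full backtrace should be in the OSD log.
Assuming default log paths and a systemd deployment, with osd.0 as a
placeholder:)

# Full crash dump with backtrace, default log location
less /var/log/ceph/ceph-osd.0.log

# Or pull it from the journal
journalctl -u ceph-osd@0 --since "2018-10-02 21:00" | less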