Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-03 Thread Brett Chancellor
That turned out to be exactly the issue (and boy was it fun clearing PGs out on 71 OSDs). I think it's caused by a combination of two factors. 1. This cluster has way too many placement groups per OSD (just north of 800). It was fine when we first created all the pools, but upgrades (most recently t
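A quick way to see where a cluster stands on PG-per-OSD counts (a sketch using standard ceph CLI commands, not taken from the original post) is:

    ceph osd df tree          # the PGS column shows placement groups held per OSD
    ceph osd pool ls detail   # pg_num / pgp_num per pool, to see where the total comes from

Counts well above the commonly recommended ~100 PGs per OSD, like the ~800 mentioned above, put noticeably more peering and memory load on each OSD.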

Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-03 Thread Gregory Farnum
Yeah, don't run these commands blind. They are changing the local metadata of the PG in ways that may make it inconsistent with the overall cluster and result in lost data. Brett, it seems this issue has come up several times in the field but we haven't been able to reproduce it locally or get eno

Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Vasu Kulkarni
Can you file a tracker issue for this (http://tracker.ceph.com/projects/ceph/issues/new)? Once an email thread gets lengthy it is not a great way to track an issue. Ideally, full details of the environment (OS/Ceph versions, before/after state, workload info, tool used for the upgrade) are important if someone has to recreate it. There a
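For reference, the sort of environment detail that makes a tracker report reproducible can be gathered with standard commands (a sketch, not part of the original mail):

    ceph versions          # per-daemon Ceph versions (Luminous and later)
    ceph -s                # overall cluster health and status
    cat /etc/os-release    # OS release on the affected nodes
    ceph osd df tree       # per-OSD usage and PG counts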

Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Goktug Yildirim
Hi, Sorry to hear that. I’ve been battling with mine for 2 weeks :/ I corrected my OSDs with the following commands. My OSD logs (/var/log/ceph/ceph-OSDx.log) have a line containing log(ERR) with the PG number beside it, just before the crash dump. ceph-objectstore-tool --data-path /var/lib/ceph/os
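The full commands are cut off above, but the general shape of a ceph-objectstore-tool PG removal looks like the following (a sketch with hypothetical <id>/<pgid> placeholders, run only with the OSD stopped, and note the warning in the reply above: these commands change local PG metadata and can lose data):

    systemctl stop ceph-osd@<id>
    # take a safety export of the PG before removing it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --pgid <pgid> --op export --file /root/<pgid>.export
    # remove the local copy of the PG from this OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --pgid <pgid> --op remove --force

FileStore OSDs may also need --journal-path, and exact flags vary by release.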

[ceph-users] Help! OSDs across the cluster just crashed

2018-10-02 Thread Brett Chancellor
Help. I have a 60-node cluster and most of the OSDs decided to crash themselves at the same time. They won't restart, and the messages look like... --- begin dump of recent events --- 0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal (Aborted) ** in thread 7f57ab5b7d80 thread_name:c
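When many OSDs abort like this, the backtrace a few lines below the "Caught signal" marker in each OSD log is what identifies the common assert; something like the following (an assumed sketch, not from the original report) pulls it out for a bug report and shows the current damage:

    grep -A 60 'Caught signal (Aborted)' /var/log/ceph/ceph-osd.*.log | less
    ceph health detail     # which PGs/OSDs the cluster currently flags
    ceph osd stat          # how many OSDs are up vs. in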