Yeah, don't run these commands blind. They change the PG's local metadata
in ways that may make it inconsistent with the rest of the cluster and
result in lost data.

Brett, it seems this issue has come up several times in the field but we
haven't been able to reproduce it locally or get enough info to debug
what's going on: https://tracker.ceph.com/issues/21142
Maybe run through that ticket and see if you can contribute new logs or add
detail about possible sources?
-Greg

On Tue, Oct 2, 2018 at 3:18 PM Goktug Yildirim <goktug.yildi...@gmail.com> wrote:

> Hi,
>
> Sorry to hear that. I’ve been battling with mine for 2 weeks :/
>
> I’ve corrected my OSDs with the following commands. My OSD logs
> (/var/log/ceph/ceph-OSDx.log) have a line containing log(ERR) with the PG
> number next to it, just before the crash dump.
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op trim-pg-log --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op fix-lost --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op mark-complete --pgid $2
> systemctl restart ceph-osd@$1
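>
> Roughly, I run them as a small script per OSD, something like this (just a
> sketch of what I do, the script name is only an example and it is not tested
> on your setup; the OSD must be stopped before ceph-objectstore-tool touches
> its data):
>
> #!/bin/bash
> # usage: ./fix-pg.sh <osd-id> <pg-id>   (example name, call it whatever you like)
> OSD=$1
> PG=$2
> # the OSD daemon has to be down so the tool gets exclusive access to the store
> systemctl stop ceph-osd@"$OSD"
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-"$OSD"/ --op trim-pg-log --pgid "$PG"
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-"$OSD"/ --op fix-lost --pgid "$PG"
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-"$OSD"/ --op repair --pgid "$PG"
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-"$OSD"/ --op mark-complete --pgid "$PG"
> systemctl start ceph-osd@"$OSD"
>
> I take the PG id from the log(ERR) line in the OSD log by hand;
> double-check it before running anything.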
>
> I don't know if it will work for you, but it may do no harm to try it on one OSD.
>
> There is very little information about these tools, so it might be risky. I
> hope someone more experienced can help more.
>
>
> > On 2 Oct 2018, at 23:23, Brett Chancellor <bchancel...@salesforce.com> wrote:
> >
> > Help. I have a 60-node cluster and most of the OSDs decided to crash
> > themselves at the same time. They won't restart; the messages look like...
> >
> > --- begin dump of recent events ---
> >      0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal (Aborted) **
> >  in thread 7f57ab5b7d80 thread_name:ceph-osd
> >
> >  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
> >  1: (()+0xa3c611) [0x556d618bb611]
> >  2: (()+0xf6d0) [0x7f57a885e6d0]
> >  3: (gsignal()+0x37) [0x7f57a787f277]
> >  4: (abort()+0x148) [0x7f57a7880968]
> >  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x556d618fa6e4]
> >  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x556d615c74a2]
> >  7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x556d615ae6c0]
> >  8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
> >  9: (OSD::load_pgs()+0x545) [0x556d61373095]
> >  10: (OSD::init()+0x2169) [0x556d613919d9]
> >  11: (main()+0x2d07) [0x556d61295dd7]
> >  12: (__libc_start_main()+0xf5) [0x7f57a786b445]
> >  13: (()+0x4b53e3) [0x556d613343e3]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >
> >
> > Some hosts have no working OSDs, others seem to have 1 working, and 2
> > dead. It's spread all across the cluster, across several different racks.
> > Any idea on where to look next? The cluster is dead in the water right now.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
