Can you file a tracker issue for this
(http://tracker.ceph.com/projects/ceph/issues/new)? Once an email thread
gets lengthy, it is not a good way to track an issue. Ideally, include full
details of the environment (OS and Ceph versions before and after the
upgrade, workload info, and the tool used for the upgrade), since these are
important if someone has to recreate the problem. There are various upgrade
tests in the suite, so this case might have been missed. Thanks
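
For anyone gathering those details, here is a minimal sketch of a collection
script. The choice of commands is an assumption about what is useful, not an
official checklist, and the names section and collect_env are made up for
this example. The cluster-wide commands (ceph versions, ceph health detail)
need reachable monitors.

```shell
#!/bin/sh
# Sketch: gather basic environment details to attach to a tracker issue.
# The command list below is an assumption, not an official checklist.

# Print a header, then the command's output, or a note if it cannot run.
section() {
    printf '== %s ==\n' "$*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || printf '(command failed)\n'
    else
        printf '(%s not found)\n' "$1"
    fi
}

collect_env() {
    section uname -r              # kernel version
    section cat /etc/os-release   # OS distribution and version
    section ceph --version        # locally installed Ceph version
    section ceph versions         # versions of running daemons (Luminous and later)
    section ceph health detail    # current cluster health
}

# Write the report only when an output file is named,
# e.g.: ./collect_env.sh report.txt
if [ "$#" -eq 1 ]; then
    collect_env > "$1"
fi
```

Workload information and the upgrade tool used still have to be written up
by hand; the script only covers what can be queried.
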
On Tue, Oct 2, 2018 at 3:18 PM Goktug Yildirim
<goktug.yildi...@gmail.com> wrote:
>
> Hi,
>
> Sorry to hear that. I’ve been battling with mine for 2 weeks :/
>
> I’ve fixed my OSDs with the following commands. My OSD logs
> (/var/log/ceph/ceph-OSDx.log) have a line containing log(EER) with the PG
> number next to it, just before the crash dump.
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op trim-pg-log --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op fix-lost --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op mark-complete --pgid $2
> systemctl restart ceph-osd@$1
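
The commands quoted above take the OSD id as $1 and the PG id as $2, so they
can be wrapped in a small script. The following is only a sketch under some
assumptions: the DRY_RUN variable and the run and repair_pg names are
hypothetical, the OSD daemon is assumed to be stopped already (it had crashed
in this case), and stopping at the first failing step is a choice made here,
not something the original commands do. With DRY_RUN left at its default of
1, the script only prints what it would run.

```shell
#!/bin/sh
# Sketch only: wraps the four ceph-objectstore-tool steps and the restart.
# DRY_RUN is a hypothetical safety switch; 1 (default) prints commands only.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# repair_pg OSD_ID PGID: run the steps in the order given in the mail.
repair_pg() {
    rp_osd_id="$1"
    rp_pgid="$2"
    rp_data_path="/var/lib/ceph/osd/ceph-${rp_osd_id}/"
    for rp_op in trim-pg-log fix-lost repair mark-complete; do
        # Assumption: abort on the first failure instead of continuing.
        run ceph-objectstore-tool --data-path "$rp_data_path" \
            --op "$rp_op" --pgid "$rp_pgid" || return 1
    done
    run systemctl restart "ceph-osd@${rp_osd_id}"
}

# Run only when both arguments are given,
# e.g.: DRY_RUN=0 ./repair_pg.sh <osd-id> <pgid>
if [ "$#" -eq 2 ]; then
    repair_pg "$1" "$2"
fi
```

Reviewing the dry-run output before setting DRY_RUN=0 gives a chance to
confirm the OSD id and PG id match the ones from the OSD log.
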
>
> I don't know if it will work for you, but it may do no harm to try it on one OSD.
>
> There is very little information about these tools, so it might be risky. I hope
> someone more experienced can help further.
>
>
> > On 2 Oct 2018, at 23:23, Brett Chancellor <bchancel...@salesforce.com>
> > wrote:
> >
> > Help. I have a 60-node cluster and most of the OSDs decided to crash
> > themselves at the same time. They won't restart; the messages look like...
> >
> > --- begin dump of recent events ---
> > 0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal
> > (Aborted) **
> > in thread 7f57ab5b7d80 thread_name:ceph-osd
> >
> > ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous
> > (stable)
> > 1: (()+0xa3c611) [0x556d618bb611]
> > 2: (()+0xf6d0) [0x7f57a885e6d0]
> > 3: (gsignal()+0x37) [0x7f57a787f277]
> > 4: (abort()+0x148) [0x7f57a7880968]
> > 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x284) [0x556d618fa6e4]
> > 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
> > const&)+0x3b2) [0x556d615c74a2]
> > 7: (PastIntervals::check_new_interval(int, int, std::vector<int,
> > std::allocator<int> > const&, std::vector<int, std::allocator<int> >
> > const&, int, int, std::vector<int, std::allocator<int> > const&,
> > std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int,
> > std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
> > IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
> > [0x556d615ae6c0]
> > 8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
> > 9: (OSD::load_pgs()+0x545) [0x556d61373095]
> > 10: (OSD::init()+0x2169) [0x556d613919d9]
> > 11: (main()+0x2d07) [0x556d61295dd7]
> > 12: (__libc_start_main()+0xf5) [0x7f57a786b445]
> > 13: (()+0x4b53e3) [0x556d613343e3]
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> > to interpret this.
> >
> >
> > Some hosts have no working OSDs, others seem to have 1 working, and 2 dead.
> > It's spread all across the cluster, across several different racks. Any
> > idea on where to look next? The cluster is dead in the water right now.
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com