Hello,

When I posted a crash of my own several days ago, nobody responded either. So I want to share my thoughts and maybe help you find it (even though I'm pretty new to Ceph and its code).
What I would do in your case:

- git checkout Ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable) from GitHub.
- IMO your crash is a failed assert close to:

  Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5649f7348d43]
  Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]
  Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]
  Nov 13 16:26:13 cn5 numactl: 7: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]

- Search the code for CallClientContexts::finish and the asserts in it that could fail.
- Try to figure out why the given assert is failing.

When I hit my crash, I built the monitors from source myself with more debugging information until I was able to solve it.

Hope it helps you out.

Greetings,
Sascha

nokia ceph <nokiacephus...@gmail.com> wrote on Wed., 13 Nov. 2019, 17:28:

> Hi,
>
> We have upgraded a 5-node Ceph cluster from Luminous to Nautilus and the cluster was running fine. Yesterday, when we tried to add one more OSD to the cluster, we found that the OSD was created in the cluster, but suddenly some of the other OSDs started to crash, and we are not able to restart any of the OSDs on the node where we found this issue. Because of this we are not able to add the OSDs on the other node and we are not able to bring up the cluster.
>
> The logs shown during the crash are below.
>
> Nov 13 16:26:13 cn5 numactl: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
> Nov 13 16:26:13 cn5 numactl: 1: (()+0xf5d0) [0x7f488bb0f5d0]
> Nov 13 16:26:13 cn5 numactl: 2: (gsignal()+0x37) [0x7f488a8ff207]
> Nov 13 16:26:13 cn5 numactl: 3: (abort()+0x148) [0x7f488a9008f8]
> Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5649f7348d43]
> Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]
> Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]
> Nov 13 16:26:13 cn5 numactl: 7: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]
> Nov 13 16:26:13 cn5 numactl: 8: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c) [0x5649f77ab02c]
> Nov 13 16:26:13 cn5 numactl: 9: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xd57) [0x5649f77c5627]
> Nov 13 16:26:13 cn5 numactl: 10: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x9f) [0x5649f77c60af]
> Nov 13 16:26:13 cn5 numactl: 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x87) [0x5649f76a3467]
> Nov 13 16:26:13 cn5 numactl: 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x695) [0x5649f764f365]
> Nov 13 16:26:13 cn5 numactl: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x5649f7489ea9]
> Nov 13 16:26:13 cn5 numactl: 14: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5649f77275d2]
> Nov 13 16:26:13 cn5 numactl: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x5649f74a6ef4]
> Nov 13 16:26:13 cn5 numactl: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x5649f7aa5ce3]
> Nov 13 16:26:13 cn5 numactl: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5649f7aa8d80]
> Nov 13 16:26:13 cn5 numactl: 18: (()+0x7dd5) [0x7f488bb07dd5]
> Nov 13 16:26:13 cn5 numactl: 19: (clone()+0x6d) [0x7f488a9c6ead]
> Nov 13 16:26:13 cn5 numactl: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> Nov 13 16:26:13 cn5 systemd: ceph-osd@279.service: main process exited, code=killed, status=6/ABRT
>
> Could you please let us know what might be the issue and how to debug this?
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
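P.S. The steps I described above can be sketched as a few shell commands. This is only a sketch: the source file for CallClientContexts (src/osd/ECBackend.cc) and the ceph-osd binary path are assumptions from my reading of the trace, so adjust them to your distribution. Note that addr2line only gives useful output against an unstripped binary or with the matching debuginfo package installed.

```shell
# Sketch, assuming a checkout of Ceph at the commit printed in the crash header:
#   git clone https://github.com/ceph/ceph.git && cd ceph
#   git checkout 4f8fa0a0024755aae7d95567c63f11d6862d55be   # v14.2.2
# then list the asserts near CallClientContexts::finish (path is an assumption):
#   grep -n ceph_assert src/osd/ECBackend.cc

# Frame 6 in the trace has no symbol name, only an offset into the ceph-osd
# binary. Pull the offset out of the log line so it can be fed to addr2line.
line='Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]'
offset=$(printf '%s\n' "$line" | sed -n 's/.*(()+\(0x[0-9a-f]*\)).*/\1/p')
echo "$offset"

# With an unstripped ceph-osd (or the matching -debuginfo package installed),
# resolve the offset to a file:line in the source:
#   addr2line -Cfe /usr/bin/ceph-osd "$offset"
```

Resolving that anonymous frame 6 should point you at the exact ceph_assert between __ceph_assertf_fail and CallClientContexts::finish.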