Hello,

When I posted a crash a few days ago, nobody responded either. So I want
to share my thoughts and maybe help you track it down (even though I'm
pretty new to Ceph and its code).


What I would do in your case:

- Check out the Ceph source from GitHub at exactly the version you are
running, i.e. 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
nautilus (stable).

- IMO your crash comes from a failed assert close to:


Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char
const*, int, char const*)+0x199) [0x5649f7348d43]

Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*,
char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]

Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]

Nov 13 16:26:13 cn5 numactl: 7:
(CallClientContexts::finish(std::pair<RecoveryMessages*,
ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]

- Search the code for CallClientContexts::finish and look at the asserts
there that could be failing (see the sketch below).

- Try to figure out why it is failing, i.e. under what conditions that
assert's check can be violated.

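To make the last two points more concrete: the OSD log normally prints a
"FAILED ceph_assert(...)" line with the source file and line number just
above the backtrace, and frame 4 (ceph::__ceph_assert_fail) is what a
failed ceph_assert produces, so the condition that fired is reached from
CallClientContexts::finish (frame 7), which should be in
src/osd/ECBackend.cc. Below is a rough, self-contained sketch of the kind
of consistency checks to grep for there; the names and conditions are
simplified stand-ins, not the actual Ceph source:

// Rough sketch only -- simplified stand-ins, NOT the real ECBackend code.
// In Ceph, ceph_assert(cond) calls ceph::__ceph_assert_fail() when cond is
// false, which is what shows up as frame 4 in the backtrace above.
#include <cassert>
#include <map>
#include <vector>

struct read_result_t {          // simplified stand-in for ECBackend::read_result_t
  int r = 0;                    // overall result of the sub-reads
  std::map<int, int> errors;    // per-shard errors collected while reading
  std::vector<int> returned;    // one entry per extent actually returned
};

struct CallClientContextsSketch {
  std::vector<int> to_read;     // extents this read op asked for

  void finish(read_result_t &res) {
    // The asserts to look for are consistency checks of this shape:
    // every requested extent must have been returned, and no shard error
    // may be left over by the time the read op completes.
    assert(res.returned.size() == to_read.size()); // ceph_assert(...) in the real code
    assert(res.errors.empty());                    // ceph_assert(...) in the real code
    // ... decode the EC shards and hand the data to the waiting client ...
  }
};

int main() {
  CallClientContextsSketch c;
  c.to_read = {0};
  read_result_t res;
  res.returned = {0};
  c.finish(res);                // passes; a mismatch here would abort, like the OSD does
  return 0;
}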

In my case I ended up building the monitors from source myself with more
debugging information until I was able to solve it.

Hope it helps you out.

Greetings
Sascha

nokia ceph <nokiacephus...@gmail.com> wrote on Wed, Nov 13, 2019, 17:28:

> Hi,
>
> We have upgraded a 5-node Ceph cluster from Luminous to Nautilus and the
> cluster was running fine. Yesterday, when we tried to add one more OSD to
> the cluster, the OSD was created, but suddenly some of the other OSDs
> started to crash and we are not able to restart any of the OSDs on the
> particular node where we found this issue. Because of this we are not able
> to add OSDs on other nodes and we are not able to bring the cluster up.
>
> The logs shown during the crash are below.
>
>
> Nov 13 16:26:13 cn5 numactl: ceph version 14.2.2
> (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
> Nov 13 16:26:13 cn5 numactl: 1: (()+0xf5d0) [0x7f488bb0f5d0]
> Nov 13 16:26:13 cn5 numactl: 2: (gsignal()+0x37) [0x7f488a8ff207]
> Nov 13 16:26:13 cn5 numactl: 3: (abort()+0x148) [0x7f488a9008f8]
> Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*,
> char const*, int, char const*)+0x199) [0x5649f7348d43]
> Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*,
> char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]
> Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]
> Nov 13 16:26:13 cn5 numactl: 7:
> (CallClientContexts::finish(std::pair<RecoveryMessages*,
> ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]
> Nov 13 16:26:13 cn5 numactl: 8:
> (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c)
> [0x5649f77ab02c]
> Nov 13 16:26:13 cn5 numactl: 9:
> (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
> RecoveryMessages*, ZTracer::Trace const&)+0xd57) [0x5649f77c5627]
> Nov 13 16:26:13 cn5 numactl: 10:
> (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x9f)
> [0x5649f77c60af]
> Nov 13 16:26:13 cn5 numactl: 11:
> (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x87)
> [0x5649f76a3467]
> Nov 13 16:26:13 cn5 numactl: 12:
> (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x695) [0x5649f764f365]
> Nov 13 16:26:13 cn5 numactl: 13:
> (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>,
> ThreadPool::TPHandle&)+0x1a9) [0x5649f7489ea9]
> Nov 13 16:26:13 cn5 numactl: 14: (PGOpItem::run(OSD*, OSDShard*,
> boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5649f77275d2]
> Nov 13 16:26:13 cn5 numactl: 15: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x9f4) [0x5649f74a6ef4]
> Nov 13 16:26:13 cn5 numactl: 16:
> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433)
> [0x5649f7aa5ce3]
> Nov 13 16:26:13 cn5 numactl: 17:
> (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5649f7aa8d80]
> Nov 13 16:26:13 cn5 numactl: 18: (()+0x7dd5) [0x7f488bb07dd5]
> Nov 13 16:26:13 cn5 numactl: 19: (clone()+0x6d) [0x7f488a9c6ead]
> Nov 13 16:26:13 cn5 numactl: NOTE: a copy of the executable, or `objdump
> -rdS <executable>` is needed to interpret this.
> Nov 13 16:26:13 cn5 systemd: ceph-osd@279.service: main process exited,
> code=killed, status=6/ABRT
>
>
> Could you please let us know what might be the issue and how to debug this?
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
