On Tue, Oct 17, 2017 at 9:51 AM Ana Aviles <a...@greenhost.nl> wrote:

> Hello all,
>
> We had an inconsistent PG on our cluster. While performing a PG repair
> operation, the OSD crashed. The OSD is no longer able to start, and
> there was no hardware failure on the disk itself. This is the log
> output:
>
> 2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
> 2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 3 errors, 1 fixed
> 2017-10-17 17:48:56.047896 7f234930d700 -1 /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f234930d700 time 2017-10-17 17:48:55.924115
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())
>

Hmm. The OSD got a push op containing a snapshot it doesn't think should
exist. I also see that there's a comment "// hmm, should we warn?" on that
assert.
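For reference, the failing check is in PrimaryLogPG::on_local_recover(),
where a recovered clone is registered with the SnapMapper. Roughly
paraphrased (not the literal 12.2.1 source, so the surrounding code may
differ slightly), it looks like:

    // paraphrased from PrimaryLogPG::on_local_recover()
    if (recovery_info.soid.is_snap()) {
      set<snapid_t> snaps;
      // look up this clone's snaps in the SnapSet that came with the push op
      auto p = recovery_info.ss.clone_snaps.find(hoid.snap);
      assert(p != recovery_info.ss.clone_snaps.end());  // hmm, should we warn?
      snaps.insert(p->second.begin(), p->second.end());
      // ... then record the object->snaps mapping in the SnapMapper ...
    }

So the SnapSet pushed to the recovering OSD doesn't list the clone's snap
at all, which is what trips the assert.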

Can you take a full log with "debug osd = 20" set, post it with
ceph-post-file, and create a ticket on tracker.ceph.com?
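
Something along these lines should do it (osd.N and the log path are
placeholders for your setup; adjust as needed):

    # ceph.conf on the affected host, before restarting the crashing OSD
    [osd]
        debug osd = 20

    # restart the OSD so it hits the assert again with verbose logging,
    # then upload the resulting log (default path shown)
    ceph-post-file /var/log/ceph/ceph-osd.N.log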

Are all your OSDs running that same version?
-Greg


>
>  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56236c8ff3f2]
>  2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x56236c476213]
>  3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x693) [0x56236c60d4d3]
>  4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2b5) [0x56236c60dd75]
>  5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x20c) [0x56236c61196c]
>  6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x56236c521aa0]
>  7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
>  8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x56236c3091a9]
>  9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x56236c5a2ae7]
>  10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
>  11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x56236c9041e4]
>  12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56236c907220]
>  13: (()+0x76ba) [0x7f2366be96ba]
>  14: (clone()+0x6d) [0x7f2365c603dd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Thanks!
>
> Ana
>