On Tue, Oct 17, 2017 at 9:51 AM Ana Aviles <a...@greenhost.nl> wrote:
> Hello all,
>
> We had an inconsistent PG on our cluster. While performing a PG
> repair operation the OSD crashed. The OSD was not able to start
> again, and there was no hardware failure on the disk itself. This is
> the log output:
>
> 2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log
> [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
> 2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log
> [ERR] : 2.2fc repair 3 errors, 1 fixed
> 2017-10-17 17:48:56.047896 7f234930d700 -1
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void
> PrimaryLogPG::on_local_recover(const hobject_t&, const
> ObjectRecoveryInfo&, ObjectContextRef, bool,
> ObjectStore::Transaction*)' thread 7f234930d700 time 2017-10-17
> 17:48:55.924115
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p !=
> recovery_info.ss.clone_snaps.end())

Hmm. The OSD got a push op containing a snapshot it doesn't think
should exist. I also see that there's a comment "// hmm, should we
warn?" on that assert.

Can you take a full log with "debug osd = 20" set, post it with
ceph-post-file, and create a ticket on tracker.ceph.com? Are all your
OSDs running that same version? (Sketches of the failing check and of
the commands to gather that log follow below the quoted trace.)
-Greg

> ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
> luminous (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x56236c8ff3f2]
> 2: (PrimaryLogPG::on_local_recover(hobject_t const&,
> ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool,
> ObjectStore::Transaction*)+0xd63) [0x56236c476213]
> 3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp
> const&, PullOp*,
> std::__cxx11::list<ReplicatedBackend::pull_complete_info,
> std::allocator<ReplicatedBackend::pull_complete_info> >*,
> ObjectStore::Transaction*)+0x693) [0x56236c60d4d3]
> 4: (ReplicatedBackend::_do_pull_response(
> boost::intrusive_ptr<OpRequest>)+0x2b5) [0x56236c60dd75]
> 5: (ReplicatedBackend::_handle_message(
> boost::intrusive_ptr<OpRequest>)+0x20c) [0x56236c61196c]
> 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50)
> [0x56236c521aa0]
> 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
> 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9)
> [0x56236c3091a9]
> 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
> const&)+0x57) [0x56236c5a2ae7]
> 10: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
> 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> int)+0x884) [0x56236c9041e4]
> 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> [0x56236c907220]
> 13: (()+0x76ba) [0x7f2366be96ba]
> 14: (clone()+0x6d) [0x7f2365c603dd]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> Thanks!
>
> Ana
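For context on what fired: the assert lives in
PrimaryLogPG::on_local_recover(). The sketch below is reconstructed
from the assert text and the backtrace, not quoted verbatim from the
12.2.1 sources, so treat everything outside the assert line itself as
paraphrase:

    // Schematic of the failing check in on_local_recover().
    // clone_snaps maps each clone of an object to the snap ids it
    // serves; the snapset here arrived with the push op from a peer.
    if (recovery_info.soid.is_snap()) {
      auto p = recovery_info.ss.clone_snaps.find(recovery_info.soid.snap);
      // The pushed clone is not in the snapset's clone_snaps map, i.e.
      // the OSD was handed a snapshot it doesn't think should exist:
      assert(p != recovery_info.ss.clone_snaps.end());  // hmm, should we warn?
    }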
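To gather the log: since the OSD crashes at startup, injecting the
debug level into the running daemon isn't an option, so set it in
ceph.conf and restart instead. osd.12, the systemd unit name, and the
log path below are stand-ins assuming a stock package install;
substitute your crashing OSD's id and your deployment's paths:

    # ceph.conf on the OSD's host:
    [osd.12]
        debug osd = 20

    # restart so the crash is captured with verbose logging:
    systemctl restart ceph-osd@12

    # upload the log; ceph-post-file prints a tag you can paste
    # into the tracker ticket:
    ceph-post-file /var/log/ceph/ceph-osd.12.log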
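And for the version question, a Luminous cluster can report what every
running daemon is on:

    ceph versions            # version breakdown by daemon type
    ceph tell osd.* version  # per-OSD, if you want the full list

The crashed OSD won't show up while it's down, so check the installed
package version on its host as well.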
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com