It looks to me like this is related to http://tracker.ceph.com/issues/18162.
You might see if they came up with good resolution steps, and it looks like David is working on it in master but hasn't finished it yet.
-Greg

On Sat, Jun 3, 2017 at 2:47 AM Ashley Merrick <ash...@amerrick.co.uk> wrote:

> Attaching with logging to level 20.
>
> After repeated attempts at removing nobackfill I have got it down to:
>
> recovery 31892/272325586 objects degraded (0.012%)
> recovery 2/272325586 objects misplaced (0.000%)
>
> However, any further attempt after removing nobackfill just causes an
> instant crash on 83 & 84. At this point I feel there is some corruption on
> the remaining 11 OSDs of the PG, although the errors aren't directly saying
> that. The crash always ends with:
>
> -1 *** Caught signal (Aborted) ** in thread 7f716e862700
> thread_name:tp_osd_recov
>
> ,Ashley
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Ashley Merrick
> *Sent:* 03 June 2017 17:14
> *To:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] PG Stuck EC Pool
>
> I have now done some further testing and am seeing these errors on 84 / 83,
> the OSDs that crash while backfilling to 10,11:
>
> -60> 2017-06-03 10:08:56.651768 7f6f76714700 1 --
> 172.16.3.14:6823/2694 <== osd.3 172.16.2.101:0/25361 10 ====
> osd_ping(ping e71688 stamp 2017-06-03 10:08:56.652035) v2 ==== 47+0+0
> (1097709006 0 0) 0x5569ea88d400 con 0x5569e900e300
>
> -59> 2017-06-03 10:08:56.651804 7f6f76714700 1 --
> 172.16.3.14:6823/2694 --> 172.16.2.101:0/25361 -- osd_ping(ping_reply
> e71688 stamp 2017-06-03 10:08:56.652035) v2 -- ?+0 0x5569e985fc00 con
> 0x5569e900e300
>
> -6> 2017-06-03 10:08:56.937156 7f6f5ee4d700 1 --
> 172.16.3.14:6822/2694 <== osd.53 172.16.3.7:6816/15230 13 ====
> MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0)) v1
> ==== 148+0+0 (2355392791 0 0) 0x5569e8b22080 con 0x5569e9538f00
>
> -5> 2017-06-03 10:08:56.937193 7f6f5ee4d700 5 -- op tracker -- seq:
> 2409, time: 2017-06-03 10:08:56.937193, event: queued_for_pg, op:
> MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0))
>
> -4> 2017-06-03 10:08:56.937241 7f6f8ef8a700 5 -- op tracker -- seq:
> 2409, time: 2017-06-03 10:08:56.937240, event: reached_pg, op:
> MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0))
>
> -3> 2017-06-03 10:08:56.937266 7f6f8ef8a700 0 osd.83 pg_epoch: 71688
> pg[6.14s3( v 71685'35512 (68694'30812,71685'35512] local-les=71688 n=15928
> ec=31534 les/c/f 71688/69510/67943 71687/71687/71687)
> [11,10,2147483647,83,22,26,69,72,53,59,8,4,46]/[2147483647,2147483647,2147483647,83,22,26,69,72,53,59,8,4,46]
> r=3 lpr=71687 pi=47065-71686/711 rops=1 bft=10(1),11(0) crt=71629'35509
> mlcod 0'0 active+undersized+degraded+remapped+inconsistent+backfilling
> NIBBLEWISE] failed_push
> 6:28170432:::rbd_data.e3d8852ae8944a.0000000000047d28:head from shard
> 53(8), reps on unfound? 0
>
> -2> 2017-06-03 10:08:56.937346 7f6f8ef8a700 5 -- op tracker -- seq:
> 2409, time: 2017-06-03 10:08:56.937345, event: done, op:
> MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0))
>
> -1> 2017-06-03 10:08:56.937351 7f6f89f80700 -1 osd.83 pg_epoch: 71688
> pg[6.14s3( v 71685'35512 (68694'30812,71685'35512] local-les=71688 n=15928
> ec=31534 les/c/f 71688/69510/67943 71687/71687/71687)
> [11,10,2147483647,83,22,26,69,72,53,59,8,4,46]/[2147483647,2147483647,2147483647,83,22,26,69,72,53,59,8,4,46]
> r=3 lpr=71687 pi=47065-71686/711 bft=10(1),11(0) crt=71629'35509 mlcod 0'0
> active+undersized+degraded+remapped+inconsistent+backfilling *NIBBLEWISE]
> recover_replicas: object added to missing set for backfill, but is not in
> recovering, error!*
>
> -42> 2017-06-03 10:08:56.968433 7f6f5f04f700 1 --
> 172.16.2.114:6822/2694 <== client.22857445 172.16.2.212:0/2238053329 56
> ==== osd_op(client.22857445.1:759236283 2.e732321d
> rbd_data.61b4c6238e1f29.000000000001ea27 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 126976~45056] snapc 0=[] ondisk+write
> e71688) v4 ==== 217+0+45056 (2626314663 0 3883338397) 0x5569ea886b00 con
> 0x5569ea99c880
>
> *From:* Ashley Merrick
> *Sent:* 03 June 2017 14:27
> *To:* 'ceph-us...@ceph.com' <ceph-us...@ceph.com>
> *Subject:* RE: PG Stuck EC Pool
>
> From this extract from pg query:
>
> "up": [
>     11,
>     10,
>     84,
>     83,
>     22,
>     26,
>     69,
>     72,
>     53,
>     59,
>     8,
>     4,
>     46
> ],
> "acting": [
>     2147483647,
>     2147483647,
>     84,
>     83,
>     22,
>     26,
>     69,
>     72,
>     53,
>     59,
>     8,
>     4,
>     46
>
> I am wondering if there is an issue on 11, 10 causing the current active
> primary ("acting_primary": 84) to crash.
>
> But I can't see anything that could be causing it.
>
> ,Ashley
>
> *From:* Ashley Merrick
> *Sent:* 01 June 2017 23:39
> *To:* ceph-us...@ceph.com
> *Subject:* RE: PG Stuck EC Pool
>
> Have attached the full pg query for the affected PG in case this shows
> anything of interest.
>
> Thanks
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Ashley Merrick
> *Sent:* 01 June 2017 17:19
> *To:* ceph-us...@ceph.com
> *Subject:* [ceph-users] PG Stuck EC Pool
>
> Have a PG which is stuck in this state (it is an EC pool with K=10, M=3):
>
> pg 6.14 is active+undersized+degraded+remapped+inconsistent+backfilling,
> acting [2147483647,2147483647,84,83,22,26,69,72,53,59,8,4,46]
>
> I currently have norecover set. If I unset norecover, both OSD 83 and 84
> start to flap (going up and down), and I see the following in the OSD
> logs:
>
> *****
>
> -5> 2017-06-01 10:08:29.658593 7f430ec97700 1 --
> 172.16.3.14:6806/5204 <== osd.17 172.16.3.3:6806/2006016 57 ====
> MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152,
> last_complete=0'0, committed=0, applied=1)) v1 ==== 67+0+0 (245959818 0 0)
> 0x563c9db7be00 con 0x563c9cfca480
>
> -4> 2017-06-01 10:08:29.658620 7f430ec97700 5 -- op tracker -- seq:
> 2367, time: 2017-06-01 10:08:29.658620, event: queued_for_pg, op:
> MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152,
> last_complete=0'0, committed=0, applied=1))
>
> -3> 2017-06-01 10:08:29.658649 7f4319e11700 5 -- op tracker -- seq:
> 2367, time: 2017-06-01 10:08:29.658649, event: reached_pg, op:
> MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152,
> last_complete=0'0, committed=0, applied=1))
>
> -2> 2017-06-01 10:08:29.658661 7f4319e11700 5 -- op tracker -- seq:
> 2367, time: 2017-06-01 10:08:29.658660, event: done, op:
> MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152,
> last_complete=0'0, committed=0, applied=1))
>
> -1> 2017-06-01 10:08:29.663107 7f43320ec700 5 -- op tracker -- seq:
> 2317, time: 2017-06-01 10:08:29.663107, event: sub_op_applied, op:
> osd_op(osd.79.66617:8675008 6.82058b1a
> rbd_data.e5208a238e1f29.0000000000025f3e [copy-from ver 4678410] snapc 0=[]
> ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e71513)
>
> 0> 2017-06-01 10:08:29.663474 7f4319610700 -1 *** Caught signal (Aborted) **
> in thread 7f4319610700 thread_name:tp_osd_recov
>
> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
> 1: (()+0x9564a7) [0x563c6a6f24a7]
> 2: (()+0xf890) [0x7f4342308890]
> 3: (gsignal()+0x37) [0x7f434034f067]
> 4: (abort()+0x148) [0x7f4340350448]
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x563c6a7f83d6]
> 6: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x62f) [0x563c6a2850ff]
> 7: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0xa8a) [0x563c6a2b878a]
> 8: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x36d) [0x563c6a131bbd]
> 9: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x1d) [0x563c6a17c88d]
> 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa9f) [0x563c6a7e8e3f]
> 11: (ThreadPool::WorkThread::entry()+0x10) [0x563c6a7e9d70]
> 12: (()+0x8064) [0x7f4342301064]
> 13: (clone()+0x6d) [0x7f434040262d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> *****
>
> What should my next steps be?
>
> Thanks!
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
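
For reference, the up/acting sets in the pg query extract quoted above can be read mechanically. The following is a minimal Python sketch, not something posted in the thread. It assumes that 2147483647 is the placeholder Ceph prints when no OSD currently holds a shard (CRUSH_ITEM_NONE), and the function name `summarize` is hypothetical. It lists which shards have no acting OSD, which OSDs in the up set are their backfill targets, and how many more shards could be lost before fewer than K remain.

```python
# Assumption: 2147483647 (0x7fffffff) is the CRUSH_ITEM_NONE placeholder,
# i.e. "no OSD is currently acting for this shard".
CRUSH_ITEM_NONE = 2147483647

# Values copied from the pg query extract quoted in the thread.
up = [11, 10, 84, 83, 22, 26, 69, 72, 53, 59, 8, 4, 46]
acting = [CRUSH_ITEM_NONE, CRUSH_ITEM_NONE, 84, 83, 22, 26, 69, 72, 53, 59, 8, 4, 46]
K, M = 10, 3  # EC profile stated in the first message


def summarize(up, acting, k, m):
    """Hypothetical helper: return (missing shards, backfill targets, spare shards)."""
    assert len(up) == len(acting) == k + m
    # Shard ids (list positions) that have no acting OSD.
    missing = [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]
    # For each missing shard, the OSD the up set wants to place it on.
    targets = {i: up[i] for i in missing if up[i] != CRUSH_ITEM_NONE}
    # Shards still present, minus the K needed to reconstruct objects.
    spare = (len(acting) - len(missing)) - k
    return missing, targets, spare


missing, targets, spare = summarize(up, acting, K, M)
print("shards without an acting OSD:", missing)
print("backfill targets (shard -> OSD):", targets)
print("shards that can still be lost:", spare)
```

With the values above this reports shards 0 and 1 as missing, with OSDs 11 and 10 as their targets, which lines up with bft=10(1),11(0) in the log, and one spare shard. In the crash log, where a third placeholder appears in the acting set in place of 84, the same arithmetic leaves zero spare shards.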