Calvin,

What does your crushmap look like?
ceph osd tree
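
If you want the full decompiled map rather than just the tree, something along
these lines should also work (file paths here are just examples):

  ceph osd getcrushmap -o /tmp/crushmap.bin
  crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt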

I find it strange that 1023 PGs are undersized when only one OSD failed.
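
To see which OSDs those undersized PGs map to (i.e. whether they all share the
failed OSD), something along these lines should give a quick picture:

  ceph pg dump_stuck unclean | head -20
  ceph health detail | grep undersized | head -20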

Bob

On Thu, Mar 31, 2016 at 9:27 AM, Calvin Morrow <calvin.mor...@gmail.com>
wrote:

>
>
> On Wed, Mar 30, 2016 at 5:24 PM Christian Balzer <ch...@gol.com> wrote:
>
>> On Wed, 30 Mar 2016 15:50:07 +0000 Calvin Morrow wrote:
>>
>> > On Wed, Mar 30, 2016 at 1:27 AM Christian Balzer <ch...@gol.com> wrote:
>> >
>> > >
>> > > Hello,
>> > >
>> > > On Tue, 29 Mar 2016 18:10:33 +0000 Calvin Morrow wrote:
>> > >
>> > > > Ceph cluster with 60 OSDs, Giant 0.87.2.  One of the OSDs failed due
>> > > > to a hardware error, however after normal recovery it seems stuck
>> > > > with one active+undersized+degraded+inconsistent pg.
>> > > >
>> > > Any reason (other than inertia, which I understand very well) you're
>> > > running a non-LTS version that last saw bug fixes a year ago?
>> > > You may very well be facing a bug that has long been fixed even in
>> > > Firefly, let alone Hammer.
>> > >
>> > I know we discussed Hammer several times, and I don't remember the exact
>> > reason we held off.  Other than that, inertia is probably the best
>> > answer I have.
>> >
>> Fair enough.
>>
>> I just seem to remember similar scenarios where recovery got stuck/hung
>> and thus would assume it was fixed in newer versions.
>>
>> If you google for "ceph recovery stuck" you find another potential
>> solution behind the RH paywall and this:
>>
>> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043894.html
>>
>> That would have been my next suggestion anyway; Ceph OSDs seem to take
>> well to the 'IT crowd' mantra of "Have you tried turning it off and on
>> again?". ^o^
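>>
>> Roughly, and assuming sysvinit on the OSD nodes (adjust to your init
>> system), that would be something like:
>>
>>   ceph osd set noout                # avoid marking OSDs out / rebalancing during the restart
>>   service ceph restart osd.36       # on the node hosting osd.36
>>   service ceph restart osd.52       # and the same for its peer
>>   ceph osd unset noout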
>>
> Yeah, unfortunately that was something I tried before reaching out on the
> mailing list.  It didn't seem to change anything.
>
> In particular, I was noticing that my "ceph pg repair 12.28a" command
> never seemed to be acknowledged by the OSD.  I was hoping for some sort of
> log message, even an 'ERR', but while I saw messages about other pg scrubs,
> nothing showed up for the problem PG.  I tried before and after an OSD
> restart (both OSDs) without any apparent change.
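>
> For anyone following along, this amounted to something along these lines
> (log path assuming the default location):
>
>   ceph pg repair 12.28a
>   tail -f /var/log/ceph/ceph-osd.36.log | grep 12.28a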
>
>>
>> > >
>> > > If so, hopefully one of the devs who remembers it can pipe up.
>> > >
>> > > > I haven't been able to get repair to happen using "ceph pg repair
>> > > > 12.28a"; I can see the activity logged in the mon logs, however the
>> > > > repair doesn't actually seem to happen in any of the OSD logs.
>> > > >
>> > > > I tried following Sebastien's instructions for manually locating the
>> > > > inconsistent object
>> > > > (http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/),
>> > > > however the md5sums of the two objects match, so I'm not quite
>> > > > sure how to proceed.
>> > > >
>> > > Rolling a dice? ^o^
>> > > Do they have similar (identical really) timestamps as well?
>> > >
>> > Yes, timestamps are identical.
>> >
>> Unsurprisingly.
>>
>> > >
>> > > > Any ideas on how to return to a healthy cluster?
>> > > >
>> > > > [root@soi-ceph2 ceph]# ceph status
>> > > >     cluster 6cc00165-4956-4947-8605-53ba51acd42b
>> > > >      health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent;
>> > > >             1023 pgs stuck degraded; 1099 pgs stuck unclean;
>> > > >             1023 pgs stuck undersized; 1023 pgs undersized;
>> > > >             recovery 132091/23742762 objects degraded (0.556%);
>> > > >             7745/23742762 objects misplaced (0.033%); 1 scrub errors
>> > > >      monmap e5: 3 mons at {soi-ceph1=10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0},
>> > > >             election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
>> > > >      osdmap e41120: 60 osds: 59 up, 59 in
>> > > >       pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728 kobjects
>> > > >             91295 GB used, 73500 GB / 160 TB avail
>> > > >             132091/23742762 objects degraded (0.556%);
>> > > >             7745/23742762 objects misplaced (0.033%)
>> > > >                60341 active+clean
>> > > >                   76 active+remapped
>> > > >                 1022 active+undersized+degraded
>> > > >                    1 active+undersized+degraded+inconsistent
>> > > >   client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
>> > > >
>> > > What's confusing to me in this picture are the stuck and unclean PGs
>> > > as well as the degraded objects; it seems that recovery has stopped?
>> > >
>> > Yeah ... recovery essentially halted.  I'm sure it's no accident that
>> > there are exactly 1023 (1024-1) unhealthy pgs.
>> >
>> > >
>> > > Something else that suggests a bug, or at least a stuck OSD.
>> > >
>> > > > [root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
>> > > > pg 12.28a is stuck unclean for 126274.215835, current state
>> > > > active+undersized+degraded+inconsistent, last acting [36,52]
>> > > > pg 12.28a is stuck undersized for 3499.099747, current state
>> > > > active+undersized+degraded+inconsistent, last acting [36,52]
>> > > > pg 12.28a is stuck degraded for 3499.107051, current state
>> > > > active+undersized+degraded+inconsistent, last acting [36,52]
>> > > > pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
>> > > >
>> > > > [root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
>> > > > ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700
>> > > > -1 log_channel(default) log [ERR] : 12.28a shard 20: soid
>> > > > c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12
>> > > > candidate had a read error, digest 2029411064 != known digest
>> > > > 2692480864
>> > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^
>> > > That's the culprit; google for it. Of course the most promising-looking
>> > > answer is behind the RH paywall.
>> > >
>> > This part is the most confusing for me.  To me, this should indicate
>> > that there was some kind of bitrot on the disk (I'd love for ZFS to be
>> > better supported here).  What I don't understand is that the actual
>> > object has identical md5sums, timestamps, etc.  I don't know if this
>> > means there was just a transient error that Ceph can't get over, or
>> > whether I'm mistakenly looking at the wrong object.  Maybe something
>> > stored in an xattr somewhere?
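>> >
>> > (Presumably the xattrs filestore keeps on the object file could be
>> > compared directly on both replicas, assuming getfattr is installed,
>> > with something like:
>> >
>> >   getfattr -d -m '.*' -e hex <path to the object file on each OSD>
>> >
>> > though I'm not sure how I'd interpret the contents.)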
>> >
>> I could think of more scenarios, not knowing in detail how either that
>> checksum or the md5sum works.
>> Like one going through the page cache while the other doesn't.
>> Or the checksum being corrupted, written out of order, etc.
>>
>> And transient errors should hopefully respond well to an OSD restart.
>>
> Unfortunately not this time.
>
>> > >
>> > > Looks like that disk has an issue; I guess you're not seeing this on
>> > > osd.52, right?
>> > >
>> > Correct.
>> >
>> > > Check osd.36's SMART status.
>> > >
>> > SMART is normal, no errors, all counters seem fine.
>> >
>> If there were an actual issue with the HDD, I'd expect to see at least
>> some Pending or Offline sectors.
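>>
>> For completeness, the quick checks I'd run there are something like
>> (sdX being whichever device backs osd.36):
>>
>>   smartctl -a /dev/sdX       # error counters, reallocated/pending sectors
>>   smartctl -t long /dev/sdX  # long self-test; check the results later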
>>
>> > >
>> > > My guess is that you may have to set min_size to 1 and recover osd.35
>> > > as well, but don't take my word for it.
>> > >
>> > Thanks for the suggestion.  I'm holding out for the moment in case
>> > someone else reads this and has an "aha" moment.  Right now I'm not sure
>> > which would be more dangerous: blowing away the object on osd.36 and
>> > hoping for recovery (with min_size 1), or trying a software upgrade on
>> > an unhealthy cluster (yuck).
>> >
>> Well, see above.
>>
>> And yeah, neither of those two alternatives is particularly alluring.
>> OTOH, you're looking at just one object versus a whole PG or OSD.
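>>
>> For the record, that route would be roughly (substituting whichever pool
>> has id 12 for <pool>):
>>
>>   ceph osd pool set <pool> min_size 1
>>   # ... wait for recovery to finish ...
>>   ceph osd pool set <pool> min_size 2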
>>
> The more I think about it, the more I'm convinced that your argument about
> it being a software error is right.  That makes the option of setting
> min_size less appealing, because I have doubts that even ridding myself of
> that object would be acted on appropriately.
>
> I think I'll look more into previous 'stuck recovery' issues and see how
> they were handled.  If the consensus for those was 'upgrade' even amidst an
> unhealthy status, we'll probably try that route.
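>
> (In the meantime I'll keep an eye on the stuck state with roughly:
>
>   ceph pg dump_stuck unclean
>   ceph pg 12.28a query
>
> the latter to see what the PG itself thinks is blocking recovery.)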
>
>> Christian
>>
>> > >
>> > > Christian
>> > >
>> > > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970413 7fe7b2897700
>> > > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 0 missing, 1
>> > > > inconsistent objects
>> > > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970423 7fe7b2897700
>> > > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 1 errors
>> > > >
>> > > > [root@soi-ceph2 ceph]# md5sum
>> > > > /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>> > > > \fb57b1f17421377bf2c35809f395e9b9
>> > > > /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>> > > >
>> > > > [root@soi-ceph3 ceph]# md5sum
>> > > > /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>> > > > \fb57b1f17421377bf2c35809f395e9b9
>> > > > /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>> > >
>> > >
>> > > --
>> > > Christian Balzer        Network/Systems Engineer
>> > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> > > http://www.gol.com/
>> > >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
