On Fri, Apr 1, 2016 at 4:42 PM Bob R <b...@drinksbeer.org> wrote:

> Calvin,
>
> What does your crushmap look like?
> ceph osd tree
>

[root@soi-ceph1 ~]# ceph osd tree
# id    weight  type name               up/down reweight
-1      163.8   root default
-2      54.6            host soi-ceph1
0       2.73                    osd.0   up      1
5       2.73                    osd.5   up      1
10      2.73                    osd.10  up      1
15      2.73                    osd.15  up      1
20      2.73                    osd.20  down    0
25      2.73                    osd.25  up      1
30      2.73                    osd.30  up      1
35      2.73                    osd.35  up      1
40      2.73                    osd.40  up      1
45      2.73                    osd.45  up      1
50      2.73                    osd.50  up      1
55      2.73                    osd.55  up      1
60      2.73                    osd.60  up      1
65      2.73                    osd.65  up      1
70      2.73                    osd.70  up      1
75      2.73                    osd.75  up      1
80      2.73                    osd.80  up      1
85      2.73                    osd.85  up      1
90      2.73                    osd.90  up      1
95      2.73                    osd.95  up      1
-3      54.6            host soi-ceph2
1       2.73                    osd.1   up      1
6       2.73                    osd.6   up      1
11      2.73                    osd.11  up      1
16      2.73                    osd.16  up      1
21      2.73                    osd.21  up      1
26      2.73                    osd.26  up      1
31      2.73                    osd.31  up      1
36      2.73                    osd.36  up      1
41      2.73                    osd.41  up      1
46      2.73                    osd.46  up      1
51      2.73                    osd.51  up      1
56      2.73                    osd.56  up      1
61      2.73                    osd.61  up      1
66      2.73                    osd.66  up      1
71      2.73                    osd.71  up      1
76      2.73                    osd.76  up      1
81      2.73                    osd.81  up      1
86      2.73                    osd.86  up      1
91      2.73                    osd.91  up      1
96      2.73                    osd.96  up      1
-4      54.6            host soi-ceph3
2       2.73                    osd.2   up      1
7       2.73                    osd.7   up      1
12      2.73                    osd.12  up      1
17      2.73                    osd.17  up      1
22      2.73                    osd.22  up      1
27      2.73                    osd.27  up      1
32      2.73                    osd.32  up      1
37      2.73                    osd.37  up      1
42      2.73                    osd.42  up      1
47      2.73                    osd.47  up      1
52      2.73                    osd.52  up      1
57      2.73                    osd.57  up      1
62      2.73                    osd.62  up      1
67      2.73                    osd.67  up      1
72      2.73                    osd.72  up      1
77      2.73                    osd.77  up      1
82      2.73                    osd.82  up      1
87      2.73                    osd.87  up      1
92      2.73                    osd.92  up      1
97      2.73                    osd.97  up      1
-5      0               host soi-ceph4
-6      0               host soi-ceph5
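In case the rule definitions matter as well as the hierarchy, I can also pull the full decompiled crushmap; roughly something like this (output filenames here are arbitrary, and the JSON dump is equivalent if crushtool isn't handy on the box):

  # fetch the binary CRUSH map from the monitors and decompile it
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # or dump the same information as JSON straight from the cluster
  ceph osd crush dump

Happy to paste the rules if that would help.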
>
> I find it strange that 1023 PGs are undersized when only one OSD failed.
>
> Bob
>
> On Thu, Mar 31, 2016 at 9:27 AM, Calvin Morrow <calvin.mor...@gmail.com>
> wrote:
>
>>
>>
>> On Wed, Mar 30, 2016 at 5:24 PM Christian Balzer <ch...@gol.com> wrote:
>>
>>> On Wed, 30 Mar 2016 15:50:07 +0000 Calvin Morrow wrote:
>>>
>>> > On Wed, Mar 30, 2016 at 1:27 AM Christian Balzer <ch...@gol.com>
>>> wrote:
>>> >
>>> > >
>>> > > Hello,
>>> > >
>>> > > On Tue, 29 Mar 2016 18:10:33 +0000 Calvin Morrow wrote:
>>> > >
>>> > > > Ceph cluster with 60 OSDs, Giant 0.87.2. One of the OSDs failed due
>>> > > > to a hardware error, however after normal recovery it seems stuck
>>> > > > with one active+undersized+degraded+inconsistent pg.
>>> > > >
>>> > > Any reason (other than inertia, which I understand very well) you're
>>> > > running a non-LTS version that last saw bug fixes a year ago?
>>> > > You may very well be facing a bug that has long been fixed even in
>>> > > Firefly, let alone Hammer.
>>> > >
>>> > I know we discussed Hammer several times, and I don't remember the exact
>>> > reason we held off. Other than that, inertia is probably the best
>>> > answer I have.
>>> >
>>> Fair enough.
>>>
>>> I just seem to remember similar scenarios where recovery got stuck/hung
>>> and thus would assume it was fixed in newer versions.
>>>
>>> If you google for "ceph recovery stuck" you find another potential
>>> solution behind the RH paywall and this:
>>> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043894.html
>>>
>>> That would have been my next suggestion anyway; Ceph OSDs seem to take
>>> well to the 'IT crowd' mantra of "Have you tried turning it off and on
>>> again?". ^o^
>>>
>> Yeah, unfortunately that was something I tried before reaching out on the
>> mailing list. It didn't seem to change anything.
>>
>> In particular, I was noticing that my "ceph pg repair 12.28a" command
>> never seemed to be acknowledged by the OSD. I was hoping for some sort of
>> log message, even an 'ERR', but while I saw messages about other pg scrubs,
>> nothing shows up for the problem PG. I tried before and after an OSD
>> restart (both OSDs) without any apparent change.
>>
>>>
>>> > >
>>> > > If so, hopefully one of the devs remembering it can pipe up.
>>> > >
>>> > > > I haven't been able to get repair to happen using "ceph pg repair
>>> > > > 12.28a"; I can see the activity logged in the mon logs, however the
>>> > > > repair doesn't actually seem to happen in any of the actual osd logs.
>>> > > >
>>> > > > I tried following Sebastien's instructions for manually locating the
>>> > > > inconsistent object
>>> > > > (http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/),
>>> > > > however the md5sums of the objects both match, so I'm not quite
>>> > > > sure how to proceed.
>>> > > >
>>> > > Rolling a dice? ^o^
>>> > > Do they have similar (identical, really) timestamps as well?
>>> > >
>>> > Yes, timestamps are identical.
>>> >
>>> Unsurprisingly.
>>>
>>> > >
>>> > > > Any ideas on how to return to a healthy cluster?
>>> > > >
>>> > > > [root@soi-ceph2 ceph]# ceph status
>>> > > >     cluster 6cc00165-4956-4947-8605-53ba51acd42b
>>> > > >      health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent; 1023 pgs stuck degraded; 1099 pgs stuck unclean; 1023 pgs stuck undersized; 1023 pgs undersized; recovery 132091/23742762 objects degraded (0.556%); 7745/23742762 objects misplaced (0.033%); 1 scrub errors
>>> > > >      monmap e5: 3 mons at {soi-ceph1=10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0}, election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
>>> > > >      osdmap e41120: 60 osds: 59 up, 59 in
>>> > > >       pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728 kobjects
>>> > > >             91295 GB used, 73500 GB / 160 TB avail
>>> > > >             132091/23742762 objects degraded (0.556%); 7745/23742762 objects misplaced (0.033%)
>>> > > >                60341 active+clean
>>> > > >                   76 active+remapped
>>> > > >                 1022 active+undersized+degraded
>>> > > >                    1 active+undersized+degraded+inconsistent
>>> > > >   client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
>>> > > >
>>> > > What's confusing to me in this picture are the stuck and unclean PGs as
>>> > > well as the degraded objects; it seems that recovery has stopped?
>>> > >
>>> > Yeah ... recovery essentially halted. I'm sure it's no accident that
>>> > there are exactly 1023 (1024-1) unhealthy pgs.
>>> >
>>> > >
>>> > > Something else that suggests a bug, or at least a stuck OSD.
>>> > >
>>> > > > [root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
>>> > > > pg 12.28a is stuck unclean for 126274.215835, current state
>>> > > > active+undersized+degraded+inconsistent, last acting [36,52]
>>> > > > pg 12.28a is stuck undersized for 3499.099747, current state
>>> > > > active+undersized+degraded+inconsistent, last acting [36,52]
>>> > > > pg 12.28a is stuck degraded for 3499.107051, current state
>>> > > > active+undersized+degraded+inconsistent, last acting [36,52]
>>> > > > pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
>>> > > >
>>> > > > [root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
>>> > > > ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700
>>> > > > -1 log_channel(default) log [ERR] : 12.28a shard 20: soid
>>> > > > c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12
>>> > > > candidate had a read error, digest 2029411064 != known digest
>>> > > > 2692480864
>>> > >     ^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> > > That's the culprit, google for it. Of course the most promising looking
>>> > > answer is behind the RH paywall.
>>> > >
>>> > This part is the most confusing for me. To me, this should indicate that
>>> > there was some kind of bitrot on the disk (I'd love for ZFS to be better
>>> > supported here). What I don't understand is that the actual object has
>>> > identical md5sums, timestamps, etc. I don't know if this means there was
>>> > just a transient error that Ceph can't get over, or whether I'm
>>> > mistakenly looking at the wrong object. Maybe something stored in an
>>> > xattr somewhere?
>>> >
>>> I could think of more scenarios, not knowing in detail how either that
>>> checksum or the md5sum works.
>>> Like one going through the page cache while the other doesn't,
>>> or the checksum being corrupted, written out of order, etc.
>>>
>>> And transient errors should hopefully respond well to an OSD restart.
>>>
>> Unfortunately not this time.
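For the archives, this is roughly what I've been running against the stuck PG so far, without anything conclusive; it assumes the default log location and treats osd.36 as the primary for 12.28a (it is, per the acting set [36,52]):

  # see what the PG itself thinks is going on (peering/recovery state, missing objects)
  ceph pg 12.28a query
  # list everything the monitors consider stuck
  ceph pg dump_stuck unclean
  # re-issue the repair and watch the primary's log to see whether it is ever picked up
  ceph pg repair 12.28a
  tail -f /var/log/ceph/ceph-osd.36.log | grep -i 12.28a

So far the repair request shows up in the mon logs but never in either OSD's log, which matches what I described above.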
>>
>>> > >
>>> > > Looks like that disk has an issue, guess you're not seeing this on
>>> > > osd.52, right?
>>> > >
>>> > Correct.
>>> >
>>> > > Check osd.36's SMART status.
>>> > >
>>> > SMART is normal, no errors, all counters seem fine.
>>> >
>>> If there were an actual issue with the HDD, I'd expect to see at least
>>> some Pending or Offline sectors.
>>>
>>> > >
>>> > > My guess is that you may have to set min_size to 1 and recover osd.35
>>> > > as well, but don't take my word for it.
>>> > >
>>> > Thanks for the suggestion. I'm holding out for the moment in case
>>> > someone else reads this and has an "aha" moment. At the moment, I'm not
>>> > sure if it would be more dangerous to try and blow away the object on
>>> > osd.36 and hope for recovery (with min_size 1) or try a software upgrade
>>> > on an unhealthy cluster (yuck).
>>> >
>>> Well, see above.
>>>
>>> And yeah, neither of those two alternatives is particularly alluring.
>>> OTOH, you're looking at just one object versus a whole PG or OSD.
>>>
>> The more I think about it, the more I seem to be convincing myself that
>> your argument about it being a software error seems more likely. That
>> makes the option of setting min_size less appealing, because I have doubts
>> that even ridding myself of that object will be acted on appropriately.
>>
>> I think I'll look more into previous 'stuck recovery' issues and see how
>> they were handled. If the consensus for those was 'upgrade' even amidst an
>> unhealthy status, we'll probably try that route.
>>
>>> Christian
>>>
>>> > >
>>> > > Christian
>>> > >
>>> > > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970413 7fe7b2897700
>>> > > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 0 missing, 1
>>> > > > inconsistent objects
>>> > > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970423 7fe7b2897700
>>> > > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 1 errors
>>> > > >
>>> > > > [root@soi-ceph2 ceph]# md5sum /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>>> > > > \fb57b1f17421377bf2c35809f395e9b9  /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>>> > > >
>>> > > > [root@soi-ceph3 ceph]# md5sum /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>>> > > > \fb57b1f17421377bf2c35809f395e9b9  /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>>> > >
>>> > > --
>>> > > Christian Balzer        Network/Systems Engineer
>>> > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
>>> > > http://www.gol.com/
>>> > >
>>>
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
>>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
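P.S. In case anyone searching the archives wonders what the "manual" option would actually involve: if we do go that route, it would be roughly the procedure from Sebastien's post linked above, something like the sketch below. This is untested on our cluster; it assumes the copy on osd.36 really is the bad one (which the matching md5sums make me doubt), it assumes sysvinit for the service commands, and <poolname> stands in for whichever pool has id 12.

  # keep the cluster from rebalancing while the OSD is down
  ceph osd set noout
  # stop osd.36 and flush its journal before touching anything on disk
  /etc/init.d/ceph stop osd.36
  ceph-osd -i 36 --flush-journal
  # move the suspect replica out of the way instead of deleting it outright
  mv /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c /root/
  # bring the OSD back and ask the PG to repair from the surviving copy
  /etc/init.d/ceph start osd.36
  ceph pg repair 12.28a
  ceph osd unset noout
  # only if we decide to follow Christian's min_size suggestion as well
  ceph osd pool set <poolname> min_size 1

If the consensus ends up being 'upgrade first', we'll obviously do that before attempting any of the above.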
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com