Calvin,

What does your crushmap look like ("ceph osd tree")?
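Something along these lines should be enough to show the topology and the
replication settings (this is just a sketch; the pool name below is a
placeholder, and pool id 12 is the one your stuck PG lives in):

    # CRUSH hierarchy and which OSDs are up/in
    ceph osd tree
    # rules the pools map through
    ceph osd crush rule dump
    # replica counts for the affected pool
    ceph osd pool get <poolname> size
    ceph osd pool get <poolname> min_size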
I find it strange that 1023 PGs are undersized when only one OSD failed.

Bob

On Thu, Mar 31, 2016 at 9:27 AM, Calvin Morrow <calvin.mor...@gmail.com> wrote:
>
>
> On Wed, Mar 30, 2016 at 5:24 PM Christian Balzer <ch...@gol.com> wrote:
>
>> On Wed, 30 Mar 2016 15:50:07 +0000 Calvin Morrow wrote:
>>
>> > On Wed, Mar 30, 2016 at 1:27 AM Christian Balzer <ch...@gol.com> wrote:
>> >
>> > >
>> > > Hello,
>> > >
>> > > On Tue, 29 Mar 2016 18:10:33 +0000 Calvin Morrow wrote:
>> > >
>> > > > Ceph cluster with 60 OSDs, Giant 0.87.2. One of the OSDs failed due
>> > > > to a hardware error, however after normal recovery it seems stuck
>> > > > with one active+undersized+degraded+inconsistent pg.
>> > > >
>> > > Any reason (other than inertia, which I understand very well) you're
>> > > running a non-LTS version that last saw bug fixes a year ago?
>> > > You may very well be facing a bug that has long been fixed even in
>> > > Firefly, let alone Hammer.
>> > >
>> > I know we discussed Hammer several times, and I don't remember the exact
>> > reason we held off. Other than that, inertia is probably the best
>> > answer I have.
>> >
>> Fair enough.
>>
>> I just seem to remember similar scenarios where recovery got stuck/hung
>> and thus would assume it was fixed in newer versions.
>>
>> If you google for "ceph recovery stuck" you find another potential
>> solution behind the RH paywall and this:
>>
>> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043894.html
>>
>> That would have been my next suggestion anyway; Ceph OSDs seem to take
>> well to the 'IT Crowd' mantra of "Have you tried turning it off and on
>> again?". ^o^
>>
> Yeah, unfortunately that was something I tried before reaching out on the
> mailing list. It didn't seem to change anything.
>
> In particular, I was noticing that my "ceph pg repair 12.28a" command
> never seemed to be acknowledged by the OSD. I was hoping for some sort of
> log message, even an 'ERR', but while I saw messages about other pg scrubs,
> nothing showed up for the problem PG. I tried before and after an OSD
> restart (both OSDs) without any apparent change.
>
>>
>> > >
>> > > If so, hopefully one of the devs remembering it can pipe up.
>> > >
>> > > > I haven't been able to get repair to happen using "ceph pg repair
>> > > > 12.28a"; I can see the activity logged in the mon logs, however the
>> > > > repair doesn't actually seem to happen in any of the actual osd logs.
>> > > >
>> > > > I tried following Sébastien's instructions for manually locating the
>> > > > inconsistent object
>> > > > (http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/),
>> > > > however the md5sums of the objects both match, so I'm not quite
>> > > > sure how to proceed.
>> > > >
>> > > Rolling a die? ^o^
>> > > Do they have similar (identical, really) timestamps as well?
>> > >
>> > Yes, the timestamps are identical.
>> >
>> Unsurprisingly.
>>
>> > >
>> > > > Any ideas on how to return to a healthy cluster?
>> > > >
>> > > > [root@soi-ceph2 ceph]# ceph status
>> > > >     cluster 6cc00165-4956-4947-8605-53ba51acd42b
>> > > >      health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent; 1023
>> > > > pgs stuck degraded; 1099 pgs stuck unclean; 1023 pgs stuck
>> > > > undersized; 1023 pgs undersized; recovery 132091/23742762 objects
>> > > > degraded (0.556%); 7745/23742762 objects misplaced (0.033%); 1 scrub
>> > > > errors
>> > > >      monmap e5: 3 mons at {soi-ceph1=10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0},
>> > > > election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
>> > > >      osdmap e41120: 60 osds: 59 up, 59 in
>> > > >       pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728 kobjects
>> > > >             91295 GB used, 73500 GB / 160 TB avail
>> > > >             132091/23742762 objects degraded (0.556%); 7745/23742762
>> > > > objects misplaced (0.033%)
>> > > >                60341 active+clean
>> > > >                   76 active+remapped
>> > > >                 1022 active+undersized+degraded
>> > > >                    1 active+undersized+degraded+inconsistent
>> > > >   client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
>> > > >
>> > > What's confusing to me in this picture are the stuck and unclean PGs as
>> > > well as the degraded objects; it seems that recovery has stopped?
>> > >
>> > Yeah ... recovery essentially halted. I'm sure it's no accident that
>> > there are exactly 1023 (1024-1) unhealthy pgs.
>> >
>> > >
>> > > Something else that suggests a bug, or at least a stuck OSD.
>> > >
>> > > > [root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
>> > > > pg 12.28a is stuck unclean for 126274.215835, current state
>> > > > active+undersized+degraded+inconsistent, last acting [36,52]
>> > > > pg 12.28a is stuck undersized for 3499.099747, current state
>> > > > active+undersized+degraded+inconsistent, last acting [36,52]
>> > > > pg 12.28a is stuck degraded for 3499.107051, current state
>> > > > active+undersized+degraded+inconsistent, last acting [36,52]
>> > > > pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
>> > > >
>> > > > [root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
>> > > > ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700
>> > > > -1 log_channel(default) log [ERR] : 12.28a shard 20: soid
>> > > > c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12
>> > > > candidate had a read error, digest 2029411064 != known digest
>> > > > 2692480864
>> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^
>> > > That's the culprit; google for it. Of course the most promising-looking
>> > > answer is behind the RH paywall.
>> > >
>> > This part is the most confusing for me. To me, this should indicate that
>> > there was some kind of bitrot on the disk (I'd love for ZFS to be better
>> > supported here). What I don't understand is that the actual object has
>> > identical md5sums, timestamps, etc. I don't know if this means there was
>> > just a transient error that Ceph can't get over, or whether I'm
>> > mistakenly looking at the wrong object. Maybe something stored in an
>> > xattr somewhere?
>> >
>> I could think of more scenarios, not knowing in detail how either that
>> checksum or the md5sum works.
>> Like one going through the page cache while the other doesn't.
>> Or the checksum being corrupted, written out of order, etc.
>>
>> And transient errors should hopefully respond well to an OSD restart.
>>
> Unfortunately not this time.
>
>> > >
>> > > Looks like that disk has an issue, guess you're not seeing this on
>> > > osd.52, right?
>> > >
>> > Correct.
>> >
>> > > Check osd.36's SMART status.
>> > >
>> > SMART is normal, no errors, all counters seem fine.
>> >
>> If there were an actual issue with the HDD, I'd expect to see at least
>> some Pending or Offline sectors.
>>
>> > >
>> > > My guess is that you may have to set min_size to 1 and recover osd.36
>> > > as well, but don't take my word for it.
>> > >
>> > Thanks for the suggestion. I'm holding out for the moment in case
>> > someone else reads this and has an "aha" moment. At the moment, I'm not
>> > sure whether it would be more dangerous to try and blow away the object on
>> > osd.36 and hope for recovery (with min_size 1) or to try a software upgrade
>> > on an unhealthy cluster (yuck).
>> >
>> Well, see above.
>>
>> And yeah, neither of those two alternatives is particularly alluring.
>> OTOH, you're looking at just one object versus a whole PG or OSD.
>>
> The more I think about it, the more I seem to be convincing myself that
> your argument about it being a software error seems more likely. That
> makes the option of setting min_size less appealing, because I have doubts
> that even ridding myself of that object would be acted on appropriately.
>
> I think I'll look more into previous 'stuck recovery' issues and see how
> they were handled. If the consensus for those was 'upgrade' even amidst an
> unhealthy status, we'll probably try that route.
>
>> Christian
>>
>> > >
>> > > Christian
>> > >
>> > > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970413 7fe7b2897700
>> > > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 0 missing, 1
>> > > > inconsistent objects
>> > > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970423 7fe7b2897700
>> > > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 1 errors
>> > > >
>> > > > [root@soi-ceph2 ceph]# md5sum
>> > > > /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>> > > > \fb57b1f17421377bf2c35809f395e9b9
>> > > > /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>> > > >
>> > > > [root@soi-ceph3 ceph]# md5sum
>> > > > /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>> > > > \fb57b1f17421377bf2c35809f395e9b9
>> > > > /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>> > >
>> > >
>> > > --
>> > > Christian Balzer           Network/Systems Engineer
>> > > ch...@gol.com              Global OnLine Japan/Rakuten Communications
>> > > http://www.gol.com/
>> > >
>>
>>
>> --
>> Christian Balzer           Network/Systems Engineer
>> ch...@gol.com              Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
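One more thought, in case it helps while you decide: if you do end up trying
the manual route from Sébastien's post before (or instead of) an upgrade, the
sequence should be roughly the one below. Treat it as a sketch, not a recipe:
the stop/start commands depend on your init system, and <object file> stands
for the full object filename from your md5sum commands above.

    # ask the primary what it thinks is wrong with the PG
    ceph pg 12.28a query

    # stop the OSD holding the suspect copy and flush its journal
    service ceph stop osd.36            # or however you normally stop an OSD
    ceph-osd -i 36 --flush-journal

    # move the suspect replica aside (keep it, don't delete it)
    mv '/var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/<object file>' /root/

    # bring the OSD back and let the PG repair from the surviving copy
    service ceph start osd.36
    ceph pg repair 12.28a

I'd leave "ceph osd pool set <poolname> min_size 1" as a last resort, for the
reasons you already gave.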
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com