On Thu, Jul 12, 2018 at 11:39 PM Alessandro De Salvo <alessandro.desa...@roma1.infn.it> wrote:
>
> Some progress, and more pain...
>
> I was able to recover the 200.00000000 object using the ceph-objectstore-tool for one of the OSDs (all identical copies), but trying to re-inject it just with rados put gave no error while the get was still returning the same I/O error. So the solution was to rm the object and then put it again; that worked.
>
> However, after restarting one of the MDSes and setting it to repaired, I've hit another, similar problem:
>
> 2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log [ERR] : error reading table object 'mds0_inotable' -5 ((5) Input/output error)
>
> Can I safely try to do the same as for object 200.00000000? Should I check something before trying it? Again, checking the copies of the object, they have identical md5sums on all the replicas.
>
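For reference, the extract / rm / put cycle described above, applied to mds0_inotable, would look roughly like the sketch below. The OSD id, data path and temporary file names are placeholders, the OSD has to be stopped before ceph-objectstore-tool can open its store, and the exact ceph-objectstore-tool invocation may vary slightly between versions:

# find the pg holding the object (it will generally differ from 10.14)
ceph osd map cephfs_metadata mds0_inotable

# extract one copy from a stopped OSD that hosts that pg
systemctl stop ceph-osd@23
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --pgid <pgid> mds0_inotable get-bytes /tmp/mds0_inotable
systemctl start ceph-osd@23

# remove the unreadable object, then write the recovered copy back
rados -p cephfs_metadata rm mds0_inotable
rados -p cephfs_metadata put mds0_inotable /tmp/mds0_inotable

# verify the object is readable again and matches what was injected
rados -p cephfs_metadata get mds0_inotable /tmp/mds0_inotable.check
md5sum /tmp/mds0_inotable /tmp/mds0_inotable.check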
Yes, it should be safe. You also need to do the same for several other objects. The full object list is:

200.00000000
mds0_inotable
100.00000000.inode
mds_snaptable
1.00000000.inode

The first three objects are per-MDS-rank. If you have enabled multi-active MDS, you also need to update the objects of the other ranks. For mds.1, the object names are 201.00000000, mds1_inotable and 101.00000000.inode.

> Thanks,
>
> Alessandro
>
> On 12/07/18 16:46, Alessandro De Salvo wrote:
>
> Unfortunately yes, all the OSDs were restarted a few times, but no change.
>
> Thanks,
>
> Alessandro
>
> On 12/07/18 15:55, Paul Emmerich wrote:
>
> This might seem like a stupid suggestion, but: have you tried to restart the OSDs?
>
> I've also encountered some random CRC errors that only showed up when trying to read an object, but not on scrubbing, and that magically disappeared after restarting the OSD.
>
> However, in my case it was clearly related to https://tracker.ceph.com/issues/22464 which doesn't seem to be the issue here.
>
> Paul
>
> 2018-07-12 13:53 GMT+02:00 Alessandro De Salvo <alessandro.desa...@roma1.infn.it>:
>>
>> On 12/07/18 11:20, Alessandro De Salvo wrote:
>>>
>>> On 12/07/18 10:58, Dan van der Ster wrote:
>>>>
>>>> On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum <gfar...@redhat.com> wrote:
>>>>>
>>>>> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo <alessandro.desa...@roma1.infn.it> wrote:
>>>>>>
>>>>>> OK, I found where the object is:
>>>>>>
>>>>>> ceph osd map cephfs_metadata 200.00000000
>>>>>> osdmap e632418 pool 'cephfs_metadata' (10) object '200.00000000' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
>>>>>>
>>>>>> So, looking at the logs of osds 23, 35 and 18, I in fact see:
>>>>>>
>>>>>> osd.23:
>>>>>> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
>>>>>>
>>>>>> osd.35:
>>>>>> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
>>>>>>
>>>>>> osd.18:
>>>>>> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
>>>>>>
>>>>>> So, basically the same error everywhere.
>>>>>>
>>>>>> I'm trying to issue a repair of pg 10.14, but I'm not sure if it will help.
>>>>>>
>>>>>> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and no disk problems anywhere. No relevant errors in the syslogs; the hosts are just fine. I cannot exclude an error on the RAID controllers, but 2 of the OSDs with 10.14 are on one SAN system and one is on a different one, so I would tend to exclude that they both had (silent) errors at the same time.
>>>>>
>>>>> That's fairly distressing. At this point I'd probably try extracting the object using ceph-objectstore-tool and seeing if it decodes properly as an mds journal. If it does, you might risk just putting it back in place to overwrite the crc.
>>>>>
>>>> Wouldn't it be easier to scrub repair the PG to fix the crc?
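The scrub and repair commands being discussed here are, roughly, the following (pg id 10.14 taken from the osd map output above; list-inconsistent-obj only shows what a completed scrub has recorded):

ceph pg deep-scrub 10.14
ceph pg repair 10.14

# check what scrub recorded, if anything
ceph health detail
rados list-inconsistent-obj 10.14 --format=json-pretty

As the rest of the thread discusses, repair may not help when all replicas carry the same bad full-object crc, since there is no good copy left to recover from.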
>>>
>>>
>>> this is what I already instructed the cluster to do, a deep scrub, but I'm not sure it can repair in case all replicas are bad, as seems to be the case.
>>
>> I finally managed (with the help of Dan) to perform the deep-scrub on pg 10.14, but the deep scrub did not detect anything wrong. Also, trying to repair 10.14 has no effect.
>> Still, when trying to access the object I get this in the OSDs:
>>
>> 2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
>>
>> Was deep-scrub supposed to detect the wrong crc? If yes, then it sounds like a bug.
>> Can I force the repair somehow?
>> Thanks,
>>
>> Alessandro
>>
>>>
>>>> Alessandro, did you already try a deep-scrub on pg 10.14?
>>>
>>> I'm waiting for the cluster to do that, I sent the request earlier this morning.
>>>
>>>> I expect it'll show an inconsistent object. Though, I'm unsure if repair will correct the crc given that in this case *all* replicas have a bad crc.
>>>
>>> Exactly, this is what I wonder too.
>>> Cheers,
>>>
>>> Alessandro
>>>
>>>> --Dan
>>>>
>>>>> However, I'm also quite curious how it ended up that way, with a checksum mismatch but identical data (and identical checksums!) across the three replicas. Have you previously done some kind of scrub repair on the metadata pool? Did the PG perhaps get backfilled due to cluster changes?
>>>>> -Greg
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Alessandro
>>>>>>
>>>>>> On 11/07/18 18:56, John Spray wrote:
>>>>>>>
>>>>>>> On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo <alessandro.desa...@roma1.infn.it> wrote:
>>>>>>>>
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> in fact I get an I/O error by hand too:
>>>>>>>>
>>>>>>>> rados get -p cephfs_metadata 200.00000000 200.00000000
>>>>>>>> error getting cephfs_metadata/200.00000000: (5) Input/output error
>>>>>>>
>>>>>>> Next step would be to go look for corresponding errors in your OSD logs, system logs, and possibly also check things like the SMART counters on your hard drives for possible root causes.
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>> Can this be recovered somehow?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Alessandro
>>>>>>>>
>>>>>>>> On 11/07/18 18:33, John Spray wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo <alessandro.desa...@roma1.infn.it> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> after the upgrade to luminous 12.2.6 today, all our MDSes have been marked as damaged. Trying to restart the instances only results in standby MDSes. We currently have 2 filesystems active and 2 MDSes each.
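For context, the "damaged" state described above can be inspected with standard commands along these lines (the filesystem name cephfs is assumed, matching the commands quoted elsewhere in this thread):

ceph fs status
ceph fs get cephfs      # the MDSMap output lists the ranks currently flagged as damaged
ceph health detail

# once the underlying metadata objects are readable again, the damaged flag
# is cleared per rank, as attempted below in this thread:
ceph mds repaired cephfs:0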
>>>>>>>>>>
>>>>>>>>>> I found the following error messages in the mon:
>>>>>>>>>>
>>>>>>>>>> mds.0 <node1_IP>:6800/2412911269 down:damaged
>>>>>>>>>> mds.1 <node2_IP>:6800/830539001 down:damaged
>>>>>>>>>> mds.0 <node3_IP>:6800/4080298733 down:damaged
>>>>>>>>>>
>>>>>>>>>> Whenever I try to force the repaired state with ceph mds repaired <fs_name>:<rank> I get something like this in the MDS logs:
>>>>>>>>>>
>>>>>>>>>> 2018-07-11 13:20:41.597970 7ff7e010e700 0 mds.1.journaler.mdlog(ro) error getting journal off disk
>>>>>>>>>> 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error
>>>>>>>>>
>>>>>>>>> An EIO reading the journal header is pretty scary. The MDS itself probably can't tell you much more about this: you need to dig down into the RADOS layer. Try reading the 200.00000000 object (that happens to be the rank 0 journal header, every CephFS filesystem should have one) using the `rados` command line tool.
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>>> Any attempt at running the journal export results in errors, like this one:
>>>>>>>>>>
>>>>>>>>>> cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
>>>>>>>>>> Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200.00000000 is unreadable
>>>>>>>>>> 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`
>>>>>>>>>>
>>>>>>>>>> The same happens for recover_dentries:
>>>>>>>>>>
>>>>>>>>>> cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
>>>>>>>>>> Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200.00000000 is unreadable
>>>>>>>>>> Errors: 0
>>>>>>>>>>
>>>>>>>>>> Is there something I could try to do to get the cluster back?
>>>>>>>>>>
>>>>>>>>>> I was able to dump the contents of the metadata pool with rados export -p cephfs_metadata <filename> and I'm currently trying the procedure described in http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery but I'm not sure if it will work, as it's apparently doing nothing at the moment (maybe it's just very slow).
>>>>>>>>>>
>>>>>>>>>> Any help is appreciated, thanks!
>>>>>>>>>>
>>>>>>>>>> Alessandro
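As a rough checklist for the journal side: once the 200.00000000 header object is readable again, the same tools quoted above can be re-run to confirm the journal is intact before setting the rank back to repaired (the rank and backup file name simply mirror the commands quoted above):

cephfs-journal-tool --rank=cephfs:0 journal inspect
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary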
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com