Does no one have any idea about this? I can produce more information or diagnostics on request. I find it hard to believe that we are the only people experiencing this, and so far we have lost about 40 OSDs to corruption because of it.
Regards,
Stuart Harland

> On 24 May 2017, at 10:32, Stuart Harland <s.harl...@livelinktechnology.net> wrote:
>
> Hello,
>
> I think I’m running into a bug that is described for Hammer at
> http://tracker.ceph.com/issues/14213
>
> However, I’m running the latest version of Jewel (10.2.7), although I’m in
> the middle of upgrading the cluster (from 10.2.5). At first the crash
> appeared on a couple of nodes, but now it seems to be more pervasive.
>
> I have seen this issue with osd_map_cache_size set to 20 as well as 500,
> which I increased to try to compensate for it (see the config sketch after
> the quoted log below).
>
> My two questions are:
>
> 1) Is this fixed, and if so, in which version?
> 2) Is there a way to recover the damaged OSD metadata? I really don’t want
> to keep having to rebuild large numbers of disks based on something
> arbitrary. (A possible approach is sketched after the quoted log below.)
>
> SEEK_HOLE is disabled via 'filestore seek data hole' config option
> -31> 2017-05-24 10:23:10.152349 7f24035e2800  0 genericfilestorebackend(/var/lib/ceph/osd/txc1-1908) detect_features: splice is supported
> -30> 2017-05-24 10:23:10.182065 7f24035e2800  0 genericfilestorebackend(/var/lib/ceph/osd/txc1-1908) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
> -29> 2017-05-24 10:23:10.182112 7f24035e2800  0 xfsfilestorebackend(/var/lib/ceph/osd/txc1-1908) detect_feature: extsize is disabled by conf
> -28> 2017-05-24 10:23:10.182839 7f24035e2800  1 leveldb: Recovering log #23079
> -27> 2017-05-24 10:23:10.284173 7f24035e2800  1 leveldb: Delete type=0 #23079
> -26> 2017-05-24 10:23:10.284223 7f24035e2800  1 leveldb: Delete type=3 #23078
> -25> 2017-05-24 10:23:10.284807 7f24035e2800  0 filestore(/var/lib/ceph/osd/txc1-1908) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
> -24> 2017-05-24 10:23:10.285581 7f24035e2800  2 journal open /var/lib/ceph/osd/txc1-1908/journal fsid 8dada68b-0d1c-4f2a-bc96-1d861577bc98 fs_op_seq 20363902
> -23> 2017-05-24 10:23:10.289523 7f24035e2800  1 journal _open /var/lib/ceph/osd/txc1-1908/journal fd 18: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
> -22> 2017-05-24 10:23:10.293733 7f24035e2800  2 journal open advancing committed_seq 20363681 to fs op_seq 20363902
> -21> 2017-05-24 10:23:10.293743 7f24035e2800  2 journal read_entry -- not readable
> -20> 2017-05-24 10:23:10.293744 7f24035e2800  2 journal read_entry -- not readable
> -19> 2017-05-24 10:23:10.293745 7f24035e2800  3 journal journal_replay: end of journal, done.
> -18> 2017-05-24 10:23:10.297605 7f24035e2800  1 journal _open /var/lib/ceph/osd/txc1-1908/journal fd 18: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
> -17> 2017-05-24 10:23:10.298470 7f24035e2800  1 filestore(/var/lib/ceph/osd/txc1-1908) upgrade
> -16> 2017-05-24 10:23:10.298509 7f24035e2800  2 osd.1908 0 boot
> -15> 2017-05-24 10:23:10.300096 7f24035e2800  1 <cls> cls/replica_log/cls_replica_log.cc:141: Loaded replica log class!
> -14> 2017-05-24 10:23:10.300384 7f24035e2800  1 <cls> cls/user/cls_user.cc:375: Loaded user class!
> -13> 2017-05-24 10:23:10.300617 7f24035e2800  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> -12> 2017-05-24 10:23:10.303748 7f24035e2800  1 <cls> cls/refcount/cls_refcount.cc:232: Loaded refcount class!
> -11> 2017-05-24 10:23:10.304120 7f24035e2800  1 <cls> cls/version/cls_version.cc:228: Loaded version class!
> -10> 2017-05-24 10:23:10.304439 7f24035e2800  1 <cls> cls/log/cls_log.cc:317: Loaded log class!
> -9> 2017-05-24 10:23:10.307437 7f24035e2800  1 <cls> cls/rgw/cls_rgw.cc:3359: Loaded rgw class!
> -8> 2017-05-24 10:23:10.307768 7f24035e2800  1 <cls> cls/timeindex/cls_timeindex.cc:259: Loaded timeindex class!
> -7> 2017-05-24 10:23:10.307927 7f24035e2800  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> -6> 2017-05-24 10:23:10.308086 7f24035e2800  1 <cls> cls/statelog/cls_statelog.cc:306: Loaded log class!
> -5> 2017-05-24 10:23:10.315241 7f24035e2800  0 osd.1908 863035 crush map has features 2234490552320, adjusting msgr requires for clients
> -4> 2017-05-24 10:23:10.315258 7f24035e2800  0 osd.1908 863035 crush map has features 2234490552320 was 8705, adjusting msgr requires for mons
> -3> 2017-05-24 10:23:10.315267 7f24035e2800  0 osd.1908 863035 crush map has features 2234490552320, adjusting msgr requires for osds
> -2> 2017-05-24 10:23:10.441444 7f24035e2800  0 osd.1908 863035 load_pgs
> -1> 2017-05-24 10:23:10.442608 7f24035e2800 -1 osd.1908 863035 load_pgs: have pgid 11.3f5a at epoch 863078, but missing map. Crashing.
> 0> 2017-05-24 10:23:10.444151 7f24035e2800 -1 osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7f24035e2800 time 2017-05-24 10:23:10.442617
> osd/OSD.cc: 3189: FAILED assert(0 == "Missing map in load_pgs")
>
> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55d1874be6db]
> 2: (OSD::load_pgs()+0x1f9b) [0x55d186e6a26b]
> 3: (OSD::init()+0x1f74) [0x55d186e7aec4]
> 4: (main()+0x29d1) [0x55d186de1d71]
> 5: (__libc_start_main()+0xf5) [0x7f24004fdf45]
> 6: (()+0x356a47) [0x55d186e2aa47]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Regards
>
> Stuart Harland
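For anyone wanting to reproduce the osd_map_cache_size change mentioned in the quoted message: a minimal sketch of both ways to set it, persistently in ceph.conf or injected into a running daemon. The option name and injectargs syntax are standard; osd.1908 is only used here because it is the OSD from the log.

  # persistent setting in /etc/ceph/ceph.conf (restart the OSD to apply)
  [osd]
      osd map cache size = 500

  # or inject into the running OSD without a restart (reverts on restart)
  ceph tell osd.1908 injectargs '--osd-map-cache-size 500'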
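On question 2, one approach that might recover an OSD failing with "Missing map in load_pgs" is to fetch the missing osdmap epoch from the monitors and inject it into the broken OSD's local store. This is a hedged sketch only, not something we have verified: it assumes the get-osdmap/set-osdmap ops are present in the 10.2.7 build of ceph-objectstore-tool (they may not be in all Jewel releases), reuses the data path and epoch from the log above, and should only be run against a stopped OSD, ideally after taking a backup.

  # stop the affected OSD first (unit name depends on init system and cluster name)
  systemctl stop ceph-osd@1908

  # fetch the epoch the OSD reports as missing (863078 in the log above)
  ceph osd getmap 863078 -o /tmp/osdmap.863078

  # inject the good copy into the OSD's store; --op set-osdmap is assumed
  # to exist in this build of ceph-objectstore-tool
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/txc1-1908 \
      --journal-path /var/lib/ceph/osd/txc1-1908/journal \
      --op set-osdmap --file /tmp/osdmap.863078

  # restart; if further epochs are missing the OSD will crash again with a
  # new epoch in the log, so the fetch/inject step may need repeating
  systemctl start ceph-osd@1908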
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com