Does no one have any ideas about this? If needed, I can produce more information or diagnostics on request. I find it hard to believe that we are the only people experiencing this, and so far we have lost about 40 OSDs to corruption because of it.

Regards 

Stuart Harland


> On 24 May 2017, at 10:32, Stuart Harland <s.harl...@livelinktechnology.net> wrote:
> 
> Hello
> 
> I think I’m running into the bug described for Hammer at 
> http://tracker.ceph.com/issues/14213.
> 
> However, I’m running the latest version of Jewel (10.2.7), although I’m in the 
> middle of upgrading the cluster from 10.2.5. At first the crash appeared on 
> only a couple of nodes, but it now seems to be more pervasive.
> 
> I have seen this issue with osd_map_cache_size set to 20 as well as 500, the 
> latter being an increase I made to try to compensate for it (the relevant 
> ceph.conf stanza is sketched below).
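> 
> For concreteness, this is where the setting lives; a minimal ceph.conf sketch 
> (the value shown is just the one I am currently running, followed by an OSD 
> restart to pick it up):
> 
>     [osd]
>     osd map cache size = 500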
> 
> My two questions are:
> 
> 1) Is this fixed, and if so in which version?
> 2) Is there a way to recover the damaged OSD metadata? I really don’t want to 
> keep having to rebuild large numbers of disks over something this arbitrary. 
> (A sketch of the recovery I have in mind follows below.)
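> 
> For what it’s worth, the recovery I have in mind is untested, and assumes the 
> Jewel build of ceph-objectstore-tool supports the set-osdmap op; the epoch, 
> OSD id, and paths are from the crash below. The idea is to fetch the map the 
> OSD says it is missing from the monitors and inject it into the stopped OSD’s 
> store:
> 
>     ceph osd getmap 863078 -o /tmp/osdmap.863078
>     systemctl stop ceph-osd@1908
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/txc1-1908 \
>         --journal-path /var/lib/ceph/osd/txc1-1908/journal \
>         --op set-osdmap --file /tmp/osdmap.863078
>     systemctl start ceph-osd@1908
> 
> I would welcome confirmation that this is sane before I try it on a live disk.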
> 
> 
> SEEK_HOLE is disabled via 'filestore seek data hole' config option
>    -31> 2017-05-24 10:23:10.152349 7f24035e2800  0 genericfilestorebackend(/var/lib/ceph/osd/txc1-1908) detect_features: splice is supported
>    -30> 2017-05-24 10:23:10.182065 7f24035e2800  0 genericfilestorebackend(/var/lib/ceph/osd/txc1-1908) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
>    -29> 2017-05-24 10:23:10.182112 7f24035e2800  0 xfsfilestorebackend(/var/lib/ceph/osd/txc1-1908) detect_feature: extsize is disabled by conf
>    -28> 2017-05-24 10:23:10.182839 7f24035e2800  1 leveldb: Recovering log #23079
>    -27> 2017-05-24 10:23:10.284173 7f24035e2800  1 leveldb: Delete type=0 #23079
>    -26> 2017-05-24 10:23:10.284223 7f24035e2800  1 leveldb: Delete type=3 #23078
>    -25> 2017-05-24 10:23:10.284807 7f24035e2800  0 filestore(/var/lib/ceph/osd/txc1-1908) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
>    -24> 2017-05-24 10:23:10.285581 7f24035e2800  2 journal open /var/lib/ceph/osd/txc1-1908/journal fsid 8dada68b-0d1c-4f2a-bc96-1d861577bc98 fs_op_seq 20363902
>    -23> 2017-05-24 10:23:10.289523 7f24035e2800  1 journal _open /var/lib/ceph/osd/txc1-1908/journal fd 18: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
>    -22> 2017-05-24 10:23:10.293733 7f24035e2800  2 journal open advancing committed_seq 20363681 to fs op_seq 20363902
>    -21> 2017-05-24 10:23:10.293743 7f24035e2800  2 journal read_entry -- not readable
>    -20> 2017-05-24 10:23:10.293744 7f24035e2800  2 journal read_entry -- not readable
>    -19> 2017-05-24 10:23:10.293745 7f24035e2800  3 journal journal_replay: end of journal, done.
>    -18> 2017-05-24 10:23:10.297605 7f24035e2800  1 journal _open /var/lib/ceph/osd/txc1-1908/journal fd 18: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
>    -17> 2017-05-24 10:23:10.298470 7f24035e2800  1 filestore(/var/lib/ceph/osd/txc1-1908) upgrade
>    -16> 2017-05-24 10:23:10.298509 7f24035e2800  2 osd.1908 0 boot
>    -15> 2017-05-24 10:23:10.300096 7f24035e2800  1 <cls> cls/replica_log/cls_replica_log.cc:141: Loaded replica log class!
>    -14> 2017-05-24 10:23:10.300384 7f24035e2800  1 <cls> cls/user/cls_user.cc:375: Loaded user class!
>    -13> 2017-05-24 10:23:10.300617 7f24035e2800  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>    -12> 2017-05-24 10:23:10.303748 7f24035e2800  1 <cls> cls/refcount/cls_refcount.cc:232: Loaded refcount class!
>    -11> 2017-05-24 10:23:10.304120 7f24035e2800  1 <cls> cls/version/cls_version.cc:228: Loaded version class!
>    -10> 2017-05-24 10:23:10.304439 7f24035e2800  1 <cls> cls/log/cls_log.cc:317: Loaded log class!
>     -9> 2017-05-24 10:23:10.307437 7f24035e2800  1 <cls> cls/rgw/cls_rgw.cc:3359: Loaded rgw class!
>     -8> 2017-05-24 10:23:10.307768 7f24035e2800  1 <cls> cls/timeindex/cls_timeindex.cc:259: Loaded timeindex class!
>     -7> 2017-05-24 10:23:10.307927 7f24035e2800  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>     -6> 2017-05-24 10:23:10.308086 7f24035e2800  1 <cls> cls/statelog/cls_statelog.cc:306: Loaded log class!
>     -5> 2017-05-24 10:23:10.315241 7f24035e2800  0 osd.1908 863035 crush map has features 2234490552320, adjusting msgr requires for clients
>     -4> 2017-05-24 10:23:10.315258 7f24035e2800  0 osd.1908 863035 crush map has features 2234490552320 was 8705, adjusting msgr requires for mons
>     -3> 2017-05-24 10:23:10.315267 7f24035e2800  0 osd.1908 863035 crush map has features 2234490552320, adjusting msgr requires for osds
>     -2> 2017-05-24 10:23:10.441444 7f24035e2800  0 osd.1908 863035 load_pgs
>     -1> 2017-05-24 10:23:10.442608 7f24035e2800 -1 osd.1908 863035 load_pgs: have pgid 11.3f5a at epoch 863078, but missing map.  Crashing.
>      0> 2017-05-24 10:23:10.444151 7f24035e2800 -1 osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7f24035e2800 time 2017-05-24 10:23:10.442617
> osd/OSD.cc: 3189: FAILED assert(0 == "Missing map in load_pgs")
> 
>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55d1874be6db]
>  2: (OSD::load_pgs()+0x1f9b) [0x55d186e6a26b]
>  3: (OSD::init()+0x1f74) [0x55d186e7aec4]
>  4: (main()+0x29d1) [0x55d186de1d71]
>  5: (__libc_start_main()+0xf5) [0x7f24004fdf45]
>  6: (()+0x356a47) [0x55d186e2aa47]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
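> 
> In case anyone wants to dig into the trace, this is roughly how I have been 
> reading it (a sketch; /usr/bin/ceph-osd is where the binary lives on my 
> nodes, and the matching debug symbols need to be installed for the source 
> annotation to show up):
> 
>     objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.dis
>     grep -n 'load_pgs' /tmp/ceph-osd.dis | head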
> 
> Regards
> 
> Stuart Harland
> 
