Hi, I posted the output at http://paste.debian.net/999172/, but that paste expires too soon, so I also put it at https://pastebin.com/QfrE71Dg.
What I want to mention is that there is no known cause for what is happening. It is true that the clocks get out of sync on reboot because of a few milliseconds of skew, but NTP corrects that quickly. There are no network issues, and the log of the OSD is in the output. On the other OSDs I only see these errors, which are becoming more and more frequent:

2017-12-05 08:58:56.637773 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a shard 2: soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head(3873'5250781 client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 5250781 dd e2de2a76 od ffffffff alloc_hint [0 0])
2017-12-05 08:58:56.637775 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a shard 6: soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head(3873'5250781 client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 5250781 dd e2de2a76 od ffffffff alloc_hint [0 0])
2017-12-05 08:58:56.637777 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.0000000000002efa:head: failed to pick suitable auth object

Basically, the digests do not match. Someone told me that this can be caused by a faulty disk, so I replaced the offending drive, and now the same thing is happening with the new disk.

OK, but this thread is not about finding the source of the problem; that will be done later. This thread is about trying to recover an OSD that looks fine to the object store tool. That is: why does it break here?

starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4 /var/lib/ceph/osd/ceph-4/journal
osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03 13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)
 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5556eab28790] <--------- HERE
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x661) [0x5556ea4e6601]
 3: (OSD::load_pgs()+0x75a) [0x5556ea43a8aa]
 4: (OSD::init()+0x2026) [0x5556ea445ca6]
 5: (main()+0x2ef1) [0x5556ea3b7301]
 6: (__libc_start_main()+0xf0) [0x7f467886b830]
 7: (_start()+0x29) [0x5556ea3f8b09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-12-03 13:39:29.497091 7f467ba0b8c0 -1 osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03 13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)

So it looks like the offending code is this:

  int r = store->omap_get_values(coll, pgmeta_oid, keys, &values);
  if (r == 0) {
    assert(values.size() == 2); <------ Here
    // sanity check version

Meanwhile, the object store tool runs against the same OSD without any problem.

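To make the failing check concrete, here is a toy Python model of it (not Ceph code). It assumes that the two keys requested from the PG meta object are "_infover" and "_epoch", and that the lookup reports success while silently skipping keys that are missing; both are assumptions, not something the backtrace shows. The dictionaries, helper names and values are hypothetical.

```python
# Toy model of the check quoted from PG::peek_map_epoch; this is not Ceph code.
# ASSUMPTION: the two requested omap keys are "_infover" and "_epoch".
REQUESTED_KEYS = ("_infover", "_epoch")


def omap_get_values(omap, keys):
    """Return only the requested keys that exist (assumed lookup behaviour)."""
    return {k: omap[k] for k in keys if k in omap}


def peek_map_epoch(pgmeta_omap):
    """Mimic the quoted check: both keys must come back or the assert fires."""
    values = omap_get_values(pgmeta_omap, REQUESTED_KEYS)
    assert len(values) == 2, "FAILED assert(values.size() == 2)"
    return values["_epoch"]


# Hypothetical pgmeta contents, for illustration only.
healthy_pg = {"_infover": 8, "_epoch": 3873, "_info": b"..."}
damaged_pg = {"_info": b"..."}  # both requested keys are missing

print("healthy PG epoch:", peek_map_epoch(healthy_pg))
try:
    peek_map_epoch(damaged_pg)
except AssertionError as err:
    print("OSD startup would abort:", err)
```

If that reading is right, the assert fires as soon as one PG on the OSD has a pgmeta object missing either key, and an operation that never reads those keys would not notice. Whether list-pgs is such an operation is also an assumption here.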
Here is the object store tool running against that OSD:

ceph-objectstore-tool --debug --op list-pgs --data-path /var/lib/ceph/osd/ceph-4 --journal-path /dev/sdf3
2017-12-05 09:18:25.885258 7f5dd8b94a40 0 filestore(/var/lib/ceph/osd/ceph-4) backend xfs (magic 0x58465342)
2017-12-05 09:18:25.885715 7f5dd8b94a40 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-12-05 09:18:25.885734 7f5dd8b94a40 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-12-05 09:18:25.885755 7f5dd8b94a40 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: splice is supported
2017-12-05 09:18:25.910484 7f5dd8b94a40 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-12-05 09:18:25.910545 7f5dd8b94a40 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: extsize is disabled by conf
2017-12-05 09:18:26.639796 7f5dd8b94a40 0 filestore(/var/lib/ceph/osd/ceph-4) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-12-05 09:18:26.650560 7f5dd8b94a40 1 journal _open /dev/sdf3 fd 11: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-12-05 09:18:26.662606 7f5dd8b94a40 1 journal _open /dev/sdf3 fd 11: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-12-05 09:18:26.664869 7f5dd8b94a40 1 filestore(/var/lib/ceph/osd/ceph-4) upgrade
Cluster fsid=9028f4da-0d77-462b-be9b-dbdf7fa57771
Supported features: compat={},rocompat={},incompat={1=initial feature set(~v.18),2=pginfo object,3=object locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded objects,12=transaction hints,13=pg meta object}
On-disk features: compat={},rocompat={},incompat={1=initial feature set(~v.18),2=pginfo object,3=object locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded objects,12=transaction hints,13=pg meta object}
Performing list-pgs operation
....

On 04/12/17 12:21, Ronny Aasen wrote:
> ceph health detail
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com