Ceph 12.2.11, pool size 3, min_size 2.
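(The pool settings above can be double-checked with something along these
lines; the pool name is just a placeholder.)

  # replication settings of the affected pool (pool name is a placeholder)
  ceph osd pool get <poolname> size
  ceph osd pool get <poolname> min_size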

One node went down today (its private network interface started flapping,
and after a while the OSD processes crashed). No big deal; the cluster
recovered, but not completely: one PG is stuck in the active+clean+remapped
state.
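A quick way to spot it (a sketch; either command should list PGs in the
remapped state):

  # list PGs currently flagged as remapped
  ceph pg ls remapped
  # or filter the full PG dump
  ceph pg dump pgs | grep remapped

The relevant row from the PG dump: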

PG_STAT              20.a2
OBJECTS              511
MISSING_ON_PRIMARY   0
DEGRADED             0
MISPLACED            511
UNFOUND              0
BYTES                1584410172
LOG                  1500
DISK_LOG             1500
STATE                active+clean+remapped
STATE_STAMP          2019-03-26 20:50:18.639452
VERSION              96149'189204
REPORTED             96861:935872
UP                   [26,14]
UP_PRIMARY           26
ACTING               [26,14,9]
ACTING_PRIMARY       26
LAST_SCRUB           96149'189204
SCRUB_STAMP          2019-03-26 10:47:36.174769
LAST_DEEP_SCRUB      95989'187669
DEEP_SCRUB_STAMP     2019-03-22 23:29:02.322848
SNAPTRIMQ_LEN        0
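(In case it is useful, the up/acting mapping and the peering state can be
inspected with the following; output omitted here.)

  # show the CRUSH ("up") and current ("acting") OSD sets for the PG
  ceph pg map 20.a2
  # full peering/recovery state of the PG (long JSON output)
  ceph pg 20.a2 query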

It states the PG is placed on OSDs 26 and 14, but should be on 26, 14 and 9.
As far as I can see there is nothing wrong with any of those OSDs: they are
up, host other PGs, peer with each other, etc. I tried restarting all of
them one after another, but without any success. OSD 9 hosts 95 other PGs,
so I don't think it's PG overdose.
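(Roughly what was done and checked, for reference; the systemd unit names
assume a standard package install.)

  # restart the involved OSDs one after another
  systemctl restart ceph-osd@26
  systemctl restart ceph-osd@14
  systemctl restart ceph-osd@9
  # per-OSD utilization and PG counts (PGS column), to rule out PG overdose
  ceph osd df tree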

The last line in the osd.9 log that mentions PG 20.a2:
2019-03-26 20:50:16.294500 7fe27963a700  1 osd.9 pg_epoch: 96860 pg[20.a2(
v 96149'189204 (95989'187645,96149'189204] local-lis/les=96857/96858 n=511
ec=39164/39164 lis/c 96857/96855 les/c/f 96858/96856/66611
96859/96860/96855) [26,14]/[26,14,9] r=2 lpr=96860 pi=[96855,96860)/1
crt=96149'189204 lcod 0'0 remapped NOTIFY mbc={}] state<Start>:
transitioning to Stray
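(That line was pulled out roughly like this; the path assumes the default
Ceph log location.)

  # last line mentioning the PG in the osd.9 log (default log path assumed)
  grep '20.a2' /var/log/ceph/ceph-osd.9.log | tail -n 1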

Nothing else out of the ordinary in that log, just the usual
scrub/deep-scrub notifications. Any ideas what this could be, or any other
steps to troubleshoot it?