This is just a follow-up for those who encounter a similar problem. Originally this was a pool with only 4 nodes, size 3, min_size 2, and a big node/OSD weight difference (node weights 10, 2, 4, 4; OSD weights from 2.5 down to 0.5; detailed CRUSH map below [1], taken when only 3 nodes were left and the issue still persisted). When we excluded one of the smaller nodes from the pool, this issue appeared.
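In case it helps anyone hitting the same thing, this is roughly how you can check which hosts the up/acting OSDs of a stuck PG live on. The PG and OSD ids below are the ones from our cluster, substitute your own:

# up and acting set of the PG
ceph pg map 20.a2

# which host each of the acting OSDs sits on
ceph osd find 26
ceph osd find 14
ceph osd find 9

# or just look at the whole tree with its weights
ceph osd tree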
It turned out that the new mapping of [26,14,9] tried to put the PG on the same node twice, which conflicts with the CRUSH rule for the pool[2]: osd.26 and osd.9 reside on the same node, while the rule instructs CRUSH to place each PG copy on a separate node. For some reason the cluster was not able to do that, even though it had the required number of nodes. Anyway, I googled a similar issue[3], and it mentioned that a large weight difference can be a problem. So we took one OSD out of the fat node, the new mapping worked fine, and the issue disappeared. I guess the CRUSH algorithm can't handle some extreme weight differences, which is to be expected(?). (A small crushtool recipe for testing such mappings offline is at the very bottom of this mail, below the quoted thread.)

[1]
host backup1 {
        id -19                # do not change unnecessarily
        id -41 class hdd      # do not change unnecessarily
        id -31 class ssd      # do not change unnecessarily
        # weight 10.920
        alg straw2
        hash 0  # rjenkins1
        item osd.19 weight 2.730
        item osd.35 weight 2.730
        item osd.13 weight 2.730
        item osd.14 weight 2.730
}
host backup2 {
        id -20                # do not change unnecessarily
        id -42 class hdd      # do not change unnecessarily
        id -32 class ssd      # do not change unnecessarily
        # weight 2.544
        alg straw2
        hash 0  # rjenkins1
        item osd.33 weight 0.545
        item osd.36 weight 0.545
        item osd.12 weight 0.545
        item osd.34 weight 0.909
}
host backup3 {
        id -22                # do not change unnecessarily
        id -43 class hdd      # do not change unnecessarily
        id -36 class ssd      # do not change unnecessarily
        # weight 4.361
        alg straw2
        hash 0  # rjenkins1
        item osd.29 weight 0.545
        item osd.22 weight 0.545
        item osd.28 weight 0.545
        item osd.24 weight 0.545
        item osd.26 weight 0.545
        item osd.20 weight 0.546
        item osd.9 weight 0.545
        item osd.21 weight 0.545
}
root backups {
        id -21                # do not change unnecessarily
        id -30 class hdd      # do not change unnecessarily
        id -40 class ssd      # do not change unnecessarily
        # weight 17.825
        alg straw2
        hash 0  # rjenkins1
        item backup1 weight 10.920
        item backup2 weight 2.544
        item backup3 weight 4.361
}

[2]
rule backups-rule {
        id 3
        type replicated
        min_size 1
        max_size 10
        step take backups
        step chooseleaf firstn 0 type host
        step emit
}

[3] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015550.html

Mon, Apr 1, 2019 at 12:23, Vladimir Prokofev <v...@prokofev.me>:

> As we fixed the failed node the next day, the cluster rebalanced to its
> original state without any issues, so a crush dump would be irrelevant at
> this point I guess. Will have to wait for the next occurrence.
> Here's the tunables part, maybe it will help to shed some light:
>
>     "tunables": {
>         "choose_local_tries": 0,
>         "choose_local_fallback_tries": 0,
>         "choose_total_tries": 50,
>         "chooseleaf_descend_once": 1,
>         "chooseleaf_vary_r": 1,
>         "chooseleaf_stable": 0,
>         "straw_calc_version": 1,
>         "allowed_bucket_algs": 22,
>         "profile": "firefly",
>         "optimal_tunables": 0,
>         "legacy_tunables": 0,
>         "minimum_required_version": "firefly",
>         "require_feature_tunables": 1,
>         "require_feature_tunables2": 1,
>         "has_v2_rules": 0,
>         "require_feature_tunables3": 1,
>         "has_v3_rules": 0,
>         "has_v4_buckets": 0,
>         "require_feature_tunables5": 0,
>         "has_v5_rules": 0
>     },
>
> Sun, Mar 31, 2019 at 13:28, huang jun <hjwsm1...@gmail.com>:
>
>> seems like the crush cannot get enough osds for this pg,
>> what's the output of 'ceph osd crush dump', and especially the 'tunables'
>> section values?
>>
>> Vladimir Prokofev <v...@prokofev.me> wrote on Wed, Mar 27, 2019 at 4:02 AM:
>> >
>> > CEPH 12.2.11, pool size 3, min_size 2.
>> >
>> > One node went down today (private network interface started flapping,
>> > and after a while OSD processes crashed), no big deal, cluster recovered,
>> > but not completely.
>> > 1 PG stuck in active+clean+remapped state.
>> >
>> > PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
>> > 20.a2 511 0 0 511 0 1584410172 1500 1500 active+clean+remapped 2019-03-26 20:50:18.639452 96149'189204 96861:935872 [26,14] 26 [26,14,9] 26 96149'189204 2019-03-26 10:47:36.174769 95989'187669 2019-03-22 23:29:02.322848 0
>> >
>> > It states it's placed on OSDs 26,14, but should be on 26,14,9. As far as I
>> > can see there's nothing wrong with any of those OSDs: they work, host other
>> > PGs, peer with each other, etc. I tried restarting all of them one after
>> > another, but without any success.
>> > OSD 9 hosts 95 other PGs, so I don't think it's PG overdose.
>> >
>> > Last line of log from osd.9 mentioning PG 20.a2:
>> > 2019-03-26 20:50:16.294500 7fe27963a700  1 osd.9 pg_epoch: 96860 pg[20.a2( v 96149'189204 (95989'187645,96149'189204] local-lis/les=96857/96858 n=511 ec=39164/39164 lis/c 96857/96855 les/c/f 96858/96856/66611 96859/96860/96855) [26,14]/[26,14,9] r=2 lpr=96860 pi=[96855,96860)/1 crt=96149'189204 lcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
>> >
>> > Nothing else out of the ordinary, just the usual scrub/deep-scrub notifications.
>> > Any ideas what it can be, or any other steps to troubleshoot this?
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> --
>> Thank you!
>> HuangJun
>
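P.S. If someone wants to experiment with this kind of mapping problem without touching a live cluster, the CRUSH map can be tested offline with crushtool. The file names below are just examples, and the choose_total_tries value of 100 is only for illustration; whether raising it actually helps with a particular weight distribution is something you'd need to verify on your own map:

# export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# list inputs that the pool's rule (id 3 in our case) fails to map
# to the requested 3 distinct OSDs
crushtool -i crushmap.bin --test --rule 3 --num-rep 3 --show-bad-mappings

# rebuild the map with a higher choose_total_tries and re-run the test
crushtool -i crushmap.bin --set-choose-total-tries 100 -o crushmap.tries100.bin
crushtool -i crushmap.tries100.bin --test --rule 3 --num-rep 3 --show-bad-mappings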