Dear all,

I've solved the issue. Turns out my CRUSH map was a bit wonky: the weight of a datacenter bucket was not equal to the combined weight of the OSDs below it. I must have accidentally edited it by hand at some point.
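In case it helps anyone else hitting the same thing, this is roughly how the mismatch can be spotted (and fixed by hand) from the CLI -- a sketch only, and the /tmp file names are just placeholders:

# ceph osd getcrushmap -o /tmp/cm.bin
# crushtool -d /tmp/cm.bin -o /tmp/cm.txt

In the decompiled /tmp/cm.txt, the "item <bucket> weight ..." line in each parent bucket should match the sum of the item weights listed inside that bucket's own definition (in my case the COM1 entry under the root, which showed 3 instead of 6). As an alternative to moving a host out of the bucket and back in (which is what I ended up doing, see below), one could also correct the number in the text file, recompile it and inject it back:

# crushtool -c /tmp/cm.txt -o /tmp/cm.new
# ceph osd setcrushmap -i /tmp/cm.new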
The relevant part of the tree was:

-9   3       datacenter COM1
-6   6           room 02-WIRECEN
-4   3               host ceph2
<snip>
-2   3               host ceph1
<snip>

and should be:

-9   6       datacenter COM1
-6   6           room 02-WIRECEN
-4   3               host ceph2
<snip>
-2   3               host ceph1
<snip>

Moving a host away from the bucket and moving it back solved the problem.

- WP

On Fri, Jan 10, 2014 at 12:22 PM, YIP Wai Peng <yi...@comp.nus.edu.sg> wrote:

> Hi Wido,
>
> Thanks for the reply. I've dumped the query below.
>
> "recovery_state" doesn't say anything, and there are also no missing or
> unfound objects. What else could be wrong?
>
> - WP
>
> P.S: I am running tunables optimal already.
>
> { "state": "active+remapped",
>   "epoch": 6500,
>   "up": [ 7 ],
>   "acting": [ 7, 3 ],
>   "info": { "pgid": "1.fa",
>     "last_update": "0'0",
>     "last_complete": "0'0",
>     "log_tail": "0'0",
>     "last_user_version": 0,
>     "last_backfill": "MAX",
>     "purged_snaps": "[]",
>     "history": { "epoch_created": 1,
>       "last_epoch_started": 6377,
>       "last_epoch_clean": 6379,
>       "last_epoch_split": 0,
>       "same_up_since": 6365,
>       "same_interval_since": 6365,
>       "same_primary_since": 6348,
>       "last_scrub": "0'0",
>       "last_scrub_stamp": "2014-01-09 11:37:18.202247",
>       "last_deep_scrub": "0'0",
>       "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
>       "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247"},
>     "stats": { "version": "0'0",
>       "reported_seq": "4320",
>       "reported_epoch": "6500",
>       "state": "active+remapped",
>       "last_fresh": "2014-01-10 12:19:46.219163",
>       "last_change": "2014-01-10 11:18:53.147842",
>       "last_active": "2014-01-10 12:19:46.219163",
>       "last_clean": "2014-01-09 22:02:41.243761",
>       "last_became_active": "0.000000",
>       "last_unstale": "2014-01-10 12:19:46.219163",
>       "mapping_epoch": 6351,
>       "log_start": "0'0",
>       "ondisk_log_start": "0'0",
>       "created": 1,
>       "last_epoch_clean": 6379,
>       "parent": "0.0",
>       "parent_split_bits": 0,
>       "last_scrub": "0'0",
>       "last_scrub_stamp": "2014-01-09 11:37:18.202247",
>       "last_deep_scrub": "0'0",
>       "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
>       "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247",
>       "log_size": 0,
>       "ondisk_log_size": 0,
>       "stats_invalid": "0",
>       "stat_sum": { "num_bytes": 0,
>         "num_objects": 0,
>         "num_object_clones": 0,
>         "num_object_copies": 0,
>         "num_objects_missing_on_primary": 0,
>         "num_objects_degraded": 0,
>         "num_objects_unfound": 0,
>         "num_read": 0,
>         "num_read_kb": 0,
>         "num_write": 0,
>         "num_write_kb": 0,
>         "num_scrub_errors": 0,
>         "num_shallow_scrub_errors": 0,
>         "num_deep_scrub_errors": 0,
>         "num_objects_recovered": 0,
>         "num_bytes_recovered": 0,
>         "num_keys_recovered": 0},
>       "stat_cat_sum": {},
>       "up": [ 7 ],
>       "acting": [ 7, 3 ]},
>     "empty": 1,
>     "dne": 0,
>     "incomplete": 0,
>     "last_epoch_started": 6377},
>   "recovery_state": [
>     { "name": "Started\/Primary\/Active",
>       "enter_time": "2014-01-10 11:18:53.147802",
>       "might_have_unfound": [],
>       "recovery_progress": { "backfill_target": -1,
>         "waiting_on_backfill": 0,
>         "last_backfill_started": "0\/\/0\/\/-1",
>         "backfill_info": { "begin": "0\/\/0\/\/-1",
>           "end": "0\/\/0\/\/-1",
>           "objects": []},
>         "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
>           "end": "0\/\/0\/\/-1",
>           "objects": []},
>         "backfills_in_flight": [],
>         "recovering": [],
>         "pg_backend": { "pull_from_peer": [],
>           "pushing": []}},
>       "scrub": { "scrubber.epoch_start": "4757",
>         "scrubber.active": 0,
>         "scrubber.block_writes": 0,
>         "scrubber.finalizing": 0,
>         "scrubber.waiting_on": 0,
>         "scrubber.waiting_on_whom": []}},
>     { "name": "Started",
>       "enter_time": "2014-01-10 11:18:40.137868"}]}
>
> On Fri, Jan 10, 2014 at 12:16 PM, Wido den Hollander <w...@42on.com> wrote:
>
>> On 01/10/2014 05:13 AM, YIP Wai Peng wrote:
>>
>>> Dear all,
>>>
>>> I have some pgs that are stuck_unclean, and I'm trying to understand why.
>>> Hopefully someone can help me shed some light on it.
>>>
>>> For example, one of them is:
>>>
>>> # ceph pg dump_stuck unclean
>>> 1.fa  0  0  0  0  0  0  0  active+remapped  2014-01-10 11:18:53.147842  0'0  6452:4272  [7]  [7,3]  0'0  2014-01-09 11:37:18.202247  0'0  2014-01-09 11:37:18.202247
>>>
>>> My pool 1 looks like this:
>>>
>>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 256 pgp_num 256 last_change 2605 owner 0
>>>
>>> The rule 3 is:
>>>
>>> rule different_host {
>>>         ruleset 3
>>>         type replicated
>>>         min_size 1
>>>         max_size 10
>>>         step take default
>>>         step chooseleaf firstn 0 type host
>>>         step emit
>>> }
>>>
>>> My osd tree looks like:
>>>
>>> # id    weight  type name               up/down  reweight
>>> -1      40      root default
>>> -7      3           datacenter CR2
>>> -5      3               host ceph3
>>> 6       1                   osd.6       up       1
>>> 7       1                   osd.7       up       1
>>> 8       1                   osd.8       up       1
>>> <snip>
>>> -9      3           datacenter COM1
>>> -6      6               room 02-WIRECEN
>>> -4      3                   host ceph2
>>> 3       1                       osd.3   up       1
>>> 4       1                       osd.4   up       1
>>> 5       1                       osd.5   up       1
>>>
>>> osd.7 and osd.3 are on different hosts, so the rule is satisfied. Why is
>>> it still in the 'remapped' status, and what is it waiting for?
>>>
>> Try:
>>
>> $ ceph pg 1.fa query
>>
>> That will tell you the cause of why the PG is stuck.
>>
>>> - Peng
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on