Dear all,

I've solved the issue. It turns out my CRUSH map was a bit wonky: the weight
of a datacenter bucket was not equal to the combined weight of the OSDs below
it. I must have accidentally edited it by hand at some point.

was

-9 3 datacenter COM1
-6 6 room 02-WIRECEN
-4 3 host ceph2
<snip>
-2 3 host ceph1
<snip>


should be

-9 6 datacenter COM1
-6 6 room 02-WIRECEN
-4 3 host ceph2
<snip>
-2 3 host ceph1
<snip>


Moving a host away from the bucket and moving it back solved the problem.
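
For anyone who runs into the same thing: the easiest way I know of to spot
this is to decompile the crushmap and compare the bucket weights by hand, and
the "move the host away and back" step is just a couple of crush move
commands. Roughly something like the following (bucket names as in my tree;
please double-check the syntax against your release before running it):

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt   # eyeball the bucket weights
$ ceph osd crush move ceph2 root=default      # pull the host out of COM1
$ ceph osd crush move ceph2 root=default datacenter=COM1 room=02-WIRECEN

Moving the bucket forces the parent weights to be recalculated, which is
presumably why the out-and-back trick works.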

- WP


On Fri, Jan 10, 2014 at 12:22 PM, YIP Wai Peng <yi...@comp.nus.edu.sg> wrote:

> Hi Wido,
>
> Thanks for the reply. I've dumped the query below.
>
> "recovery_state" doesn't say anything, there are also no missing or
> unfounded objects. What else could be wrong?
>
> - WP
>
> P.S.: I am already running optimal tunables.
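>
> (For completeness, I believe the tunables in effect can be checked with
> something along the lines of
>
> $ ceph osd crush show-tunables
>
> assuming the installed version already has that subcommand; otherwise the
> tunable lines at the top of a decompiled crushmap show the same thing.)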
>
>
> { "state": "active+remapped",
>   "epoch": 6500,
>   "up": [
>         7],
>   "acting": [
>         7,
>         3],
>   "info": { "pgid": "1.fa",
>       "last_update": "0'0",
>       "last_complete": "0'0",
>       "log_tail": "0'0",
>       "last_user_version": 0,
>       "last_backfill": "MAX",
>       "purged_snaps": "[]",
>       "history": { "epoch_created": 1,
>           "last_epoch_started": 6377,
>           "last_epoch_clean": 6379,
>           "last_epoch_split": 0,
>           "same_up_since": 6365,
>           "same_interval_since": 6365,
>           "same_primary_since": 6348,
>           "last_scrub": "0'0",
>           "last_scrub_stamp": "2014-01-09 11:37:18.202247",
>           "last_deep_scrub": "0'0",
>           "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
>           "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247"},
>       "stats": { "version": "0'0",
>           "reported_seq": "4320",
>           "reported_epoch": "6500",
>           "state": "active+remapped",
>           "last_fresh": "2014-01-10 12:19:46.219163",
>           "last_change": "2014-01-10 11:18:53.147842",
>           "last_active": "2014-01-10 12:19:46.219163",
>           "last_clean": "2014-01-09 22:02:41.243761",
>           "last_became_active": "0.000000",
>           "last_unstale": "2014-01-10 12:19:46.219163",
>           "mapping_epoch": 6351,
>           "log_start": "0'0",
>           "ondisk_log_start": "0'0",
>           "created": 1,
>           "last_epoch_clean": 6379,
>           "parent": "0.0",
>           "parent_split_bits": 0,
>           "last_scrub": "0'0",
>           "last_scrub_stamp": "2014-01-09 11:37:18.202247",
>           "last_deep_scrub": "0'0",
>           "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
>           "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247",
>           "log_size": 0,
>           "ondisk_log_size": 0,
>           "stats_invalid": "0",
>           "stat_sum": { "num_bytes": 0,
>               "num_objects": 0,
>               "num_object_clones": 0,
>               "num_object_copies": 0,
>               "num_objects_missing_on_primary": 0,
>               "num_objects_degraded": 0,
>               "num_objects_unfound": 0,
>               "num_read": 0,
>               "num_read_kb": 0,
>               "num_write": 0,
>               "num_write_kb": 0,
>               "num_scrub_errors": 0,
>               "num_shallow_scrub_errors": 0,
>               "num_deep_scrub_errors": 0,
>               "num_objects_recovered": 0,
>               "num_bytes_recovered": 0,
>               "num_keys_recovered": 0},
>           "stat_cat_sum": {},
>           "up": [
>                 7],
>           "acting": [
>                 7,
>                 3]},
>       "empty": 1,
>       "dne": 0,
>       "incomplete": 0,
>       "last_epoch_started": 6377},
>   "recovery_state": [
>         { "name": "Started\/Primary\/Active",
>           "enter_time": "2014-01-10 11:18:53.147802",
>           "might_have_unfound": [],
>           "recovery_progress": { "backfill_target": -1,
>               "waiting_on_backfill": 0,
>               "last_backfill_started": "0\/\/0\/\/-1",
>               "backfill_info": { "begin": "0\/\/0\/\/-1",
>                   "end": "0\/\/0\/\/-1",
>                   "objects": []},
>               "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
>                   "end": "0\/\/0\/\/-1",
>                   "objects": []},
>               "backfills_in_flight": [],
>               "recovering": [],
>               "pg_backend": { "pull_from_peer": [],
>                   "pushing": []}},
>           "scrub": { "scrubber.epoch_start": "4757",
>               "scrubber.active": 0,
>               "scrubber.block_writes": 0,
>               "scrubber.finalizing": 0,
>               "scrubber.waiting_on": 0,
>               "scrubber.waiting_on_whom": []}},
>         { "name": "Started",
>           "enter_time": "2014-01-10 11:18:40.137868"}]}
>
>
>
> On Fri, Jan 10, 2014 at 12:16 PM, Wido den Hollander <w...@42on.com> wrote:
>
>> On 01/10/2014 05:13 AM, YIP Wai Peng wrote:
>>
>>> Dear all,
>>>
>>> I have some PGs that are stuck unclean, and I'm trying to understand why.
>>> Hopefully someone can help me shed some light on it.
>>>
>>> For example, one of them is
>>>
>>> # ceph pg dump_stuck unclean
>>> 1.fa  0  0  0  0  0  0  0  active+remapped  2014-01-10 11:18:53.147842
>>> 0'0  6452:4272  [7]  [7,3]  0'0  2014-01-09 11:37:18.202247
>>> 0'0  2014-01-09 11:37:18.202247
>>>
>>>
>>>
>>> My pool 1 looks like this
>>>
>>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 256 pgp_num 256 last_change 2605 owner 0
>>>
>>>
>>> Rule 3 is
>>>
>>> rule different_host {
>>>          ruleset 3
>>>          type replicated
>>>          min_size 1
>>>          max_size 10
>>>          step take default
>>>          step chooseleaf firstn 0 type host
>>>          step emit
>>> }
>>>
>>>
>>> My osd tree looks like
>>>
>>> # id  weight  type name           up/down  reweight
>>> -1    40      root default
>>> -7    3         datacenter CR2
>>> -5    3           host ceph3
>>> 6     1             osd.6         up       1
>>> 7     1             osd.7         up       1
>>> 8     1             osd.8         up       1
>>> <snip>
>>> -9    3         datacenter COM1
>>> -6    6           room 02-WIRECEN
>>> -4    3             host ceph2
>>> 3     1               osd.3       up       1
>>> 4     1               osd.4       up       1
>>> 5     1               osd.5       up       1
>>>
>>>
>>> osd.7 and osd.3 are on different hosts, so the rule is satisfied. Why is
>>> the PG still in the 'remapped' state, and what is it waiting for?
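>>>
>>> For what it's worth, one way I can think of to sanity-check whether the
>>> rule can even produce two replicas is something along these lines (flags
>>> from memory, so treat it as a sketch):
>>>
>>> # ceph osd getcrushmap -o crushmap.bin
>>> # crushtool -i crushmap.bin --test --rule 3 --num-rep 2 --show-mappings
>>>
>>> If crushtool only emits a single OSD for some inputs, then CRUSH itself
>>> cannot satisfy the rule and the PG will stay remapped onto the old
>>> acting set.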
>>>
>>>
>> Try:
>>
>> $ ceph pg 1.fa query
>>
>> That will tell you why the PG is stuck.
>>
>>>  - Peng
>>>
>>>
>>>
>>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
