1) No errors at all. At log level 20 the OSD does not say anything about
the missing placement group.
2) I tried that. Several times actually, also for the secondary OSDs, but
it does not work.
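
For reference, this is roughly what those two attempts looked like (a
sketch; the debug level can also be set in ceph.conf, and the log path is
the Ubuntu default):

~# ceph tell osd.19 injectargs '--debug-osd 20/20'   # raise the osd debug level at runtime
~# tail -f /var/log/ceph/ceph-osd.19.log             # nothing about pg 5.6c7 appears
~# ceph osd out 19                                   # same was done for the secondary OSDs of the pg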

gr,
Bart

On Tue, Aug 18, 2015 at 4:28 AM minchen <minche...@outlook.com> wrote:

>
> osd.19 is blocked by the creating pg and by 19 client ops.
> 1. check osd.19's log to see if there are any errors
> 2. if not, mark osd 19 out of the osdmap to remap pg 5.6c7:
>     ceph osd out 19 // this will cause data migration
> I am not sure whether this will help you!
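> (a rough sketch of that sequence, assuming osd.19 should go back in once
> the pg has remapped:
>     ceph osd out 19      // mark out, CRUSH remaps pg 5.6c7 elsewhere
>     ceph -w              // watch until the remap/backfill settles
>     ceph pg 5.6c7 query  // re-check the pg on its new primary
>     ceph osd in 19       // bring osd.19 back in afterwards
> )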
>
>
> ------------------ Original ------------------
> *From: * "Bart Vanbrabant";<b...@vanbrabant.eu>;
> *Date: * Mon, Aug 17, 2015 10:14 PM
> *To: * "minchen"<minche...@outlook.com>; "ceph-users"<
> ceph-users@lists.ceph.com>;
> *Subject: * Re: [ceph-users] Stuck creating pg
>
> 1)
>
> ~# ceph pg 5.6c7 query
> Error ENOENT: i don't have pgid 5.6c7
>
> In the osd log:
>
> 2015-08-17 16:11:45.185363 7f311be40700  0 osd.19 64706 do_command r=-2 i
> don't have pgid 5.6c7
> 2015-08-17 16:11:45.185380 7f311be40700  0 log_channel(cluster) log [INF]
> : i don't have pgid 5.6c7
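>
> For completeness, the mon-side view of the same pg (and the full list of
> stuck pgs) can also be checked with something like:
>
> ~# ceph pg map 5.6c7
> ~# ceph pg dump_stuck inactive
> ~# ceph pg dump_stuck unclean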
>
> 2) I do not see anything wrong with this rule:
>
>     {
>         "rule_id": 0,
>         "rule_name": "data",
>         "ruleset": 0,
>         "type": 1,
>         "min_size": 1,
>         "max_size": 10,
>         "steps": [
>             {
>                 "op": "take",
>                 "item": -1,
>                 "item_name": "default"
>             },
>             {
>                 "op": "chooseleaf_firstn",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
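>
> To double-check that this rule can actually map pg 5.6c7 to enough OSDs,
> the compiled crush map can be tested offline with crushtool (a sketch;
> rule 0 is the one dumped above, num-rep 3 matches the pool size after the
> change in point 3):
>
> ~# ceph osd getcrushmap -o /tmp/crushmap
> ~# crushtool -i /tmp/crushmap --test --rule 0 --num-rep 3 --show-bad-mappings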
>
> 3) I rebooted all machines in the cluster and increased the replication
> level of the affected pool to 3, to be on the safe side. After recovery
> from this reboot the cluster is now in the following state:
>
> HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; 103 requests are
> blocked > 32 sec; 2 osds have slow requests; pool volumes pg_num 2048 >
> pgp_num 1400
> pg 5.6c7 is stuck inactive since forever, current state creating, last
> acting [19,25,17]
> pg 5.6c7 is stuck unclean since forever, current state creating, last
> acting [19,25,17]
> 103 ops are blocked > 524.288 sec
> 19 ops are blocked > 524.288 sec on osd.19
> 84 ops are blocked > 524.288 sec on osd.25
> 2 osds have slow requests
> pool volumes pg_num 2048 > pgp_num 1400
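>
> One thing the last warning points at: pgp_num is still 1400 while pg_num
> is 2048, so the newly split pgs are still placed with their parents. If
> the extra data movement is acceptable, it can be brought in line with
> something like:
>
> ~# ceph osd pool set volumes pgp_num 2048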
>
> Thanks,
>
> Bart
>
> On 08/17/2015 03:44 PM, minchen wrote:
>
>
> It looks like the crush rule doesn't work properly after the osdmap
> changed; there are 3 unclean pgs: 5.6c7, 5.2c7, 15.2bd.
> I think you can try the following to help locate the problem (see the
> sketch after this list):
> 1st, ceph pg <pgid> query to look up the detailed pg state,
>     e.g. which osd is it blocked by?
> 2nd, check the crush rule:
>     ceph osd crush rule dump
>     and check the crush_ruleset for pools 5 and 15,
>     e.g. the chooseleaf step may not be choosing the right OSDs.
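>     (a small sketch of that check, using the pool names as reported by
>     ceph osd lspools:
>         ceph osd lspools                            // map pool ids 5 and 15 to names
>         ceph osd pool get <poolname> crush_ruleset  // which ruleset each pool uses
>         ceph osd crush rule dump                    // compare against the rule itself
>     )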
>
> minchen
> ------------------ Original ------------------
> *From: * "Bart Vanbrabant";<b...@vanbrabant.eu>;
> *Date: * Sun, Aug 16, 2015 07:27 PM
> *To: * "ceph-users"<ceph-users@lists.ceph.com>;
>
> *Subject: * [ceph-users] Stuck creating pg
>
> Hi,
>
> I have a ceph cluster with 26 OSDs in 4 hosts, used only for RBD for an
> OpenStack cluster (started at 0.48 I think), currently running 0.94.2 on
> Ubuntu 14.04. A few days ago one of the OSDs was at 85% disk usage while
> only 30% of the raw disk space is used. I ran reweight-by-utilization
> with 150 as the cutoff level, which reshuffled the data. I also noticed
> that the number of PGs was still at the level from when there were fewer
> disks in the cluster (1300).
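>
> (For reference, the reweight run was along these lines; 150 means only
> OSDs above 150% of the average utilization get reweighted down:
>
> ~# ceph osd reweight-by-utilization 150
> )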
>
> Based on the current guidelines I increased pg_num to 2048. It created
> the placement groups except for the last one. To try to force the
> creation of that pg I marked the OSDs assigned to it out (ceph osd out),
> but that made no difference. Currently all OSDs are back in, and two pgs
> are also stuck in an unclean state:
>
> ceph health detail:
>
> HEALTH_WARN 2 pgs degraded; 2 pgs stale; 2 pgs stuck degraded; 1 pgs stuck
> inactive; 2 pgs stuck stale; 3 pgs stuck unclean; 2 pgs stuck undersized; 2
> pgs undersized; 59 requests are blocked > 32 sec; 3 osds have slow
> requests; recovery 221/549658 objects degraded (0.040%); recovery
> 221/549658 objects misplaced (0.040%); pool volumes pg_num 2048 > pgp_num
> 1400
> pg 5.6c7 is stuck inactive since forever, current state creating, last
> acting [19,25]
> pg 5.6c7 is stuck unclean since forever, current state creating, last
> acting [19,25]
> pg 5.2c7 is stuck unclean for 313513.609864, current state
> stale+active+undersized+degraded+remapped, last acting [9]
> pg 15.2bd is stuck unclean for 313513.610368, current state
> stale+active+undersized+degraded+remapped, last acting [9]
> pg 5.2c7 is stuck undersized for 308381.750768, current state
> stale+active+undersized+degraded+remapped, last acting [9]
> pg 15.2bd is stuck undersized for 308381.751913, current state
> stale+active+undersized+degraded+remapped, last acting [9]
> pg 5.2c7 is stuck degraded for 308381.750876, current state
> stale+active+undersized+degraded+remapped, last acting [9]
> pg 15.2bd is stuck degraded for 308381.752021, current state
> stale+active+undersized+degraded+remapped, last acting [9]
> pg 5.2c7 is stuck stale for 281750.295301, current state
> stale+active+undersized+degraded+remapped, last acting [9]
> pg 15.2bd is stuck stale for 281750.295293, current state
> stale+active+undersized+degraded+remapped, last acting [9]
> 16 ops are blocked > 268435 sec
> 10 ops are blocked > 134218 sec
> 10 ops are blocked > 1048.58 sec
> 23 ops are blocked > 524.288 sec
> 16 ops are blocked > 268435 sec on osd.1
> 8 ops are blocked > 134218 sec on osd.17
> 2 ops are blocked > 134218 sec on osd.19
> 10 ops are blocked > 1048.58 sec on osd.19
> 23 ops are blocked > 524.288 sec on osd.19
> 3 osds have slow requests
> recovery 221/549658 objects degraded (0.040%)
> recovery 221/549658 objects misplaced (0.040%)
> pool volumes pg_num 2048 > pgp_num 1400
>
> OSD 9 was the primary when the pg creation process got stuck. This OSD
> has been removed and added again (not only marked out, but also removed
> from the crush map and re-added).
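>
> (Roughly what that looked like, as a sketch; the weight and host bucket
> are the ones visible in the tree below:
>
> ~# ceph osd out 9
> ~# ceph osd crush remove osd.9
> ~# ceph osd crush add osd.9 0.25 host=droplet1
> ~# ceph osd in 9
> )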
>
> The bad data distribution was probably caused by the low number of PGs
> and mainly by bad weighting of the OSDs. I changed the crush map to give
> the same weight to each of the OSDs (a sketch of the reweight commands
> follows the tree below), but that did not resolve these problems either:
>
> ceph osd tree:
> ID WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 6.50000 pool default
> -6 2.00000     host droplet4
> 16 0.25000         osd.16         up  1.00000          1.00000
> 20 0.25000         osd.20         up  1.00000          1.00000
> 21 0.25000         osd.21         up  1.00000          1.00000
> 22 0.25000         osd.22         up  1.00000          1.00000
>  6 0.25000         osd.6          up  1.00000          1.00000
> 18 0.25000         osd.18         up  1.00000          1.00000
> 19 0.25000         osd.19         up  1.00000          1.00000
> 23 0.25000         osd.23         up  1.00000          1.00000
> -5 1.50000     host droplet3
>  3 0.25000         osd.3          up  1.00000          1.00000
> 13 0.25000         osd.13         up  1.00000          1.00000
> 15 0.25000         osd.15         up  1.00000          1.00000
>  4 0.25000         osd.4          up  1.00000          1.00000
> 25 0.25000         osd.25         up  1.00000          1.00000
> 14 0.25000         osd.14         up  1.00000          1.00000
> -2 1.50000     host droplet1
>  7 0.25000         osd.7          up  1.00000          1.00000
>  1 0.25000         osd.1          up  1.00000          1.00000
>  0 0.25000         osd.0          up  1.00000          1.00000
>  9 0.25000         osd.9          up  1.00000          1.00000
> 12 0.25000         osd.12         up  1.00000          1.00000
> 17 0.25000         osd.17         up  1.00000          1.00000
> -4 1.50000     host droplet2
> 10 0.25000         osd.10         up  1.00000          1.00000
>  8 0.25000         osd.8          up  1.00000          1.00000
> 11 0.25000         osd.11         up  1.00000          1.00000
>  2 0.25000         osd.2          up  1.00000          1.00000
> 24 0.25000         osd.24         up  1.00000          1.00000
>  5 0.25000         osd.5          up  1.00000          1.00000
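>
> (The equal weights above were set per OSD with crush reweight, roughly
> like this; 0.25 is simply the common value visible in the tree:
>
> ~# ceph osd crush reweight osd.16 0.25
> )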
>
> I also restarted all OSDs and monitors several times, but nothing
> changed. The pool for which the pg is stuck has replication level 2. I
> have run out of things to try. Does anyone have any other suggestions?
>
> gr,
> Bart
>
>
>
