Hi All,

Is there a known procedure for debugging PG state when problems like this
occur?
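
For now, the rough sequence I had in mind is below; I am not sure whether
this is the right procedure or whether these are the most useful places to
look, so corrections are welcome:

  # cluster-wide view of the problem
  ceph health detail
  ceph pg dump_stuck inactive

  # OSD/host layout, to see which failure domains are actually up
  ceph osd tree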

Best regards,
Yuri.

2017-08-28 14:05 GMT+03:00 Yuri Gorshkov <ygorsh...@smartlabs.tv>:

> Hi.
>
> While trying to take a host down for maintenance, I ran into an I/O stall,
> along with some PGs unexpectedly stuck in the 'peered' state.
>
> Cluster stats: 96/96 OSDs, healthy prior to the incident, 5120 PGs, 4 hosts
> with 24 OSDs each. Ceph version 11.2.0, using standard filestore
> (with LVM journals on SSD) and the default CRUSH map. All pools are size 3,
> min_size 2.
>
> Steps to reproduce the problem (the exact commands are sketched below the
> list):
> 0. Cluster is healthy, HEALTH_OK.
> 1. Set the noout flag to prepare for host removal.
> 2. Begin taking the OSDs on one of the hosts down: systemctl stop ceph-osd@$osd.
> 3. Notice that I/O stalls unexpectedly and that about 100 PGs in total are
> in the degraded+undersized+peered state once the host is down.
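>
> Roughly, the commands were the following (the OSD id range in the loop is
> just a placeholder for the 24 OSDs on that host, so treat this as a sketch
> rather than the exact shell history):
>
>   # on the admin node, before maintenance
>   ceph osd set noout
>
>   # on the host being taken down, stop its OSDs one at a time
>   for osd in $(seq 72 95); do systemctl stop ceph-osd@$osd; done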
>
> AFAIK the 'peered' state means that the PG currently has fewer than
> min_size replicas available, so something strange is going on. Since we
> have 4 hosts and are using the default CRUSH map, how is it possible that
> taking one host (or even just some of the OSDs on that host) down leaves
> some PGs in the cluster with fewer than 2 copies?
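>
> To rule out a misconfiguration on our side, this is roughly what I am
> planning to check next (stock commands, nothing cluster-specific):
>
>   # replicated size / min_size per pool
>   ceph osd dump | grep 'pool'
>
>   # the CRUSH rule in use and the buckets it chooses replicas from
>   ceph osd crush rule dump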
>
> Here's a snippet of the 'ceph pg dump' output (filtered for 'peered') from
> when this happened. Sadly I don't have any more information yet...
>
> # ceph pg dump|grep peered
> dumped all in format plain
>
> pg 3.c80: undersized+degraded+remapped+backfill_wait+peered
>   (state stamp 2017-08-02 19:12:39.319222)
>   173 objects, 0 missing on primary, 346 degraded, 692 misplaced, 0 unfound
>   715341824 bytes, log 10041, ondisk log 10041
>   version 12124'104727, reported 12409:62777
>   up [62,76,44] (up primary 62), acting [2] (acting primary 2)
>   last scrub 1642'32485 (2017-07-18 22:57:06.263727)
>   last deep scrub 1008'135 (2017-07-09 22:34:40.893182)
>
> pg 3.204: undersized+degraded+remapped+backfill_wait+peered
>   (state stamp 2017-08-02 19:12:39.334905)
>   184 objects, 0 missing on primary, 368 degraded, 649 misplaced, 0 unfound
>   769544192 bytes, log 10065, ondisk log 10065
>   version 12124'13665, reported 12409:37345
>   up [75,52,1] (up primary 75), acting [2] (acting primary 2)
>   last scrub 1375'4316 (2017-07-18 00:10:27.601548)
>   last deep scrub 1371'2740 (2017-07-12 07:48:34.953831)
>
> pg 11.19: undersized+degraded+remapped+backfill_wait+peered
>   (state stamp 2017-08-02 19:12:39.311612)
>   25525 objects, 0 missing on primary, 51050 degraded, 78652 misplaced, 0 unfound
>   14829768529 bytes, log 10059, ondisk log 10059
>   version 12124'156267, reported 12409:137128
>   up [56,26,14] (up primary 56), acting [18] (acting primary 18)
>   last scrub 1375'28148 (2017-07-17 20:27:04.916079)
>   last deep scrub 0'0 (2017-07-10 16:12:49.270606)
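>
> To dig into an individual PG, I was going to start with something like the
> following (3.c80 is just the first PG from the dump above; any of the
> affected PG ids would do):
>
>   # detailed peering info, including the recovery_state section
>   ceph pg 3.c80 query
>
>   # where CRUSH currently maps this PG
>   ceph pg map 3.c80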
>
> --
> Sincerely,
> Yuri Gorshkov
> Systems Engineer
> SmartLabs LLC
> +7 (495) 645-44-46 ext. 6926
> ygorsh...@smartlabs.tv
>
>


-- 
Sincerely,
Yuri Gorshkov
Systems Engineer
SmartLabs LLC
+7 (495) 645-44-46 ext. 6926
ygorsh...@smartlabs.tv
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
