Hi.

While taking a host down for maintenance I ran into an I/O stall, with some
PGs unexpectedly marked 'peered'.

Cluster stats: 96/96 OSDs up/in, healthy prior to the incident; 5120 PGs;
4 hosts with 24 OSDs each. Ceph version 11.2.0 (Kraken), standard filestore
(with LVM journals on SSD) and the default CRUSH map. All pools are size 3,
min_size 2.
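
For reference, those numbers work out to a fairly typical PG density per
OSD (pure arithmetic from the figures above):

```shell
# 5120 PGs x 3 replicas spread over 96 OSDs (numbers from the cluster above).
pgs=5120; size=3; osds=96
echo "$(( pgs * size / osds )) PG replicas per OSD"   # prints: 160 PG replicas per OSD
```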

Steps to reproduce the problem:
0. Cluster is healthy (HEALTH_OK).
1. Set the noout flag to prepare for taking the host down.
2. Begin stopping the OSDs on one of the hosts: systemctl stop ceph-osd@$osd.
3. I/O stalls unexpectedly, and about 100 PGs in total end up in the
undersized+degraded+peered state while the host is down.
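
The sequence in steps 1-3 as a small script, for clarity. The OSD IDs and
the dry-run wrapper are illustrative (not what we ran verbatim); with
DRY_RUN=1 (the default here) it only prints each command, so it is safe to
run without a cluster:

```shell
#!/bin/sh
# Sketch of the drain sequence above. DRY_RUN=1 prints each command
# instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run ceph osd set noout                  # step 1: keep stopped OSDs from being marked out
for osd in 0 1 2 3; do                  # step 2: OSD IDs on the drained host (assumed)
    run systemctl stop "ceph-osd@$osd"
done
run ceph -s                             # step 3: watch for stalled I/O / peered PGs
```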

AFAIK the 'peered' state means the PG's acting set has fewer than min_size
replicas, so it cannot serve I/O — something strange is going on. Since we
have 4 hosts and the default CRUSH map (which should place each replica on
a distinct host), how is it possible that taking one host (or even just
some OSDs on that host) down leaves PGs with fewer than 2 copies?

Here's a snippet of 'ceph pg dump' (filtered for 'peered') from when this
happened. Sadly I don't have any more information yet...

# ceph pg dump|grep peered
dumped all in format plain
3.c80 173 0 346 692 0 715341824 10041 10041 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.319222 12124'104727 12409:62777 [62,76,44] 62 [2] 2 1642'32485 2017-07-18 22:57:06.263727 1008'135 2017-07-09 22:34:40.893182
3.204 184 0 368 649 0 769544192 10065 10065 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.334905 12124'13665 12409:37345 [75,52,1] 75 [2] 2 1375'4316 2017-07-18 00:10:27.601548 1371'2740 2017-07-12 07:48:34.953831
11.19 25525 0 51050 78652 0 14829768529 10059 10059 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.311612 12124'156267 12409:137128 [56,26,14] 56 [18] 18 1375'28148 2017-07-17 20:27:04.916079 0'0 2017-07-10 16:12:49.270606
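
In case it's useful, here is the filter I'd use to pull out every PG whose
acting set has dropped below min_size from 'ceph pg dump' plain output.
The field positions are my assumption from the Kraken-era dump format
($1 = pgid, $10 = state, $17 = acting set); the heredoc feeds it one row
from the snippet above, re-joined onto a single line:

```shell
#!/bin/sh
# find_undersized: read 'ceph pg dump' plain-format rows on stdin and print
# PGs in a peered state whose acting set is smaller than min_size (2 here).
find_undersized() {
    awk -v min=2 '$10 ~ /peered/ {
        # strip the surrounding [ ] and count the OSDs in the acting set
        n = split(substr($17, 2, length($17) - 2), osds, ",")
        if (n < min) printf "%s acting=%s size=%d\n", $1, $17, n
    }'
}

# One row from the snippet above, re-joined onto a single line:
find_undersized <<'EOF'
3.c80 173 0 346 692 0 715341824 10041 10041 undersized+degraded+remapped+backfill_wait+peered 2017-08-02 19:12:39.319222 12124'104727 12409:62777 [62,76,44] 62 [2] 2 1642'32485 2017-07-18 22:57:06.263727 1008'135 2017-07-09 22:34:40.893182
EOF
```

Against the live cluster it would be 'ceph pg dump | find_undersized'
instead of the heredoc.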

-- 
Sincerely,
Yuri Gorshkov
Systems Engineer
SmartLabs LLC
+7 (495) 645-44-46 ext. 6926
ygorsh...@smartlabs.tv
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com