Hi All,

Is there a known procedure for debugging the PG state when problems like this occur?
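As a starting point, here is a minimal sketch of the standard ceph CLI commands for inspecting one of the affected PGs (PG 3.c80 is just an example taken from the dump quoted below; substitute any stuck PG id):

# ceph health detail                 # lists the PGs behind each health warning
# ceph pg dump_stuck unclean         # stuck PGs with their state and acting set
# ceph pg 3.c80 query                # detailed peering/recovery info, including "blocked_by"
# ceph pg map 3.c80                  # up and acting OSD sets for the PG
# ceph osd tree                      # OSD/host layout and which OSDs are down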
Best regards,
Yuri.

2017-08-28 14:05 GMT+03:00 Yuri Gorshkov <ygorsh...@smartlabs.tv>:
> Hi.
>
> When trying to take down a host for maintenance purposes, I encountered an
> I/O stall along with some PGs unexpectedly marked 'peered'.
>
> Cluster stats: 96/96 OSDs up, healthy prior to the incident, 5120 PGs,
> 4 hosts of 24 OSDs each. Ceph version 11.2.0, using standard filestore
> (with LVM journals on SSD) and the default crush map. All pools are
> size 3, min_size 2.
>
> Steps to reproduce the problem:
> 0. Cluster is healthy, HEALTH_OK.
> 1. Set the noout flag to prepare for host removal.
> 2. Begin taking the OSDs on one of the hosts down: systemctl stop ceph-osd@$osd.
> 3. Notice that I/O has stalled unexpectedly and about 100 PGs in total are in
>    the degraded+undersized+peered state while the host is down.
>
> AFAIK the 'peered' state means that the PG has not been replicated to
> min_size yet, so something strange is going on. Since we have 4 hosts
> and are using the default crush map, how is it possible that after taking
> one host (or even just some OSDs on that host) down, some PGs in the
> cluster are left with fewer than 2 copies?
>
> Here's a snippet of 'ceph pg dump_stuck' from when this happened. Sadly I
> don't have any more information yet...
>
> # ceph pg dump | grep peered
> dumped all in format plain
> 3.c80   173    0  346    692    0  715341824    10041  10041  undersized+degraded+remapped+backfill_wait+peered  2017-08-02 19:12:39.319222  12124'104727  12409:62777   [62,76,44]  62  [2]   2   1642'32485  2017-07-18 22:57:06.263727  1008'135   2017-07-09 22:34:40.893182
> 3.204   184    0  368    649    0  769544192    10065  10065  undersized+degraded+remapped+backfill_wait+peered  2017-08-02 19:12:39.334905  12124'13665   12409:37345   [75,52,1]   75  [2]   2   1375'4316   2017-07-18 00:10:27.601548  1371'2740  2017-07-12 07:48:34.953831
> 11.19   25525  0  51050  78652  0  14829768529  10059  10059  undersized+degraded+remapped+backfill_wait+peered  2017-08-02 19:12:39.311612  12124'156267  12409:137128  [56,26,14]  56  [18]  18  1375'28148  2017-07-17 20:27:04.916079  0'0        2017-07-10 16:12:49.270606
>
> --
> Sincerely,
> Yuri Gorshkov
> Systems Engineer
> SmartLabs LLC
> +7 (495) 645-44-46 ext. 6926
> ygorsh...@smartlabs.tv

--
Sincerely,
Yuri Gorshkov
Systems Engineer
SmartLabs LLC
+7 (495) 645-44-46 ext. 6926
ygorsh...@smartlabs.tv
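P.S. Regarding the question in the quoted mail about how PGs can end up with fewer than two copies after taking down a single host: a minimal sketch of checks that would confirm whether the crush rule for the affected pools really separates replicas by host, and what the current up/acting sets for one of the stuck PGs look like (again using PG 3.c80 from the dump above as an example):

# ceph osd pool ls detail            # size, min_size and crush rule per pool
# ceph osd crush rule dump           # verify the chooseleaf step uses failure domain type "host"
# ceph pg map 3.c80                  # compare the up set with the (shrunken) acting set
# ceph osd df tree                   # per-host / per-OSD layout and utilisation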