Hi,

The host that was taken down has 12 disks in it?

Have a look at the down PGs ('18 pgs down') - I suspect this is what is
causing the I/O freeze.
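
Down+peering PGs block all I/O to the objects that map to them. A couple
of commands to see which PGs they are and why they're stuck (the pgid
below is just a placeholder):

    ceph health detail | grep down
    ceph pg dump_stuck inactive
    ceph pg <pgid> query    # see the "recovery_state" section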

Is your CRUSH map set up correctly to split data over different hosts?
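
If your CRUSH rule replicates per OSD instead of per host, one host going
down can take every copy of some PGs with it. Just as a sketch, the
decompiled map should show the replicated rule choosing leaves by host
(the rule name here is illustrative):

    ceph osd getcrushmap -o /tmp/cm
    crushtool -d /tmp/cm -o /tmp/cm.txt

    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }

The key line is "step chooseleaf firstn 0 type host" - "type osd" there
would allow replicas to land on the same host.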

Thanks

On Tue, Sep 13, 2016 at 11:45 AM, Daznis <daz...@gmail.com> wrote:

> No, no errors about that. I had set noout before it happened, but it
> still started recovery. I added nobackfill, norebalance, norecover,
> noscrub and nodeep-scrub once I noticed it started doing crazy stuff.
> That stopped the recovery I/O, but the cluster can't read anything -
> only writes to the cache layer go through.
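>
> For reference, the flags were set the usual way, roughly:
>
>     ceph osd set noout
>     ceph osd set nobackfill
>     ceph osd set norebalance
>     ceph osd set norecover
>
> (each one is reverted with "ceph osd unset <flag>")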
>
>     cluster cdca2074-4c91-4047-a607-faebcbc1ee17
>      health HEALTH_WARN
>             2225 pgs degraded
>             18 pgs down
>             18 pgs peering
>             89 pgs stale
>             2225 pgs stuck degraded
>             18 pgs stuck inactive
>             89 pgs stuck stale
>             2257 pgs stuck unclean
>             2225 pgs stuck undersized
>             2225 pgs undersized
>             recovery 4180820/11837906 objects degraded (35.317%)
>             recovery 24016/11837906 objects misplaced (0.203%)
>             12/39 in osds are down
>             noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
>      monmap e9: 7 mons at {}
>             election epoch 170, quorum 0,1,2,3,4,5,6
>      osdmap e40290: 40 osds: 27 up, 39 in; 14 remapped pgs
>             flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub
>       pgmap v39326300: 4096 pgs, 4 pools, 21455 GB data, 5780 kobjects
>             42407 GB used, 75772 GB / 115 TB avail
>             4180820/11837906 objects degraded (35.317%)
>             24016/11837906 objects misplaced (0.203%)
>                 2136 active+undersized+degraded
>                 1837 active+clean
>                   89 stale+active+undersized+degraded
>                   18 down+peering
>                   14 active+remapped
>                    2 active+clean+scrubbing+deep
>   client io 0 B/s rd, 9509 kB/s wr, 3469 op/s
>
> On Tue, Sep 13, 2016 at 1:34 PM, M Ranga Swami Reddy
> <swamire...@gmail.com> wrote:
> > Please check whether any OSD is reporting a nearfull error. Can you
> > please share the ceph -s output?
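> >
> > e.g. (assuming a release that has ceph osd df):
> >
> >     ceph osd df
> >     ceph health detail | grep -i full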
> >
> > Thanks
> > Swami
> >
> > On Tue, Sep 13, 2016 at 3:54 PM, Daznis <daz...@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >>
> >> I have encountered a strange I/O freeze while rebooting one OSD node
> >> for maintenance purposes. It was one of the 3 nodes in the entire
> >> cluster. Before this, rebooting or shutting down an entire node just
> >> slowed Ceph down but never completely froze it.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
