Can you please provide the output of `ceph status`, `ceph osd tree`, and `ceph health detail`? Thank you.
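All three can be run from any node with admin access to the cluster, e.g.:

    ceph status
    ceph osd tree
    ceph health detail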
On Tue, Sep 19, 2017 at 2:59 PM Jonas Jaszkowic <jonasjaszkowic.w...@gmail.com> wrote:

> Hi all,
>
> I have set up a Ceph cluster consisting of one monitor, 32 OSD hosts (1 OSD
> of size 320GB per host), and 16 clients which are reading and writing to
> the cluster. I have one erasure-coded pool (shec plugin) with k=8, m=4,
> c=3, and pg_num=256. The failure domain is host. I am able to reach a
> HEALTH_OK state and everything is working as expected. The pool was
> populated with 114048 files of different sizes ranging from 1kB to 4GB.
> The total amount of data in the pool was around 3TB; the capacity of the
> pool was around 10TB.
>
> I want to evaluate how Ceph rebalances data in case of an OSD loss while
> clients are still reading. To do so, I am taking one OSD out on purpose
> via *ceph osd out <osd-id>*, without adding a new one, i.e. I have 31 OSDs
> left. Ceph seems to notice this failure and starts to rebalance data,
> which I can observe with the *ceph -w* command.
>
> However, Ceph failed to rebalance the data. The recovery process seemed
> to be stuck at a random point. I waited more than 12h, but the number of
> degraded objects did not decrease and some PGs were stuck. Why is this
> happening? Based on the number of OSDs and the k, m, c values, there
> should be enough hosts and OSDs to recover from a single OSD failure.
>
> Thank you in advance!
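For reference, a shec profile and pool matching the setup described above can be created along these lines (the profile and pool names here are illustrative, and on pre-Luminous releases the failure-domain key is spelled ruleset-failure-domain rather than crush-failure-domain):

    # create a shec erasure-code profile with k=8, m=4, c=3, failure domain host
    ceph osd erasure-code-profile set shec-8-4-3 \
        plugin=shec k=8 m=4 c=3 \
        crush-failure-domain=host

    # create the pool with pg_num=256 using that profile
    ceph osd pool create ecpool 256 256 erasure shec-8-4-3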
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com