My first question would be about your CRUSH rules. `ceph osd crush rule dump` along with `ceph osd pool ls detail` would be helpful. A `ceph status` output captured at a time when the VM RBDs aren't working might also explain something.
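Roughly, that would be the following (the output will of course look different on your cluster, and the health commands are most useful when run while the VMs are actually hanging):

  # CRUSH rules and pool settings, to check replica count and failure domain
  ceph osd crush rule dump
  ceph osd pool ls detail

  # overall cluster state, ideally while the RBD-backed VMs are blocked
  ceph status
  ceph health detail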
On Thu, Oct 11, 2018 at 1:12 PM Nils Fahldieck - Profihost AG <n.fahldi...@profihost.ag> wrote:

> Hi everyone,
>
> for some time we have been experiencing service outages in our Ceph cluster
> whenever there is any change to the HEALTH status, e.g. swapping
> storage devices, adding storage devices, rebooting Ceph hosts, during
> backfills etc.
>
> Just now I had a situation where several VMs hung after I
> rebooted one Ceph host. We have 3 replicas for each PG, 3 mons, 3
> mgrs, 3 MDSs and 71 OSDs spread over 9 hosts.
>
> We use Ceph as the storage backend for our Proxmox VE (PVE) environment.
> The outages take the form of blocked virtual file systems in the
> virtual machines running in our PVE cluster.
>
> It feels similar to stuck and inactive PGs to me. Honestly, though, I'm
> not really sure how to debug this problem or which log files to examine.
>
> OS: Debian 9
> Kernel: 4.12 based upon SLE15-SP1
>
> # ceph version
> ceph version 12.2.8-133-gded2f6836f
> (ded2f6836f6331a58f5c817fca7bfcd6c58795aa) luminous (stable)
>
> Can someone guide me? I'm more than happy to provide more information
> as needed.
>
> Thanks in advance
> Nils
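Since it feels like stuck or inactive PGs to you, a rough set of checks to run during an outage would be something like the following (the log path and OSD id are only examples; adjust them for your hosts, and run the daemon command on the host where that OSD lives):

  # PGs that are not active / not clean
  ceph pg dump_stuck inactive
  ceph pg dump_stuck unclean

  # OSDs blocking peering after a host reboot
  ceph osd blocked-by

  # slow/blocked requests in the cluster log and on a specific OSD
  grep -i 'slow request' /var/log/ceph/ceph.log
  ceph daemon osd.0 dump_ops_in_flight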