On 2020-08-24 20:35, Mathijs Smit wrote:
> Hi everyone,
>
> I have a serious problem: my entire Ceph cluster is currently unable to
> provide service. Yesterday I added 10 OSDs in total, 2 per node. The
> rebalance started and took some IO, but seemed to be doing its work.
> This morning the cluster was still processing the rebalance and taking so
> much IO that nearly all OSDs were marked with "slow ops", and from there
> everything went wrong. In an attempt to free up as much IO as possible for
> the rebalance, I stopped all the clients and waited for the rebalance to
> finish. After it finished, the cluster remained extremely slow and unusable.
> While trying to debug, I restarted several services and nodes to find the
> problem. Now the cluster has entered a state where multiple OSDs remain
> slow, various OSDs show a "BADAUTHORIZER" message, and the mgr on all nodes
> also reports "verify_authorizer" issues.
>
> I verified the clocks on all servers: they are synced to the same NTP
> service and seem good.
>
> Please please please advise, as 13 hours of straight debugging got me
> nowhere.

If you can pause all client IO, I don't think it would hurt to upgrade
everything to 14.2.11, just to be sure you are not hitting a bug that has
already been fixed. But that depends on the current health of your cluster,
I guess.

Gr. Stefan
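
Before and after such an upgrade it can help to confirm that no daemon is left on an older release. The sketch below is only an illustration, under the assumption that you have the JSON printed by `ceph versions` (a map of daemon type to version label to daemon count). The helper names and the sample counts are hypothetical and not taken from the cluster discussed here.

```python
import re

# Release suggested in the reply above.
TARGET = (14, 2, 11)


def parse_version(label):
    """Extract (major, minor, patch) from a label such as
    'ceph version 14.2.9 (abc123) nautilus (stable)'."""
    match = re.search(r"ceph version (\d+)\.(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognised version label: {label!r}")
    return tuple(int(part) for part in match.groups())


def daemons_below(versions, target=TARGET):
    """Return {daemon_type: {version_tuple: count}} for daemons older than target."""
    outdated = {}
    for daemon_type, counts in versions.items():
        if daemon_type == "overall":
            # 'overall' only aggregates the other entries; skip it.
            continue
        for label, count in counts.items():
            version = parse_version(label)
            if version < target:
                outdated.setdefault(daemon_type, {})[version] = count
    return outdated


# Hypothetical sample shaped like `ceph versions` output (assumed layout).
sample = {
    "mon": {"ceph version 14.2.11 (abc123) nautilus (stable)": 3},
    "mgr": {"ceph version 14.2.11 (abc123) nautilus (stable)": 3},
    "osd": {
        "ceph version 14.2.9 (def456) nautilus (stable)": 40,
        "ceph version 14.2.11 (abc123) nautilus (stable)": 10,
    },
    "overall": {
        "ceph version 14.2.9 (def456) nautilus (stable)": 40,
        "ceph version 14.2.11 (abc123) nautilus (stable)": 16,
    },
}

if __name__ == "__main__":
    print(daemons_below(sample))
```

On a real cluster the dictionary would come from parsing the command output with `json.loads` instead of the hard-coded sample. An empty result would suggest that every daemon reports at least the target release.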