On 2020-08-24 20:35, Mathijs Smit wrote:
> Hi everyone,
> 
> I have a serious problem: my entire Ceph cluster is no longer able to
> provide service. Yesterday I added 10 OSDs in total, 2 per node; the
> rebalance started and took some IO but seemed to be doing its work. This
> morning the cluster was still processing the rebalance and taking so much
> IO that nearly all OSDs were marked as "slow ops", and from there
> everything went wrong. In an attempt to free up as much IO as possible for
> the rebalance, I stopped all the clients and waited for the rebalance to
> finish. After it finished, the cluster remained extremely slow and
> unusable. While trying to debug, I restarted several services and nodes
> to find the problem. Now the cluster has entered a state where multiple
> OSDs remain slow, various OSDs show a "BADAUTHORIZER" message, and the mgr
> on all nodes also reports "verify_authorizer" issues.
> 
> I verified the clocks on all servers; they are synced to the same NTP
> service and seem good.
> 
> Please, please, please advise: 13 hours straight of debugging got me nowhere.

If you can pause all client IO, I don't think it can hurt to upgrade
everything to 14.2.11, just to be sure you don't hit a bug that has
already been fixed. But I guess it depends on the current health of
your cluster.
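
To see which daemons would still need the upgrade, something like the
following Python sketch might help. It assumes the JSON printed by
`ceph versions` maps each daemon type to version strings and counts; the
helper name `daemons_not_on` and the sample data are made up for
illustration and are not taken from your cluster.

```python
import json


def daemons_not_on(versions_json: str, target: str) -> dict:
    """Count daemons per type whose version string does not contain `target`.

    Assumes input shaped like the output of `ceph versions`:
    {"osd": {"ceph version 14.2.9 (...) nautilus (stable)": 10}, ...}
    """
    report = json.loads(versions_json)
    stale = {}
    for daemon_type, by_version in report.items():
        if daemon_type == "overall":
            continue  # summary entry, would double-count the daemons
        count = sum(
            n for version, n in by_version.items()
            if target not in version.split()
        )
        if count:
            stale[daemon_type] = count
    return stale


# Illustrative sample only; "(x)" and "(y)" stand in for build hashes.
SAMPLE = json.dumps({
    "mon": {"ceph version 14.2.11 (x) nautilus (stable)": 3},
    "osd": {
        "ceph version 14.2.9 (y) nautilus (stable)": 10,
        "ceph version 14.2.11 (x) nautilus (stable)": 2,
    },
    "overall": {
        "ceph version 14.2.9 (y) nautilus (stable)": 10,
        "ceph version 14.2.11 (x) nautilus (stable)": 5,
    },
})

print(daemons_not_on(SAMPLE, "14.2.11"))  # {'osd': 10}
```

An empty result would mean every daemon type already reports the target
version.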

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io