I have a server with 12 OSDs on it. Five of them are unable to start and give 
the following error message in their logs:

2020-01-28 13:00:41.760 7f61fb490c80  0 monclient: wait_auth_rotating timed out after 30
2020-01-28 13:00:41.760 7f61fb490c80 -1 osd.178 411005 unable to obtain rotating service keys; retrying
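
In case it helps anyone check their own logs for the same symptom, here is a minimal sketch that counts this error per OSD. The field layout in the regex is inferred only from the two log lines quoted above (each taken as a single unwrapped line), the name `failing_osds` is made up for illustration, and treating the number after the OSD name as the osdmap epoch is my assumption.

```python
import re

# Field layout (timestamp, thread id, level, "osd.<id> <number>") is inferred
# from the two quoted log lines, not taken from Ceph's source code.
# Assumption: the number after "osd.<id>" is the osdmap epoch.
ROTATING_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+)\s+"
    r"osd\.(?P<osd>\d+)\s+(?P<epoch>\d+)\s+"
    r"unable to obtain rotating service keys"
)


def failing_osds(lines):
    """Return {osd_id: count} for lines reporting the rotating-keys error."""
    counts = {}
    for line in lines:
        m = ROTATING_RE.match(line.strip())
        if m:
            osd = int(m.group("osd"))
            counts[osd] = counts.get(osd, 0) + 1
    return counts


# The two lines from my logs, each joined back into a single line.
sample = [
    "2020-01-28 13:00:41.760 7f61fb490c80  0 monclient: wait_auth_rotating timed out after 30",
    "2020-01-28 13:00:41.760 7f61fb490c80 -1 osd.178 411005 unable to obtain rotating service keys; retrying",
]
print(failing_osds(sample))
```

Only the second line matches, because the first one comes from monclient and carries no OSD name. To run it over a real log, pass an open file object instead of `sample`.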

These OSDs were up and running when they suddenly died on me. I tried to 
restart them and they failed to come up. I rebooted the node and they did not 
recover. All 5 died within a few hours and were already down by the time I 
started poking at them. I previously had this happen with 2 other OSDs, one on 
each of 2 servers, each with 12 OSDs. I ended up just purging and recreating 
those OSDs. I would really like to find a fix for this problem that does not 
involve purging the OSDs.

I have tried stopping and starting all monitors and managers, one at a time, 
and all at the same time. Additionally, all servers in the cluster have been 
restarted over the past couple of days for various other reasons.

I am on Ceph 14.2.6 on Debian buster, using the Debian packages. All of my 
servers are kept in time sync via NTP, and I have verified multiple times that 
they remain in sync.

I have googled the error message and tried all of the solutions I found, but 
nothing makes any difference.

I would appreciate any constructive advice.

Thanks.

-- ray

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
