Hi everyone,

I have a serious problem: my entire Ceph cluster is no longer able to provide 
service. Yesterday I added 10 OSDs in total, 2 per node. The rebalance started 
and took some IO, but seemed to be doing its work. This morning the cluster was 
still rebalancing and using so much IO that nearly all OSDs were flagged with 
"slow ops", and from there everything went wrong. In an attempt to free up as 
much IO as possible for the rebalance, I stopped all the clients and waited for 
the rebalance to finish. After it finished, the cluster remained extremely slow 
and unusable. While debugging, I restarted several services and nodes to try to 
find the problem. Now the cluster has entered a state where multiple OSDs 
remain slow, various OSDs show a "BADAUTHORIZER" message, and the mgr on all 
nodes also logs "verify_authorizer" errors.

I verified the clocks on all servers; they are synced to the same NTP service 
and look correct.

Please, please, please advise, as 13 straight hours of debugging have gotten me 
nowhere.

Current version: ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) 
nautilus (stable)

Mgr error example:

2020-08-24 20:19:12.865 7f2baf56a700  0 cephx: verify_authorizer could not get 
service secret for service mgr secret_id=12230
2020-08-24 20:19:13.043 7f2bb056c700  0 auth: could not find secret_id=12230
2020-08-24 20:19:13.043 7f2bb056c700  0 cephx: verify_authorizer could not get 
service secret for service mgr secret_id=12230
2020-08-24 20:19:13.210 7f2bb056c700  0 auth: could not find secret_id=12230
2020-08-24 20:19:13.210 7f2bb056c700  0 cephx: verify_authorizer could not get 
service secret for service mgr secret_id=12230

OSD error example:
2020-08-24 19:47:15.777 7f9957d79700 -1 osd.19 41255 get_health_metrics 
reporting 72 slow ops, oldest is osd_op(mds.0.1510:4 12.a6 12.4b2c82a6 
(undecoded) ondisk+retry+read+known_if_redirected+full_force e41119)
2020-08-24 19:47:15.833 7f995c88b700  0 auth: could not find secret_id=12230
2020-08-24 19:47:15.833 7f995c88b700  0 cephx: verify_authorizer could not get 
service secret for service osd secret_id=12230
2020-08-24 19:47:15.833 7f995c88b700  0 --1- 
[v2:10.201.1.17:6814/1030299,v1:10.201.1.17:6815/1030299] >> 
v1:10.201.1.20:6823/1023281 conn(0x55affb8e7c00 0x55b007441000 :6815 
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got 
bad authorizer, auth_reply_len=0
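
In case it helps anyone compare logs across daemons, here is a minimal Python 
sketch that tallies the "could not find secret_id" lines per secret_id. The 
function name and the sample list are only illustrative (they are not part of 
Ceph), and the pattern assumes the log format shown in the excerpts above.

```python
import re
from collections import Counter

# Matches the "auth: could not find secret_id=N" lines quoted above.
SECRET_RE = re.compile(r"auth: could not find secret_id=(\d+)")

def count_missing_secret_ids(lines):
    """Return a Counter mapping secret_id -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = SECRET_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Sample lines copied from the mgr and OSD excerpts above.
sample = [
    "2020-08-24 20:19:13.043 7f2bb056c700  0 auth: could not find secret_id=12230",
    "2020-08-24 20:19:13.210 7f2bb056c700  0 auth: could not find secret_id=12230",
    "2020-08-24 19:47:15.833 7f995c88b700  0 auth: could not find secret_id=12230",
]
print(count_missing_secret_ids(sample))
```

Feeding it the lines of each daemon's log file in turn would show whether every 
daemon is asking for the same secret_id or for different ones.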