Cluster is back and clean again. So I started adding plugins and such back to the mix.
After adding the 'balancer' back, I got crashes in the mgr log. ceph-post-file: 0feb1562-cdc5-4a99-86ee-91006eaf6056 Turned balancer back off for now. On Tue, Apr 9, 2019 at 9:38 AM Shawn Edwards <lesser.e...@gmail.com> wrote: > Update: > > I think we have a work-around, but no root cause yet. > > What is working is removing the 'v2' bits from the ceph.conf file across > the cluster, and turning off all cephx authentication. Now everything > seems to be talking correctly other than some odd metrics around the edges. > > Here's my current ceph.conf, running on all ceph hosts and clients: > > [global] > fsid = 3f390b5e-2b1d-4a2f-ba00-xxxxxxxxxxxx > mon_host = [v1:10.36.9.43:6789/0] [v1:10.36.9.44:6789/0] [v1: > 10.36.9.45:6789/0] > auth_client_required = none > auth_cluster_required = none > auth_service_required = none > > If we get better information as to what's going on, I'll post here for > future reference > > > On Thu, Apr 4, 2019 at 9:16 AM Sage Weil <s...@newdream.net> wrote: > >> On Thu, 4 Apr 2019, Shawn Edwards wrote: >> > It was disabled in a fit of genetic debugging. I've now tried to revert >> > all config settings related to auth and signing to defaults. >> > >> > I can't seem to change the auth_*_required settings. If I try to remove >> > them, they stay set. If I try to change them, I get both the old and >> new >> > settings: >> > >> > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)' >> > global advanced auth_client_required cephx >> > * >> > global advanced auth_cluster_required cephx >> > * >> > global advanced auth_service_required cephx >> > * >> > root@tyr-ceph-mon0:~# ceph config rm global auth_service_required >> > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)' >> > global advanced auth_client_required cephx >> > * >> > global advanced auth_cluster_required cephx >> > * >> > global advanced auth_service_required cephx >> > * >> > root@tyr-ceph-mon0:~# ceph config set global auth_service_required none >> > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)' >> > global advanced auth_client_required cephx >> > * >> > global advanced auth_cluster_required cephx >> > * >> > global advanced auth_service_required none >> > * >> > global advanced auth_service_required cephx >> > * >> > >> > I know these are set to RO, but according to your blog posts, this means >> > they don't get updated until a daemon restart. Does this look correct >> to >> > you? I'm assuming I need to restart all daemons on all hosts. Is this >> > correct? >> >> Yeah, that is definitely not behaving properly. Can you try "ceph >> config-key dump | grep config/" to look at how those keys are stored? >> You >> should see something like >> >> "config/auth_cluster_required": "cephx", >> "config/auth_service_required": "cephx", >> "config/auth_service_ticket_ttl": "3600.000000", >> >> but maybe those names are formed differently, maybe with ".../global/..." >> in there? My guess is a subtle naming behavior change between mimic or >> something. You can remove the keys via the config-key interface and then >> restart the mons (or adjust any random config option) to make the >> mons refresh. After that config dump should show the right thing. >> >> Maybe a disagreement/confusion about the actual value of >> auth_service_ticket_ttl is the cause of this. You might try doing 'ceph >> config show osd.0' and/or a mon to see what value for the auth options >> the >> daemons are actually using and reporting... >> >> sage >> >> >> > >> > On Thu, Apr 4, 2019 at 5:54 AM Sage Weil <s...@newdream.net> wrote: >> > >> > > That log shows >> > > >> > > 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: tick >> > > 2019-04-03 15:39:53.299 7f3733f18700 10 cephx: validate_tickets want >> 53 >> > > have 53 need 0 >> > > 2019-04-03 15:39:53.299 7f3733f18700 20 cephx client: need_tickets: >> > > want=53 have=53 need=0 >> > > 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: >> _check_auth_rotating >> > > have uptodate secrets (they expire after 2019-04-03 15:39:23.301595) >> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: dump_rotating: >> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: id 41691 A4Q== expires >> > > 2019-04-03 14:43:07.042860 >> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: id 41692 AD9Q== expires >> > > 2019-04-03 15:43:09.895511 >> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: id 41693 ADQ== expires >> > > 2019-04-03 16:43:09.895511 >> > > >> > > which is all fine. It is getting BADAUTHORIZER talking to another >> OSD, >> > > but I'm guessing it's because that other OSD doesn't have the right >> > > tickets. It's hard to tell what's wrong without having al the OSD >> logs >> > > and being able to see the matching ticket renewals (or lack thereof) >> on >> > > the other end of a specific connection. >> > > >> > > I missed this before: >> > > >> > > WHO MASK LEVEL OPTION VALUE >> > > >> > > RO >> > > global advanced auth_client_required cephx >> > > >> > > * >> > > global advanced auth_cluster_required cephx >> > > >> > > * >> > > global advanced auth_service_required cephx >> > > >> > > * >> > > >> > > ^ Note that these three options aren't actually needed. The only >> > > non-default value is auth_client_required, and it's 'cephx, none' >> meaning >> > > the client-side code will allow itself to connect to a cluster with >> auth >> > > disabled. >> > > >> > > global advanced auth_service_ticket_ttl 3600.000000 >> > > global advanced cephx_sign_messages false >> > > >> > > The fact that the tickets aren't renewing properly suggests that >> there's >> > > something broken when the ttl option is modified (although it's not >> clear >> > > what yet). Similarly, disabling message signing isn't common either >> and >> > > isn't covered by the automated test suite. If you're just trying to >> get >> > > the cluster up I'd remove one or both of those two options and >> restart all >> > > daemons. >> > > >> > > Looking closer, I see lots of these: >> > > >> > > 2019-04-03 15:39:56.435 7f374a1fb700 10 _calc_signature seq 46 >> front_crc_ >> > > = 1548381741 middle_crc = 0 data_crc = 0 sig = 7970349432988260882 >> > > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 Sender did not >> set >> > > CEPH_MSG_FOOTER_SIGNED. >> > > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 Message signature >> > > does not match contents. >> > > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46Signature on >> message: >> > > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 sig: 0 >> > > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46Locally calculated >> > > signature: >> > > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 >> > > sig_check:7970349432988260882 >> > > 2019-04-03 15:39:56.435 7f374a1fb700 0 Signature failed. >> > > 2019-04-03 15:39:56.435 7f374a1fb700 0 --1- v1:10.36.9.48:6817/47343 >> >> >> > > v1:10.36.9.37:6849/12739 conn(0x58ef0400 0x3c2fd800 :6817 >> > > s=READ_FOOTER_AND_DISPATCH pgs=3071352 cs=412808 >> l=0).handle_message_footer >> > > Signature check failed >> > > >> > > I'm not sure which peer it's talking to here, but it looks like the >> > > cephx_sign_messages setting isn't being uniformly applied. Try >> removing >> > > that option! >> > > >> > > (Out of curiosity, why was it disabled?) >> > > >> > > sage >> > > >> > > >> > > >> > > >> > > On Wed, 3 Apr 2019, Shawn Edwards wrote: >> > > > ceph-post-file: 789769c4-e7e4-47d3-8fb7-475ea4cfe14a >> > > > >> > > > This should have the information you need. >> > > > >> > > > On Wed, Apr 3, 2019 at 5:49 PM Sage Weil <s...@newdream.net> wrote: >> > > > >> > > > > This OSD also appears on teh accepting end of things, and probably >> > > > > has newer keys that the OSD connecting (tho it' shard to tell >> > > > > because teh debug level isn't turned up). >> > > > > >> > > > > The goal is to find an osd still throwing BADAUTHORIZER messages >> and >> > > then >> > > > > turn of debug_auth=20 and debug_monc=20. Specifically I'm >> looking for >> > > the >> > > > > output from MonClient::_check_auth_rotating(), which should tell >> us >> > > > > whether the OSD thinks its rotating keys are up to date, or >> whether it >> > > is >> > > > > having trouble renewing them, or what. It's weird that some OSDs >> are >> > > > > using hours-old keys to try to authenticate. :/ >> > > > > >> > > > > sage >> > > > > >> > > > > >> > > > > On Wed, 3 Apr 2019, Shawn Edwards wrote: >> > > > > >> > > > > > ceph-post-file: 60e07a0c-ee5b-4174-9f51-fa091d5662dc >> > > > > > >> > > > > > On Wed, Apr 3, 2019 at 5:30 PM Shawn Edwards < >> lesser.e...@gmail.com> >> > > > > wrote: >> > > > > > >> > > > > > > According to ceph versions, all bits are running 14.2.0 >> > > > > > > >> > > > > > > I have restarted all of the OSD at least twice and am still >> > > getting the >> > > > > > > same error. >> > > > > > > >> > > > > > > I'll send a log file with confirmed interesting bad behavior >> > > shortly >> > > > > > > >> > > > > > > On Wed, Apr 3, 2019, 17:17 Sage Weil <s...@newdream.net> >> wrote: >> > > > > > > >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 --1- v1: >> > > > > 10.36.9.46:6813/5003637 >> > > > > > >> >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813 >> > > > > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 >> > > > > l=0).handle_connect_message_2 >> > > > > > >> authorizor_protocol 2 len 174 >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 20 >> AuthRegistry(0xcd64a40) >> > > > > > >> get_handler peer_type 4 method 2 cluster_methods [2] >> > > service_methods >> > > > > [2] >> > > > > > >> client_methods [2] >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 cephx: >> verify_authorizer >> > > > > > >> decrypted service osd secret_id=41686 >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 0 auth: could not find >> > > > > > >> secret_id=41686 >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: dump_rotating: >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41691 ... >> > > expires >> > > > > > >> 2019-04-03 14:43:07.042860 >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41692 ... >> > > expires >> > > > > > >> 2019-04-03 15:43:09.895511 >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41693 ... >> > > expires >> > > > > > >> 2019-04-03 16:43:09.895511 >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 0 cephx: >> verify_authorizer >> > > could >> > > > > > >> not get service secret for service osd secret_id=41686 >> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 0 --1- v1: >> > > > > 10.36.9.46:6813/5003637 >> > > > > > >> >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813 >> > > > > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 >> > > > > l=0).handle_connect_message_2: >> > > > > > >> got bad authorizer, auth_reply_len=0 >> > > > > > >> >> > > > > > >> For some reason this OSD has much newer rotating keys than >> the >> > > > > > >> connecting OSD. But earlier in the day, this osd was the one >> > > > > > >> getting BADAUTHORIZER, so maybe that shifted. Can you find >> an OSD >> > > > > where >> > > > > > >> you still see BADAUTHORIZER appearing in the log? >> > > > > > >> >> > > > > > >> My guess is that if you restart the OSDs, they'll get fresh >> > > rotating >> > > > > keys >> > > > > > >> and things will be fine. But that doesn't explain why >> they're not >> > > > > > >> renewing on their own right now.. that I'm not so sure about. >> > > > > > >> >> > > > > > >> Are your mons all running nautilus? Does 'ceph versions' >> show >> > > > > everything >> > > > > > >> has upgraded? >> > > > > > >> >> > > > > > >> sage >> > > > > > >> >> > > > > > >> >> > > > > > >> On Wed, 3 Apr 2019, Shawn Edwards wrote: >> > > > > > >> >> > > > > > >> > File uploaded: f1a2bfb3-92b4-495c-8706-f99cb228efc7 >> > > > > > >> > >> > > > > > >> > On Wed, Apr 3, 2019 at 4:57 PM Sage Weil < >> s...@newdream.net> >> > > wrote: >> > > > > > >> > >> > > > > > >> > > Hmm, that doesn't help. >> > > > > > >> > > >> > > > > > >> > > Can you set >> > > > > > >> > > >> > > > > > >> > > ceph config set osd debug_ms 20 >> > > > > > >> > > ceph config set osd debug_auth 20 >> > > > > > >> > > ceph config set osd debug_monc 20 >> > > > > > >> > > >> > > > > > >> > > for a few minutes and ceph-post-file the osd logs? (Or >> send a >> > > > > private >> > > > > > >> > > email with a link or something.) >> > > > > > >> > > >> > > > > > >> > > Thanks! >> > > > > > >> > > sage >> > > > > > >> > > >> > > > > > >> > > >> > > > > > >> > > On Wed, 3 Apr 2019, Shawn Edwards wrote: >> > > > > > >> > > >> > > > > > >> > > > No strange auth config: >> > > > > > >> > > > >> > > > > > >> > > > root@tyr-ceph-mon0:~# ceph config dump | grep -E >> > > '(auth|cephx)' >> > > > > > >> > > > global advanced auth_client_required >> > > cephx >> > > > > > >> > > > * >> > > > > > >> > > > global advanced auth_cluster_required >> > > cephx >> > > > > > >> > > > * >> > > > > > >> > > > global advanced auth_service_required >> > > cephx >> > > > > > >> > > > * >> > > > > > >> > > > >> > > > > > >> > > > All boxes are using 'minimal' ceph.conf files and >> > > centralized >> > > > > > >> config. >> > > > > > >> > > > >> > > > > > >> > > > If you need the full config, it's here: >> > > > > > >> > > > >> > > > > >> https://gist.github.com/lesserevil/3b82d37e517f4561ce53c81629717aae >> > > > > > >> > > > >> > > > > > >> > > > On Wed, Apr 3, 2019 at 4:07 PM Sage Weil < >> s...@newdream.net >> > > > >> > > > > wrote: >> > > > > > >> > > > >> > > > > > >> > > > > On Wed, 3 Apr 2019, Shawn Edwards wrote: >> > > > > > >> > > > > > Recent nautilus upgrade from mimic. No issues on >> mimic. >> > > > > > >> > > > > > >> > > > > > >> > > > > > Now getting this or similar in all osd logs, there >> is >> > > very >> > > > > > >> little osd >> > > > > > >> > > > > > communicatoin, and most of the PG are either >> 'down' or >> > > > > > >> 'unknown', >> > > > > > >> > > even >> > > > > > >> > > > > > though I can see the data on the filestores. >> > > > > > >> > > > > > >> > > > > > >> > > > > > 2019-04-03 13:47:55.280 7f13346e3700 0 --1- [v2: >> > > > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> >> v1: >> > > > > > >> > > 10.36.9.37:6821/8825 >> > > > > > >> > > > > > conn(0xa7132000 0xa6b28000 :-1 >> > > s=CONNECTING_SEND_CONNECT_MSG >> > > > > > >> pgs=0 >> > > > > > >> > > cs=0 >> > > > > > >> > > > > > l=0).handle_connect_reply_2 connect got >> BADAUTHORIZER >> > > > > > >> > > > > > 2019-04-03 13:47:55.296 7f1333ee2700 0 --1- [v2: >> > > > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> >> v1: >> > > > > > >> > > > > 10.36.9.37:6841/11204 >> > > > > > >> > > > > > conn(0xa9826d00 0xa9b78000 :-1 >> > > s=CONNECTING_SEND_CONNECT_MSG >> > > > > > >> pgs=0 >> > > > > > >> > > cs=0 >> > > > > > >> > > > > > l=0).handle_connect_reply_2 connect got >> BADAUTHORIZER >> > > > > > >> > > > > > 2019-04-03 13:47:55.340 7f13346e3700 0 --1- [v2: >> > > > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> >> v1: >> > > > > > >> > > 10.36.9.37:6829/8425 >> > > > > > >> > > > > > conn(0xa7997180 0xaeb22800 :-1 >> > > s=CONNECTING_SEND_CONNECT_MSG >> > > > > > >> pgs=0 >> > > > > > >> > > cs=0 >> > > > > > >> > > > > > l=0).handle_connect_reply_2 connect got >> BADAUTHORIZER >> > > > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 auth: >> could not >> > > find >> > > > > > >> > > > > secret_id=41687 >> > > > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 cephx: >> > > > > verify_authorizer >> > > > > > >> > > could >> > > > > > >> > > > > not >> > > > > > >> > > > > > get service secret for service osd secret_id=41687 >> > > > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 --1- [v2: >> > > > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> >> v1: >> > > > > > >> > > > > 10.36.9.48:6805/49547 >> > > > > > >> > > > > > conn(0xe02f24480 0xe088cb800 :6803 >> > > > > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH >> > > > > > >> > > > > pgs=0 >> > > > > > >> > > > > > cs=0 l=0).handle_connect_message_2: got bad >> authorizer, >> > > > > > >> > > auth_reply_len=0 >> > > > > > >> > > > > > >> > > > > > >> > > > > > Thoughts? I have confirmed that all ceph boxes >> have >> > > good >> > > > > time >> > > > > > >> sync. >> > > > > > >> > > > > >> > > > > > >> > > > > Do you have any non-default auth-related settings in >> > > > > ceph.conf? >> > > > > > >> > > > > >> > > > > > >> > > > > sage >> > > > > > >> > > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > -- >> > > > > > >> > > > Shawn Edwards >> > > > > > >> > > > Beware programmers with screwdrivers. They tend to >> spill >> > > them >> > > > > on >> > > > > > >> their >> > > > > > >> > > > keyboards. >> > > > > > >> > > > >> > > > > > >> > > >> > > > > > >> > >> > > > > > >> > >> > > > > > >> > -- >> > > > > > >> > Shawn Edwards >> > > > > > >> > Beware programmers with screwdrivers. They tend to spill >> them >> > > on >> > > > > their >> > > > > > >> > keyboards. >> > > > > > >> > >> > > > > > >> >> > > > > > > >> > > > > > >> > > > > > -- >> > > > > > Shawn Edwards >> > > > > > Beware programmers with screwdrivers. They tend to spill them >> on >> > > their >> > > > > > keyboards. >> > > > > > >> > > > > >> > > > >> > > > >> > > > -- >> > > > Shawn Edwards >> > > > Beware programmers with screwdrivers. They tend to spill them on >> their >> > > > keyboards. >> > > > >> > > >> > >> > >> > -- >> > Shawn Edwards >> > Beware programmers with screwdrivers. They tend to spill them on their >> > keyboards. >> > >> > > > -- > Shawn Edwards > Beware programmers with screwdrivers. They tend to spill them on their > keyboards. > -- Shawn Edwards Beware programmers with screwdrivers. They tend to spill them on their keyboards.
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com