Cluster is back and clean again.  So I started adding plugins and such back
to the mix.

After adding the 'balancer' back, I got crashes in the mgr log.

ceph-post-file: 0feb1562-cdc5-4a99-86ee-91006eaf6056

Turned balancer back off for now.

On Tue, Apr 9, 2019 at 9:38 AM Shawn Edwards <lesser.e...@gmail.com> wrote:

> Update:
>
> I think we have a work-around, but no root cause yet.
>
> What is working is removing the 'v2' bits from the ceph.conf file across
> the cluster, and turning off all cephx authentication.  Now everything
> seems to be talking correctly other than some odd metrics around the edges.
>
> Here's my current ceph.conf, running on all ceph hosts and clients:
>
> [global]
>         fsid = 3f390b5e-2b1d-4a2f-ba00-xxxxxxxxxxxx
>         mon_host = [v1:10.36.9.43:6789/0] [v1:10.36.9.44:6789/0] [v1:10.36.9.45:6789/0]
>         auth_client_required = none
>         auth_cluster_required = none
>         auth_service_required = none
>
> If we get better information as to what's going on, I'll post here for
> future reference
>
>
> On Thu, Apr 4, 2019 at 9:16 AM Sage Weil <s...@newdream.net> wrote:
>
>> On Thu, 4 Apr 2019, Shawn Edwards wrote:
>> > It was disabled in a fit of genetic debugging.  I've now tried to revert
>> > all config settings related to auth and signing to defaults.
>> >
>> > I can't seem to change the auth_*_required settings.  If I try to remove
>> > them, they stay set.  If I try to change them, I get both the old and new
>> > settings:
>> >
>> > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
>> > global        advanced auth_client_required               cephx    *
>> > global        advanced auth_cluster_required              cephx    *
>> > global        advanced auth_service_required              cephx    *
>> > root@tyr-ceph-mon0:~# ceph config rm global auth_service_required
>> > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
>> > global        advanced auth_client_required               cephx    *
>> > global        advanced auth_cluster_required              cephx    *
>> > global        advanced auth_service_required              cephx    *
>> > root@tyr-ceph-mon0:~# ceph config set global auth_service_required none
>> > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
>> > global        advanced auth_client_required               cephx    *
>> > global        advanced auth_cluster_required              cephx    *
>> > global        advanced auth_service_required              none     *
>> > global        advanced auth_service_required              cephx    *
>> >
>> > I know these are set to RO, but according to your blog posts, this means
>> > they don't get updated until a daemon restart.  Does this look correct to
>> > you?  I'm assuming I need to restart all daemons on all hosts.  Is this
>> > correct?
>>
>> Yeah, that is definitely not behaving properly.  Can you try "ceph
>> config-key dump | grep config/" to look at how those keys are stored?
>> You should see something like
>>
>>     "config/auth_cluster_required": "cephx",
>>     "config/auth_service_required": "cephx",
>>     "config/auth_service_ticket_ttl": "3600.000000",
>>
>> but maybe those names are formed differently, maybe with ".../global/..."
>> in there?  My guess is a subtle naming behavior change between mimic and
>> nautilus or something.  You can remove the keys via the config-key interface
>> and then restart the mons (or adjust any random config option) to make the
>> mons refresh.  After that, config dump should show the right thing.
>>
>> Maybe a disagreement/confusion about the actual value of
>> auth_service_ticket_ttl is the cause of this.  You might try doing 'ceph
>> config show osd.0' and/or a mon to see what value for the auth options
>> the daemons are actually using and reporting...
>>
>> sage
>>
>>
>> >
>> > On Thu, Apr 4, 2019 at 5:54 AM Sage Weil <s...@newdream.net> wrote:
>> >
>> > > That log shows
>> > >
>> > > 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: tick
>> > > 2019-04-03 15:39:53.299 7f3733f18700 10 cephx: validate_tickets want 53 have 53 need 0
>> > > 2019-04-03 15:39:53.299 7f3733f18700 20 cephx client: need_tickets: want=53 have=53 need=0
>> > > 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-04-03 15:39:23.301595)
>> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: dump_rotating:
>> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth:  id 41691 A4Q== expires 2019-04-03 14:43:07.042860
>> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth:  id 41692 AD9Q== expires 2019-04-03 15:43:09.895511
>> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth:  id 41693 ADQ== expires 2019-04-03 16:43:09.895511
>> > >
>> > > which is all fine.  It is getting BADAUTHORIZER talking to another OSD,
>> > > but I'm guessing it's because that other OSD doesn't have the right
>> > > tickets.  It's hard to tell what's wrong without having all the OSD logs
>> > > and being able to see the matching ticket renewals (or lack thereof) on
>> > > the other end of a specific connection.
>> > >
>> > > I missed this before:
>> > >
>> > > WHO      MASK LEVEL    OPTION                             VALUE    RO
>> > > global        advanced auth_client_required               cephx    *
>> > > global        advanced auth_cluster_required              cephx    *
>> > > global        advanced auth_service_required              cephx    *
>> > >
>> > > ^ Note that these three options aren't actually needed.  The only
>> > > non-default value is auth_client_required; its default is 'cephx, none',
>> > > meaning the client-side code will allow itself to connect to a cluster
>> > > with auth disabled.
>> > >
>> > > global        advanced auth_service_ticket_ttl            3600.000000
>> > > global        advanced cephx_sign_messages                false
>> > >
>> > > The fact that the tickets aren't renewing properly suggests that there's
>> > > something broken when the ttl option is modified (although it's not clear
>> > > what yet).  Similarly, disabling message signing isn't common either and
>> > > isn't covered by the automated test suite.  If you're just trying to get
>> > > the cluster up I'd remove one or both of those two options and restart all
>> > > daemons.
>> > >
>> > > Looking closer, I see lots of these:
>> > >
>> > > 2019-04-03 15:39:56.435 7f374a1fb700 10 _calc_signature seq 46 front_crc_ = 1548381741 middle_crc = 0 data_crc = 0 sig = 7970349432988260882
>> > > 2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46 Sender did not set CEPH_MSG_FOOTER_SIGNED.
>> > > 2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46 Message signature does not match contents.
>> > > 2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46Signature on message:
>> > > 2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46    sig: 0
>> > > 2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46Locally calculated signature:
>> > > 2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46    sig_check:7970349432988260882
>> > > 2019-04-03 15:39:56.435 7f374a1fb700  0 Signature failed.
>> > > 2019-04-03 15:39:56.435 7f374a1fb700  0 --1- v1:10.36.9.48:6817/47343 >> v1:10.36.9.37:6849/12739 conn(0x58ef0400 0x3c2fd800 :6817 s=READ_FOOTER_AND_DISPATCH pgs=3071352 cs=412808 l=0).handle_message_footer Signature check failed
>> > >
>> > > I'm not sure which peer it's talking to here, but it looks like the
>> > > cephx_sign_messages setting isn't being uniformly applied.  Try removing
>> > > that option!
>> > >
>> > > (Out of curiosity, why was it disabled?)
>> > >
>> > > sage
>> > >
>> > >
>> > >
>> > >
>> > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
>> > > > ceph-post-file: 789769c4-e7e4-47d3-8fb7-475ea4cfe14a
>> > > >
>> > > > This should have the information you need.
>> > > >
>> > > > On Wed, Apr 3, 2019 at 5:49 PM Sage Weil <s...@newdream.net> wrote:
>> > > >
>> > > > > This OSD also appears on the accepting end of things, and probably
>> > > > > has newer keys than the OSD connecting (tho it's hard to tell
>> > > > > because the debug level isn't turned up).
>> > > > >
>> > > > > The goal is to find an osd still throwing BADAUTHORIZER messages and
>> > > > > then turn on debug_auth=20 and debug_monc=20.  Specifically I'm looking
>> > > > > for the output from MonClient::_check_auth_rotating(), which should tell
>> > > > > us whether the OSD thinks its rotating keys are up to date, or whether it
>> > > > > is having trouble renewing them, or what.  It's weird that some OSDs are
>> > > > > using hours-old keys to try to authenticate.  :/
>> > > > >
>> > > > > sage
>> > > > >
>> > > > >
>> > > > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
>> > > > >
>> > > > > > ceph-post-file: 60e07a0c-ee5b-4174-9f51-fa091d5662dc
>> > > > > >
>> > > > > > On Wed, Apr 3, 2019 at 5:30 PM Shawn Edwards <lesser.e...@gmail.com> wrote:
>> > > > > >
>> > > > > > > According to ceph versions, all bits are running 14.2.0
>> > > > > > >
>> > > > > > > I have restarted all of the OSDs at least twice and am still getting the
>> > > > > > > same error.
>> > > > > > >
>> > > > > > > I'll send a log file with confirmed interesting bad behavior shortly
>> > > > > > >
>> > > > > > > On Wed, Apr 3, 2019, 17:17 Sage Weil <s...@newdream.net> wrote:
>> > > > > > >
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 --1- v1:10.36.9.46:6813/5003637 >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 174
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 20 AuthRegistry(0xcd64a40) get_handler peer_type 4 method 2 cluster_methods [2] service_methods [2] client_methods [2]
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 cephx: verify_authorizer decrypted service osd secret_id=41686
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700  0 auth: could not find secret_id=41686
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: dump_rotating:
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth:  id 41691 ... expires 2019-04-03 14:43:07.042860
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth:  id 41692 ... expires 2019-04-03 15:43:09.895511
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth:  id 41693 ... expires 2019-04-03 16:43:09.895511
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=41686
>> > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700  0 --1- v1:10.36.9.46:6813/5003637 >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got bad authorizer, auth_reply_len=0
>> > > > > > >>
>> > > > > > >> For some reason this OSD has much newer rotating keys than the
>> > > > > > >> connecting OSD.  But earlier in the day, this osd was the one
>> > > > > > >> getting BADAUTHORIZER, so maybe that shifted.  Can you find an OSD
>> > > > > > >> where you still see BADAUTHORIZER appearing in the log?
>> > > > > > >>
>> > > > > > >> My guess is that if you restart the OSDs, they'll get fresh rotating
>> > > > > > >> keys and things will be fine.  But that doesn't explain why they're not
>> > > > > > >> renewing on their own right now.. that I'm not so sure about.
>> > > > > > >>
>> > > > > > >> Are your mons all running nautilus?  Does 'ceph versions' show
>> > > > > > >> everything has upgraded?
>> > > > > > >>
>> > > > > > >> sage
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> On Wed, 3 Apr 2019, Shawn Edwards wrote:
>> > > > > > >>
>> > > > > > >> > File uploaded: f1a2bfb3-92b4-495c-8706-f99cb228efc7
>> > > > > > >> >
>> > > > > > >> > On Wed, Apr 3, 2019 at 4:57 PM Sage Weil <s...@newdream.net> wrote:
>> > > > > > >> >
>> > > > > > >> > > Hmm, that doesn't help.
>> > > > > > >> > >
>> > > > > > >> > > Can you set
>> > > > > > >> > >
>> > > > > > >> > >  ceph config set osd debug_ms 20
>> > > > > > >> > >  ceph config set osd debug_auth 20
>> > > > > > >> > >  ceph config set osd debug_monc 20
>> > > > > > >> > >
>> > > > > > >> > > for a few minutes and ceph-post-file the osd logs?  (Or send a
>> > > > > > >> > > private email with a link or something.)
>> > > > > > >> > >
>> > > > > > >> > > Thanks!
>> > > > > > >> > > sage
>> > > > > > >> > >
>> > > > > > >> > >
>> > > > > > >> > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
>> > > > > > >> > >
>> > > > > > >> > > > No strange auth config:
>> > > > > > >> > > >
>> > > > > > >> > > > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
>> > > > > > >> > > > global        advanced auth_client_required               cephx    *
>> > > > > > >> > > > global        advanced auth_cluster_required              cephx    *
>> > > > > > >> > > > global        advanced auth_service_required              cephx    *
>> > > > > > >> > > >
>> > > > > > >> > > > All boxes are using 'minimal' ceph.conf files and centralized
>> > > > > > >> > > > config.
>> > > > > > >> > > >
>> > > > > > >> > > > If you need the full config, it's here:
>> > > > > > >> > > >
>> > > > > > >> > > > https://gist.github.com/lesserevil/3b82d37e517f4561ce53c81629717aae
>> > > > > > >> > > >
>> > > > > > >> > > > On Wed, Apr 3, 2019 at 4:07 PM Sage Weil <s...@newdream.net> wrote:
>> > > > > > >> > > >
>> > > > > > >> > > > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
>> > > > > > >> > > > > > Recent nautilus upgrade from mimic.  No issues on mimic.
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > Now getting this or similar in all osd logs.  There is very little osd
>> > > > > > >> > > > > > communication, and most of the PGs are either 'down' or 'unknown', even
>> > > > > > >> > > > > > though I can see the data on the filestores.
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > 2019-04-03 13:47:55.280 7f13346e3700  0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.37:6821/8825 conn(0xa7132000 0xa6b28000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
>> > > > > > >> > > > > > 2019-04-03 13:47:55.296 7f1333ee2700  0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.37:6841/11204 conn(0xa9826d00 0xa9b78000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
>> > > > > > >> > > > > > 2019-04-03 13:47:55.340 7f13346e3700  0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.37:6829/8425 conn(0xa7997180 0xaeb22800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
>> > > > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700  0 auth: could not find secret_id=41687
>> > > > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=41687
>> > > > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700  0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.48:6805/49547 conn(0xe02f24480 0xe088cb800 :6803 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got bad authorizer, auth_reply_len=0
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > Thoughts?  I have confirmed that all ceph boxes have good time sync.
>> > > > > > >> > > > >
>> > > > > > >> > > > > Do you have any non-default auth-related settings in ceph.conf?
>> > > > > > >> > > > >
>> > > > > > >> > > > > sage
>> > > > > > >> > > > >
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > > > --
>> > > > > > >> > > >  Shawn Edwards
>> > > > > > >> > > >  Beware programmers with screwdrivers.  They tend to spill them on their
>> > > > > > >> > > > keyboards.
>> > > > > > >> > > >
>> > > > > > >> > >
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >>
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > >
>> >
>> >
>> >
>>
>
>
>
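
For future reference, a rough Python sketch of how the doubled
auth_service_required row in the 'ceph config dump' output quoted above
could be spotted automatically.  It assumes the dump text has already been
captured to a string and that the MASK column is empty, as it is in this
thread.  The name find_conflicting_options is made up for illustration and
is not part of any ceph tooling.

```python
from collections import defaultdict


def find_conflicting_options(dump_text):
    """Return {(who, option): [values]} for options listed more than once.

    Assumes rows look like 'WHO LEVEL OPTION VALUE [*]' with an empty MASK
    column.  Rows with a MASK value would shift the fields and are not
    handled by this sketch.
    """
    seen = defaultdict(list)
    for line in dump_text.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and a wrapped RO marker ('*') that
        # ended up on a line by itself.
        if len(fields) < 4 or fields[0] == "WHO":
            continue
        who, _level, option, value = fields[:4]
        seen[(who, option)].append(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


if __name__ == "__main__":
    sample = (
        "global        advanced auth_client_required               cephx    *\n"
        "global        advanced auth_service_required              none     *\n"
        "global        advanced auth_service_required              cephx    *\n"
    )
    for (who, option), values in find_conflicting_options(sample).items():
        print(who, option, values)
```

Any option reported here is stored under more than one key, which is the
symptom seen after 'ceph config set global auth_service_required none'.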

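Along the same lines, a small sketch for the rotating-key mismatch in the
quoted OSD logs: it collects the secret_id values the accepting OSD could
not find and the ids listed under dump_rotating, then reports requested ids
that are older than anything still held.  It assumes unwrapped log lines in
the format shown in this thread at debug_auth=20.  The function names are
hypothetical, and comparing ids numerically is only an assumption based on
the ids increasing over time in these logs.

```python
import re
from datetime import datetime

# Patterns follow the log lines quoted in this thread; other ceph versions
# may format these messages differently.
MISSING = re.compile(r"could not find secret_id=(\d+)")
HELD = re.compile(
    r"auth:\s+id (\d+) \S+ expires "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)


def summarize_rotating_keys(log_text):
    """Return (missing_ids, held) where held maps key id -> expiry time."""
    missing, held = set(), {}
    for line in log_text.splitlines():
        match = MISSING.search(line)
        if match:
            missing.add(int(match.group(1)))
            continue
        match = HELD.search(line)
        if match:
            held[int(match.group(1))] = datetime.strptime(
                match.group(2), "%Y-%m-%d %H:%M:%S.%f"
            )
    return missing, held


def ids_behind(missing, held):
    """Requested ids that are older than every key the acceptor holds."""
    if not held:
        return set()
    oldest_held = min(held)
    return {secret_id for secret_id in missing if secret_id < oldest_held}
```

If ids_behind() returns anything, the connecting peer is presenting a
ticket built on a key that the accepting OSD has already rotated out.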

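And for the 'ceph config-key dump | grep config/' check suggested above, a
sketch that groups the stored keys by option name so differently formed
names for the same option stand out.  It assumes the dump is a flat JSON
object of key/value strings.  The "config/global/..." form in the example
is only the guess from the thread, not a confirmed key layout, and
group_auth_keys is a hypothetical helper name.

```python
import json


def group_auth_keys(config_key_dump_json):
    """Group config/ keys for auth_* and cephx_* options by option name.

    Returns {option: {full_key: value}}.  An option with more than one
    full_key is stored under several names.
    """
    grouped = {}
    for key, value in json.loads(config_key_dump_json).items():
        if not key.startswith("config/"):
            continue
        # The option name is taken to be the last path component.
        option = key.rsplit("/", 1)[-1]
        if option.startswith(("auth_", "cephx_")):
            grouped.setdefault(option, {})[key] = value
    return grouped


if __name__ == "__main__":
    # Hypothetical dump contents, for illustration only.
    example = json.dumps({
        "config/auth_service_required": "cephx",
        "config/global/auth_service_required": "none",
        "config/auth_service_ticket_ttl": "3600.000000",
    })
    for option, keys in group_auth_keys(example).items():
        if len(keys) > 1:
            print(option, "is stored under", sorted(keys))
```
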
-- 
 Shawn Edwards
 Beware programmers with screwdrivers.  They tend to spill them on their
keyboards.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
