On Tue, Sep 12, 2017 at 3:12 PM, Katie Holly <ho...@fuslvz.ws> wrote:
> Ben and Brad,
>
> big thanks to both of you for helping me track down this issue which -
> seemingly - was caused by more than one radosgw instance sharing the
> exact same --name value and solved by generating unique keys and --name
> values for each single radosgw instance.
>
> Right now, all ceph-mgr daemons seem to run perfectly stable, but I'll
> definitely keep a close eye on the cluster and report back if I see any
> other issues.
>
> I updated the tracker to include this information as well so developers
> can hopefully fix this nasty bug or at least include a warning somewhere
> that one shouldn't run a setup like this.
>
> http://tracker.ceph.com/issues/21197#note-4
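For reference, a minimal sketch of the per-instance identity Katie describes
above; the client names, capabilities and keyring paths are illustrative, not
taken from the thread:

    # Hypothetical example: create one cephx user and keyring per radosgw
    # container instead of all containers sharing client.rgw.docker.
    for i in $(seq 1 15); do
        ceph auth get-or-create client.rgw.docker-$i \
            mon 'allow rw' osd 'allow rwx' \
            -o /etc/ceph/ceph.client.rgw.docker-$i.keyring
    done

    # Each container then starts radosgw with its own identity, e.g.
    #   --name=client.rgw.docker-$i
    #   --keyring=/etc/ceph/ceph.client.rgw.docker-$i.keyring

The important point from the thread is simply that no two running radosgw
processes report under the same client name.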
Thanks for letting us know the result Katie. I'm sure this issue will
receive some love in the not too distant future :)

> --
> Katie
>
> On 2017-09-12 06:20, Katie Holly wrote:
>> They all share the exact same exec arguments, so yes, they all have the
>> same --name as well. I'll try to run them with different --name
>> parameters to see if that solves the issue.
>>
>> --
>> Katie
>>
>> On 2017-09-12 06:13, Ben Hines wrote:
>>> Do the docker containers all have the same rgw --name ? Maybe that is
>>> confusing ceph...
>>>
>>> On Mon, Sep 11, 2017 at 9:11 PM, Katie Holly <ho...@fuslvz.ws> wrote:
>>>
>>> All radosgw instances are running
>>> > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>> as Docker containers; there are 15 of them at any given time.
>>>
>>> The "config"/exec-args for the radosgw instances are:
>>>
>>> /usr/bin/radosgw \
>>>   -d \
>>>   --cluster=ceph \
>>>   --conf=/dev/null \
>>>   --debug-ms=0 \
>>>   --debug-rgw=0/0 \
>>>   --keyring=/etc/ceph/ceph.client.rgw.docker.keyring \
>>>   --logfile=/dev/null \
>>>   --mon-host=mon.ceph.fks.de.fvz.io \
>>>   --name=client.rgw.docker \
>>>   --rgw-content-length-compat=true \
>>>   --rgw-dns-name=de-fks-1.rgw.li \
>>>   --rgw-region=eu \
>>>   --rgw-zone=eu-de-fks-1 \
>>>   --setgroup=ceph \
>>>   --setuser=ceph
>>>
>>> Scaling this Docker radosgw cluster down to just 1 instance seems to
>>> allow ceph-mgr to run without issues, but as soon as I increase the
>>> number of radosgw instances, the risk of ceph-mgr crashing at any
>>> random time also increases.
>>>
>>> It seems that 2 radosgw instances are also fine; anything higher than
>>> that causes issues. Maybe a race condition?
>>>
>>> --
>>> Katie
>>>
>>> On 2017-09-12 05:24, Brad Hubbard wrote:
>>> > It seems like it's choking on the report from the rados gateway.
>>> > What version is the rgw node running?
>>> >
>>> > If possible, could you shut down the rgw and see if you can then
>>> > start ceph-mgr?
>>> >
>>> > Pure stab in the dark just to see if the problem is tied to the rgw
>>> > instance.
>>> >
>>> > On Tue, Sep 12, 2017 at 1:07 PM, Katie Holly <ho...@fuslvz.ws> wrote:
>>> >> Thanks, I totally forgot to check the tracker. I added the
>>> >> information I collected there, but I don't have enough experience
>>> >> with ceph to dig through this myself, so let's see if someone is
>>> >> willing to sacrifice their free time to help debug this issue.
>>> >>
>>> >> --
>>> >> Katie
>>> >>
>>> >> On 2017-09-12 03:15, Brad Hubbard wrote:
>>> >>> Looks like there is a tracker opened for this.
>>> >>>
>>> >>> http://tracker.ceph.com/issues/21197
>>> >>>
>>> >>> Please add your details there.
>>> >>>
>>> >>> On Tue, Sep 12, 2017 at 11:04 AM, Katie Holly <ho...@fuslvz.ws> wrote:
>>> >>>> Hi,
>>> >>>>
>>> >>>> I recently upgraded one of our clusters from Kraken to Luminous
>>> >>>> (the cluster was initialized with Jewel) on Ubuntu 16.04 and
>>> >>>> deployed ceph-mgr on all of our ceph-mon nodes with ceph-deploy.
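For context, deploying ceph-mgr onto existing mon nodes with ceph-deploy as
described above is typically a single command along these lines (the hostnames
here are just placeholders):

    ceph-deploy mgr create mon1 mon2 mon3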
>>> >>>> Related log entries after initial deployment of ceph-mgr:
>>> >>>>
>>> >>>> 2017-09-11 06:41:53.535025 7fb5aa7b8500  0 set uid:gid to 64045:64045 (ceph:ceph)
>>> >>>> 2017-09-11 06:41:53.535048 7fb5aa7b8500  0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 17031
>>> >>>> 2017-09-11 06:41:53.536853 7fb5aa7b8500  0 pidfile_write: ignore empty --pid-file
>>> >>>> 2017-09-11 06:41:53.541880 7fb5aa7b8500  1 mgr send_beacon standby
>>> >>>> 2017-09-11 06:41:54.547383 7fb5a1aec700  1 mgr handle_mgr_map Activating!
>>> >>>> 2017-09-11 06:41:54.547575 7fb5a1aec700  1 mgr handle_mgr_map I am now activating
>>> >>>> 2017-09-11 06:41:54.650677 7fb59dae4700  1 mgr start Creating threads for 0 modules
>>> >>>> 2017-09-11 06:41:54.650696 7fb59dae4700  1 mgr send_beacon active
>>> >>>> 2017-09-11 06:41:55.542252 7fb59eae6700  1 mgr send_beacon active
>>> >>>> 2017-09-11 06:41:55.542627 7fb59eae6700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
>>> >>>> 2017-09-11 06:41:57.542697 7fb59eae6700  1 mgr send_beacon active
>>> >>>> [... lots of "send_beacon active" messages ...]
>>> >>>> 2017-09-11 07:29:29.640892 7fb59eae6700  1 mgr send_beacon active
>>> >>>> 2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) **
>>> >>>> in thread 7fb59d2e3700 thread_name:ms_dispatch
>>> >>>>
>>> >>>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>> >>>> 1: (()+0x3de6b4) [0x55f6640e16b4]
>>> >>>> 2: (()+0x11390) [0x7fb5a8fef390]
>>> >>>> 3: (gsignal()+0x38) [0x7fb5a7f7f428]
>>> >>>> 4: (abort()+0x16a) [0x7fb5a7f8102a]
>>> >>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d]
>>> >>>> 6: (()+0x8d6b6) [0x7fb5a88c06b6]
>>> >>>> 7: (()+0x8d701) [0x7fb5a88c0701]
>>> >>>> 8: (()+0x8d919) [0x7fb5a88c0919]
>>> >>>> 9: (()+0x2318ad) [0x55f663f348ad]
>>> >>>> 10: (()+0x3e91bd) [0x55f6640ec1bd]
>>> >>>> 11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651]
>>> >>>> 12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]
>>> >>>> 13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64]
>>> >>>> 14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a]
>>> >>>> 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d]
>>> >>>> 16: (()+0x76ba) [0x7fb5a8fe56ba]
>>> >>>> 17: (clone()+0x6d) [0x7fb5a80513dd]
>>> >>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>> >>>>
>>> >>>> --- begin dump of recent events ---
>>> >>>> [...]
>>> >>>>
>>> >>>> I tried to manually run ceph-mgr with
>>> >>>>> /usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME --setuser ceph --setgroup ceph
>>> >>>> which immediately fails to keep running for longer than a few seconds.
>>> >>>> stdout: http://xor.meo.ws/OyvoZF8v0aWq0D-rOOg2y6u03fp_yzYv.txt
>>> >>>> logs: http://xor.meo.ws/jcMyjabCfFbTcfZ8GOangLdSfSSqJffr.txt
>>> >>>> objdump: http://xor.meo.ws/oxo2q8h_oKAG6q7mARvNKkR_JdYjn89B.txt
>>> >>>>
>>> >>>> Has someone seen such an issue before and knows how to debug or even fix this?
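As a side note, when chasing a crash like this the same foreground command can
be re-run with the mgr and messenger debug levels raised so the log carries
more context before the abort; the debug values below are only suggestions:

    /usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME \
        --setuser ceph --setgroup ceph \
        --debug-mgr=20 --debug-ms=1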
>>> >>>> --
>>> >>>> Katie

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com