On Tue, Sep 12, 2017 at 3:12 PM, Katie Holly <ho...@fuslvz.ws> wrote:
> Ben and Brad,
>
> big thanks to both of you for helping me track down this issue which -
> seemingly - was caused by more than one radosgw instance sharing the
> exact same --name value and solved by generating unique keys and --name
> values for each single radosgw instance.
>
> Right now, all ceph-mgr daemons seem to run perfectly stable, but I'll
> definitely keep a close eye on the cluster and report back if I see any
> other issues.
>
> I updated the tracker to include this information as well so developers
> can hopefully fix this nasty bug or at least include a warning somewhere
> that one shouldn't run a setup like this.
>
> http://tracker.ceph.com/issues/21197#note-4
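For reference, a minimal sketch of the per-instance identity Katie describes
above; the client names, capabilities and keyring paths are illustrative, not
taken from the thread:

    # Hypothetical example: create one cephx user and keyring per radosgw
    # container instead of all containers sharing client.rgw.docker.
    for i in $(seq 1 15); do
        ceph auth get-or-create client.rgw.docker-$i \
            mon 'allow rw' osd 'allow rwx' \
            -o /etc/ceph/ceph.client.rgw.docker-$i.keyring
    done

    # Each container then starts radosgw with its own identity, e.g.
    #   --name=client.rgw.docker-$i
    #   --keyring=/etc/ceph/ceph.client.rgw.docker-$i.keyring

The important point from the thread is simply that no two running radosgw
processes report under the same client name.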
Thanks for letting us know the result Katie. I'm sure this issue will
receive some love in the not too distant future :)

> --
> Katie
>
> On 2017-09-12 06:20, Katie Holly wrote:
>> They all share the exact same exec arguments, so yes, they all have the
>> same --name as well. I'll try to run them with different --name
>> parameters to see if that solves the issue.
>>
>> --
>> Katie
>>
>> On 2017-09-12 06:13, Ben Hines wrote:
>>> Do the docker containers all have the same rgw --name ? Maybe that is
>>> confusing ceph...
>>>
>>> On Mon, Sep 11, 2017 at 9:11 PM, Katie Holly <ho...@fuslvz.ws> wrote:
>>>
>>> All radosgw instances are running
>>> > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>> as Docker containers; there are 15 of them at any given time.
>>>
>>> The "config"/exec-args for the radosgw instances are:
>>>
>>> /usr/bin/radosgw \
>>>   -d \
>>>   --cluster=ceph \
>>>   --conf=/dev/null \
>>>   --debug-ms=0 \
>>>   --debug-rgw=0/0 \
>>>   --keyring=/etc/ceph/ceph.client.rgw.docker.keyring \
>>>   --logfile=/dev/null \
>>>   --mon-host=mon.ceph.fks.de.fvz.io \
>>>   --name=client.rgw.docker \
>>>   --rgw-content-length-compat=true \
>>>   --rgw-dns-name=de-fks-1.rgw.li \
>>>   --rgw-region=eu \
>>>   --rgw-zone=eu-de-fks-1 \
>>>   --setgroup=ceph \
>>>   --setuser=ceph
>>>
>>> Scaling this Docker radosgw cluster down to just 1 instance seems to
>>> allow ceph-mgr to run without issues, but as soon as I increase the
>>> number of radosgw instances, the risk of ceph-mgr crashing at any
>>> random time also increases.
>>>
>>> It seems that 2 radosgw instances are also fine; anything higher than
>>> that causes issues. Maybe a race condition?
>>>
>>> --
>>> Katie
>>>
>>> On 2017-09-12 05:24, Brad Hubbard wrote:
>>> > It seems like it's choking on the report from the rados gateway.
>>> > What version is the rgw node running?
>>> >
>>> > If possible, could you shut down the rgw and see if you can then
>>> > start ceph-mgr?
>>> >
>>> > Pure stab in the dark just to see if the problem is tied to the rgw
>>> > instance.
>>> >
>>> > On Tue, Sep 12, 2017 at 1:07 PM, Katie Holly <ho...@fuslvz.ws> wrote:
>>> >> Thanks, I totally forgot to check the tracker. I added the
>>> >> information I collected there, but I don't have enough experience
>>> >> with ceph to dig through this myself, so let's see if someone is
>>> >> willing to sacrifice their free time to help debug this issue.
>>> >>
>>> >> --
>>> >> Katie
>>> >>
>>> >> On 2017-09-12 03:15, Brad Hubbard wrote:
>>> >>> Looks like there is a tracker opened for this.
>>> >>>
>>> >>> http://tracker.ceph.com/issues/21197
>>> >>>
>>> >>> Please add your details there.
>>> >>>
>>> >>> On Tue, Sep 12, 2017 at 11:04 AM, Katie Holly <ho...@fuslvz.ws> wrote:
>>> >>>> Hi,
>>> >>>>
>>> >>>> I recently upgraded one of our clusters from Kraken to Luminous
>>> >>>> (the cluster was initialized with Jewel) on Ubuntu 16.04 and
>>> >>>> deployed ceph-mgr on all of our ceph-mon nodes with ceph-deploy.
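For context, deploying ceph-mgr onto existing mon nodes with ceph-deploy as
described above is typically a single command along these lines (the hostnames
here are just placeholders):

    ceph-deploy mgr create mon1 mon2 mon3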
>>> >>>> Related log entries after initial deployment of ceph-mgr:
>>> >>>>
>>> >>>> 2017-09-11 06:41:53.535025 7fb5aa7b8500  0 set uid:gid to 64045:64045 (ceph:ceph)
>>> >>>> 2017-09-11 06:41:53.535048 7fb5aa7b8500  0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 17031
>>> >>>> 2017-09-11 06:41:53.536853 7fb5aa7b8500  0 pidfile_write: ignore empty --pid-file
>>> >>>> 2017-09-11 06:41:53.541880 7fb5aa7b8500  1 mgr send_beacon standby
>>> >>>> 2017-09-11 06:41:54.547383 7fb5a1aec700  1 mgr handle_mgr_map Activating!
>>> >>>> 2017-09-11 06:41:54.547575 7fb5a1aec700  1 mgr handle_mgr_map I am now activating
>>> >>>> 2017-09-11 06:41:54.650677 7fb59dae4700  1 mgr start Creating threads for 0 modules
>>> >>>> 2017-09-11 06:41:54.650696 7fb59dae4700  1 mgr send_beacon active
>>> >>>> 2017-09-11 06:41:55.542252 7fb59eae6700  1 mgr send_beacon active
>>> >>>> 2017-09-11 06:41:55.542627 7fb59eae6700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
>>> >>>> 2017-09-11 06:41:57.542697 7fb59eae6700  1 mgr send_beacon active
>>> >>>> [... lots of "send_beacon active" messages ...]
>>> >>>> 2017-09-11 07:29:29.640892 7fb59eae6700  1 mgr send_beacon active
>>> >>>> 2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) **
>>> >>>> in thread 7fb59d2e3700 thread_name:ms_dispatch
>>> >>>>
>>> >>>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>> >>>> 1: (()+0x3de6b4) [0x55f6640e16b4]
>>> >>>> 2: (()+0x11390) [0x7fb5a8fef390]
>>> >>>> 3: (gsignal()+0x38) [0x7fb5a7f7f428]
>>> >>>> 4: (abort()+0x16a) [0x7fb5a7f8102a]
>>> >>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d]
>>> >>>> 6: (()+0x8d6b6) [0x7fb5a88c06b6]
>>> >>>> 7: (()+0x8d701) [0x7fb5a88c0701]
>>> >>>> 8: (()+0x8d919) [0x7fb5a88c0919]
>>> >>>> 9: (()+0x2318ad) [0x55f663f348ad]
>>> >>>> 10: (()+0x3e91bd) [0x55f6640ec1bd]
>>> >>>> 11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651]
>>> >>>> 12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]
>>> >>>> 13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64]
>>> >>>> 14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a]
>>> >>>> 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d]
>>> >>>> 16: (()+0x76ba) [0x7fb5a8fe56ba]
>>> >>>> 17: (clone()+0x6d) [0x7fb5a80513dd]
>>> >>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>> >>>>
>>> >>>> --- begin dump of recent events ---
>>> >>>> [...]
>>> >>>>
>>> >>>> I tried to manually run ceph-mgr with
>>> >>>>> /usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME --setuser ceph --setgroup ceph
>>> >>>> which immediately fails to keep running for longer than a few seconds.
>>> >>>> stdout: http://xor.meo.ws/OyvoZF8v0aWq0D-rOOg2y6u03fp_yzYv.txt
>>> >>>> logs: http://xor.meo.ws/jcMyjabCfFbTcfZ8GOangLdSfSSqJffr.txt
>>> >>>> objdump: http://xor.meo.ws/oxo2q8h_oKAG6q7mARvNKkR_JdYjn89B.txt
>>> >>>>
>>> >>>> Has someone seen such an issue before and knows how to debug or even fix this?
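As a side note, when chasing a crash like this the same foreground command can
be re-run with the mgr and messenger debug levels raised so the log carries
more context before the abort; the debug values below are only suggestions:

    /usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME \
        --setuser ceph --setgroup ceph \
        --debug-mgr=20 --debug-ms=1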
>>> >>>> --
>>> >>>> Katie

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com