FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS 7 with the elrepo 4.8.10 kernel. ceph-mgr is currently tearing through CPU and has allocated ~11GB of RAM after a single day of usage. Only the active manager behaves this way, and the growth is linear and reproducible.
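If anyone wants to confirm the growth on their own cluster, here's a rough sketch of how the resident size can be sampled over time (it assumes a single ceph-mgr process on the box and just reads RSS from ps; adjust the interval and log path to taste):

    # log ceph-mgr RSS (KiB) once a minute; a leak shows up as a steadily climbing value
    while true; do
        echo "$(date +%FT%T) $(ps -o rss= -p "$(pidof ceph-mgr)")"
        sleep 60
    done >> /tmp/ceph-mgr-rss.log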
The cluster is mostly idle: 3 mons (4 CPU, 16GB RAM) and 20 OSD heads with 45x8TB OSDs each.

top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
%Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  4836772 avail Mem

  PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2351 ceph   20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27 ceph-mgr
 2302 ceph   20   0  620316 267992 157620 S   2.3  1.6  65:11.50 ceph-mon

On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J <bryan.stillw...@charter.com> wrote:
> John,
>
> This morning I compared the logs against yesterday's and I see a noticeable
> increase in messages like these:
>
> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>
> In a one-minute period yesterday this group of messages showed up 84 times.
> Today the same group showed up 156 times.
>
> Other than that, I did see an increase in these messages from 9 times a
> minute to 14 times a minute:
>
> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104 >> - conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
>
> Let me know if you need anything else.
>
> Bryan
>
> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
> <ceph-users-boun...@lists.ceph.com on behalf of bryan.stillw...@charter.com> wrote:
>
> >On 1/10/17, 5:35 AM, "John Spray" <jsp...@redhat.com> wrote:
> >
> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
> >><bryan.stillw...@charter.com> wrote:
> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
> >>> single-node, two-OSD cluster, and after a while I noticed that the new
> >>> ceph-mgr daemon is frequently using a lot of the CPU:
> >>>
> >>> 17519 ceph  20  0  850044 168104   208 S 102.7  4.3 1278:27 ceph-mgr
> >>>
> >>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
> >>> usage down to < 1%, but after a while it climbs back up to > 100%. Has
> >>> anyone else seen this?
> >>
> >>Definitely worth investigating, could you set "debug mgr = 20" on the
> >>daemon to see if it's obviously spinning in a particular place?
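(Side note in case it's useful to anyone following along: I believe the same debug level can be turned up at runtime through the mgr admin socket instead of editing ceph.conf and restarting. This is only a sketch; the <id> placeholder and the default socket location under /var/run/ceph/ are assumptions about how your ceph-mgr instance is named and configured:)

    # replace <id> with your mgr instance name, e.g. the hostname it runs on
    ceph daemon mgr.<id> config set debug_mgr 20
    ceph daemon mgr.<id> config show | grep debug_mgr   # confirm the new level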
> >
> >I've injected that option into the ceph-mgr process, and now I'm just
> >waiting for it to go out of control again.
> >
> >However, I've noticed quite a few messages like this in the logs already:
> >
> >2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2 cs=1 l=0).fault initiating reconnect
> >2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
> >2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
> >2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> >
> >What's weird about that is that this is a single-node cluster with ceph-mgr,
> >ceph-mon, and the ceph-osd processes all running on the same host, so none
> >of the communication should be leaving the node.
> >
> >Bryan
>

--
- Rob
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com