FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS 7 with the elrepo 4.8.10 kernel. ceph-mgr is currently tearing through CPU and has allocated ~11GB of RAM after a single day of usage. Only the active manager behaves this way, and the growth is linear and reproducible.
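If anyone wants to confirm the growth on their own cluster, here's a rough sketch of how the resident size can be sampled over time (it assumes a single ceph-mgr process on the box and just reads RSS from ps; adjust the interval and log path to taste):

    # log ceph-mgr RSS (KiB) once a minute; a leak shows up as a steadily climbing value
    while true; do
        echo "$(date +%FT%T) $(ps -o rss= -p "$(pidof ceph-mgr)")"
        sleep 60
    done >> /tmp/ceph-mgr-rss.log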
The cluster is mostly idle: 3 mons (4 CPU, 16GB RAM) and 20 OSD heads with 45x8TB OSDs each.

top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
%Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  4836772 avail Mem

  PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2351 ceph   20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27 ceph-mgr
 2302 ceph   20   0  620316 267992 157620 S   2.3  1.6  65:11.50 ceph-mon

On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J <bryan.stillw...@charter.com> wrote:
> John,
>
> This morning I compared the logs against yesterday's and I see a noticeable
> increase in messages like these:
>
> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>
> In a one-minute period yesterday this group of messages showed up 84 times.
> Today the same group showed up 156 times.
>
> Other than that, I did see an increase in these messages from 9 times a
> minute to 14 times a minute:
>
> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104 >> - conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
>
> Let me know if you need anything else.
>
> Bryan
>
> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
> <ceph-users-boun...@lists.ceph.com on behalf of bryan.stillw...@charter.com> wrote:
>
> >On 1/10/17, 5:35 AM, "John Spray" <jsp...@redhat.com> wrote:
> >
> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
> >><bryan.stillw...@charter.com> wrote:
> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
> >>> single-node, two-OSD cluster, and after a while I noticed that the new
> >>> ceph-mgr daemon is frequently using a lot of the CPU:
> >>>
> >>> 17519 ceph  20  0  850044 168104   208 S 102.7  4.3 1278:27 ceph-mgr
> >>>
> >>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
> >>> usage down to < 1%, but after a while it climbs back up to > 100%. Has
> >>> anyone else seen this?
> >>
> >>Definitely worth investigating, could you set "debug mgr = 20" on the
> >>daemon to see if it's obviously spinning in a particular place?
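(Side note in case it's useful to anyone following along: I believe the same debug level can be turned up at runtime through the mgr admin socket instead of editing ceph.conf and restarting. This is only a sketch; the <id> placeholder and the default socket location under /var/run/ceph/ are assumptions about how your ceph-mgr instance is named and configured:)

    # replace <id> with your mgr instance name, e.g. the hostname it runs on
    ceph daemon mgr.<id> config set debug_mgr 20
    ceph daemon mgr.<id> config show | grep debug_mgr   # confirm the new level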
> >
> >I've injected that option into the ceph-mgr process, and now I'm just
> >waiting for it to go out of control again.
> >
> >However, I've noticed quite a few messages like this in the logs already:
> >
> >2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2 cs=1 l=0).fault initiating reconnect
> >2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
> >2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
> >2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> >
> >What's weird about that is that this is a single-node cluster with ceph-mgr,
> >ceph-mon, and the ceph-osd processes all running on the same host, so none
> >of the communication should be leaving the node.
> >
> >Bryan
>

--
- Rob
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com