The ceph-mon is already taking a lot of memory, so I ran a heap stats:

------------------------------------------------
MALLOC:       32391696 (   30.9 MiB) Bytes in use by application
MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap freelist
MALLOC: +     16598552 (   15.8 MiB) Bytes in central cache freelist
MALLOC: +     14693536 (   14.0 MiB) Bytes in transfer cache freelist
MALLOC: +     17441592 (   16.6 MiB) Bytes in thread cache freelists
MALLOC: +    116387992 (  111.0 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  27794649240 (26507.0 MiB) Actual memory used (physical + swap)
MALLOC: +     26116096 (   24.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
MALLOC:
MALLOC:           5683 Spans in use
MALLOC:             21 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
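In case it helps anyone reproduce this: the dump above and the release mentioned below came from tcmalloc's heap commands on the monitor. A minimal sketch, assuming a mon id of "a" (substitute your own):

    # dump tcmalloc's view of the monitor's heap
    ceph tell mon.a heap stats

    # ask tcmalloc to hand freed pages back to the OS
    ceph tell mon.a heap release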
After that I ran the heap release and it went back to normal:

------------------------------------------------
MALLOC:       22919616 (   21.9 MiB) Bytes in use by application
MALLOC: +      4792320 (    4.6 MiB) Bytes in page heap freelist
MALLOC: +     18743448 (   17.9 MiB) Bytes in central cache freelist
MALLOC: +     20645776 (   19.7 MiB) Bytes in transfer cache freelist
MALLOC: +     18456088 (   17.6 MiB) Bytes in thread cache freelists
MALLOC: +    116387992 (  111.0 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =    201945240 (  192.6 MiB) Actual memory used (physical + swap)
MALLOC: +  27618820096 (26339.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
MALLOC:
MALLOC:           5639 Spans in use
MALLOC:             29 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------

So it seems the monitor is neither returning unused memory to the OS nor reusing already-allocated memory it deems free...

On Wed, Jul 22, 2015 at 4:29 PM, Luis Periquito <periqu...@gmail.com> wrote:
> This cluster is serving RBD storage for OpenStack, and today all the I/O
> just stopped.
> After looking in the boxes, ceph-mon was using 17G of RAM - and this was on
> *all* the mons. Restarting the main one just made it work again (I
> restarted the other ones because they were using a lot of RAM).
> This has happened twice now (the first time was last Monday).
>
> As this is considered a prod cluster there is no logging enabled, and I
> can't reproduce it - our test/dev clusters have been working fine and show
> neither symptom, but they were upgraded from firefly.
> What can we do to help debug the issue? Any ideas on how to identify the
> underlying issue?
>
> thanks,
>
> On Mon, Jul 20, 2015 at 1:59 PM, Luis Periquito <periqu...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I have a cluster with 28 nodes (all physical, 4 cores, 32GB RAM); each
>> node has 4 OSDs, for a total of 112 OSDs. Each OSD has 106 PGs (counted
>> including replication). There are 3 MONs on this cluster.
>> I'm running on Ubuntu trusty with kernel 3.13.0-52-generic, with Hammer
>> (0.94.2).
>>
>> This cluster was installed with Hammer (0.94.1) and has only been
>> upgraded to the latest available version.
>>
>> Of the three mons, one is mostly idle, one is using ~170% CPU, and one is
>> using ~270% CPU. They will change as I restart the process (usually the
>> idle one is the one with the lowest uptime).
>>
>> Running a perf top against the ceph-mon PID on the non-idle boxes
>> yields something like this:
>>
>>   4.62%  libpthread-2.19.so    [.] pthread_mutex_unlock
>>   3.95%  libpthread-2.19.so    [.] pthread_mutex_lock
>>   3.91%  libsoftokn3.so        [.] 0x000000000001db26
>>   2.38%  [kernel]              [k] _raw_spin_lock
>>   2.09%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
>>   1.79%  ceph-mon              [.] DispatchQueue::enqueue(Message*, int, unsigned long)
>>   1.62%  ceph-mon              [.] RefCountedObject::get()
>>   1.58%  libpthread-2.19.so    [.] pthread_mutex_trylock
>>   1.32%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
>>   1.24%  libc-2.19.so          [.] 0x0000000000097fd0
>>   1.20%  ceph-mon              [.] ceph::buffer::ptr::release()
>>   1.18%  ceph-mon              [.] RefCountedObject::put()
>>   1.15%  libfreebl3.so         [.] 0x00000000000542a8
>>   1.05%  [kernel]              [k] update_cfs_shares
>>   1.00%  [kernel]              [k] tcp_sendmsg
>>
>> The cluster is mostly idle, and it's healthy. The store is 69MB big, and
>> the MONs are consuming around 700MB of RAM.
>>
>> Any ideas on this situation? Is it safe to ignore?
>>
>
>
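For anyone wanting to repeat the profiling described in the quoted thread, the invocation is roughly the sketch below; the use of pidof assumes a single ceph-mon process on the host:

    # sample the running monitor's hottest functions (quit with 'q')
    sudo perf top -p "$(pidof ceph-mon)"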