This cluster is serving RBD storage for OpenStack, and today all I/O
just stopped.
After looking on the boxes, ceph-mon was using 17 GB of RAM - and that was on
*all* the mons. Restarting the main one made everything work again (I
restarted the other ones as well because they were using so much RAM).
This has now happened twice (the first time was last Monday).

As this is considered a prod cluster there is no debug logging enabled, and I
can't reproduce the problem - our test/dev clusters have been working fine and
show no such symptoms, though they were upgraded from Firefly rather than
installed fresh.
What can we do to help debug this? Any ideas on how to identify the
underlying issue?
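For what it's worth, next time it happens I was thinking of capturing something like the following before restarting anything (a rough sketch - it assumes the mons were built against tcmalloc and that the admin socket is reachable; `mon.a` is a placeholder for the actual mon ID):

```
# Bump mon debug logging at runtime (revert to debug_mon 1 afterwards)
ceph daemon mon.a config set debug_mon 10
ceph daemon mon.a config set debug_ms 1

# tcmalloc heap statistics from the running mon
ceph tell mon.a heap stats

# Optionally run the heap profiler for a while to see where the memory goes
ceph tell mon.a heap start_profiler
ceph tell mon.a heap dump
ceph tell mon.a heap stop_profiler

# Snapshot the internal perf counters for comparison over time
ceph daemon mon.a perf dump
```

If anyone has a better set of things to collect while the mons are in that state, I'm happy to run them.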

thanks,

On Mon, Jul 20, 2015 at 1:59 PM, Luis Periquito <periqu...@gmail.com> wrote:

> Hi all,
>
> I have a cluster with 28 nodes (all physical, 4 cores, 32 GB RAM), each node
> has 4 OSDs for a total of 112 OSDs. Each OSD has 106 PGs (counted including
> replication). There are 3 MONs on this cluster.
> I'm running on Ubuntu trusty with kernel 3.13.0-52-generic, with Hammer
> (0.94.2).
>
> This cluster was installed with Hammer (0.94.1) and has only been upgraded
> to the latest available version.
>
> On the three mons one is mostly idle, one is using ~170% CPU, and one is
> using ~270% CPU. They will change as I restart the process (usually the
> idle one is the one with the lowest uptime).
>
> Running perf top against the ceph-mon PID on the non-idle boxes yields
> something like this:
>
>   4.62%  libpthread-2.19.so    [.] pthread_mutex_unlock
>   3.95%  libpthread-2.19.so    [.] pthread_mutex_lock
>   3.91%  libsoftokn3.so        [.] 0x000000000001db26
>   2.38%  [kernel]              [k] _raw_spin_lock
>   2.09%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
>   1.79%  ceph-mon              [.] DispatchQueue::enqueue(Message*, int,
> unsigned long)
>   1.62%  ceph-mon              [.] RefCountedObject::get()
>   1.58%  libpthread-2.19.so    [.] pthread_mutex_trylock
>   1.32%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
>   1.24%  libc-2.19.so          [.] 0x0000000000097fd0
>   1.20%  ceph-mon              [.] ceph::buffer::ptr::release()
>   1.18%  ceph-mon              [.] RefCountedObject::put()
>   1.15%  libfreebl3.so         [.] 0x00000000000542a8
>   1.05%  [kernel]              [k] update_cfs_shares
>   1.00%  [kernel]              [k] tcp_sendmsg
>
> The cluster is mostly idle, and it's healthy. The store is 69 MB, and
> the MONs are consuming around 700 MB of RAM each.
>
> Any ideas on this situation? Is it safe to ignore?
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com