i just checked several of the osds running in the environment, and the hard
and soft limits for the number of processes are set to 257486. if it's
exceeding that, then it seems like there would still be a bug somewhere. i
can't imagine it needing that many.

$ for N in `pidof ceph-osd`; do echo ${N}; sudo grep processes /proc/${N}/limits; done
8761
Max processes             257486               257486               processes
7744
Max processes             257486               257486               processes
5536
Max processes             257486               257486               processes
4717
Max processes             257486               257486               processes

i did go looking through the ceph init script and didn't find where that
limit was getting set, and there's no reference to setrlimit in the code
either, so i'm not sure how it gets set.

this did lead me to look at how many threads were getting created per
process and how many there were in total on the system. it looks like there
are just over 30k tasks (pids and threads) total on the systems, and since
the default kernel.pid_max is 32768, that's right up against the ceiling.
i just set kernel.pid_max to 64k and will keep an eye on it. it would make
sense that this is the problem. i'm a little surprised to see it get this
close with only 12 osds running. it looks like they're creating over 2500
threads each. i don't know the internals of the code but that seems like a
lot. oh well. hopefully this fixes it.

mike

On Mon, Mar 7, 2016 at 1:55 PM, Gregory Farnum <gfar...@redhat.com> wrote:

> On Mon, Mar 7, 2016 at 11:04 AM, Mike Lovell <mike.lov...@endurance.com>
> wrote:
> > first off, hello all. this is my first time posting to the list.
> >
> > i have seen a recurring problem that started in the past week or so on
> > one of my ceph clusters. osds will crash and it seems to happen whenever
> > backfill or recovery is started. looking at the logs it appears that the
> > osd is asserting in src/common/Thread.cc when it tries to create a new
> > thread. these osds are running 0.94.5 and i believe
> > https://github.com/ceph/ceph/blob/v0.94.5/src/common/Thread.cc#L129 is the
> > assert that is being hit. i looked back through the code for a couple of
> > minutes and it looks like it's asserting on pthread_create returning
> > something besides 0. i'm not sure why pthread_create would be failing, and
> > it looks like it just writes the return code to stderr. i also wasn't
> > able to determine where the output of stderr ended up from my osds. from
> > looking at /proc/<pid>/fd/{0,1,2} and lsof, it looks like stderr is a
> > unix socket, but i don't see where it goes after that. the osds are
> > started by ceph-disk activate.
> >
> > do any of you have any ideas as to what might be causing this? or how i
> > might further troubleshoot this? i'm attaching a trimmed version of the
> > osd log. i removed some extraneous bits from after the osd was restarted
> > and a large amount of 'recent events' that were from well before the
> > crash.
>
> Usually you just need to increase the ulimits for thread/process
> counts, on the ceph user account or on the system as a whole. Check
> the docs and the startup scripts.
> -Greg
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
