Thanks, Greg. I found the limit: it is /proc/sys/kernel/threads-max.
I count the total number of threads using:
# ps -eo nlwp | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }'
97981
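For anyone else who hits this assert: the limit can be raised on the fly with sysctl and made persistent with a sysctl.d drop-in. The value 400000 and the file name 90-threads-max.conf below are only examples for illustration, not settings we have validated; pick numbers that fit your own hosts. Note that kernel.pid_max also bounds the total number of threads (each thread consumes a pid), so it probably needs to be raised together with threads-max:

# sysctl -w kernel.threads-max=400000
# sysctl -w kernel.pid_max=400000
# cat > /etc/sysctl.d/90-threads-max.conf <<EOF
kernel.threads-max = 400000
kernel.pid_max = 400000
EOF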
lin zhou <hnuzhoul...@gmail.com> wrote on Thu, Feb 28, 2019 at 10:33 AM:

> Thanks, Greg. Your reply is always so fast.
>
> I checked these limits on my system.
> # ulimit -a
> core file size          (blocks, -c) unlimited
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 257219
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65535
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 257219
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> # cat /proc/sys/kernel/pid_max
> 196608
> # cat /proc/sys/kernel/threads-max
> 100000
> # cat /proc/sys/fs/file-max
> 6584236
> # sysctl fs.file-nr
> fs.file-nr = 55520  0  6584236
> # sysctl fs.file-max
> fs.file-max = 6584236
>
> I tried to count all open fds:
> # total=0;for pid in `ls /proc/`;do num=`ls /proc/$pid/fd 2>/dev/null|wc -l`;total=$((total+num));done;echo ${total}
> 53727
>
> I checked the open files limit of every osd service; they are all 32768.
> # for pid in `ps aux|grep osd|grep -v grep|awk '{print $2}'`;do cat /proc/$pid/limits |grep open;done
> Max open files            32768                32768                files
> Max open files            32768                32768                files
> Max open files            32768                32768                files
> Max open files            32768                32768                files
>
> # free -g
>              total       used       free     shared    buffers     cached
> Mem:            62         46         16          0          0          5
> -/+ buffers/cache:         41         21
> Swap:            3          0          3
>
> One more observation: so far, 14 osds on five hosts have hit this problem, and all of them are in the same failure domain.
>
> Gregory Farnum <gfar...@redhat.com> wrote on Thu, Feb 28, 2019 at 1:59 AM:
>
>> The OSD tried to create a new thread, and the kernel told it no. You
>> probably need to turn up the limits on threads and/or file descriptors.
>> -Greg
>>
>> On Wed, Feb 27, 2019 at 2:36 AM hnuzhoulin2 <hnuzhoul...@gmail.com> wrote:
>>
>>> Hi, guys
>>>
>>> So far, 10 osd services have exited because of this error.
>>> The error messages are all the same.
>>>
>>> 2019-02-27 17:14:59.757146 7f89925ff700  0 -- 10.191.175.15:6886/192803 >> 10.191.175.49:6833/188731 pipe(0x55ebba819400 sd=741 :6886 s=0 pgs=0 cs=0 l=0 c=0x55ebbb8ba900).accept connect_seq 3912 vs existing 3911 state standby
>>> 2019-02-27 17:15:05.858802 7f89d9856700 -1 common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f89d9856700 time 2019-02-27 17:15:05.806607
>>> common/Thread.cc: 160: FAILED assert(ret == 0)
>>>
>>>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55eb7a849e12]
>>>  2: (Thread::create(char const*, unsigned long)+0xba) [0x55eb7a82c14a]
>>>  3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x55eb7a8203ef]
>>>  4: (Accepter::entry()+0x379) [0x55eb7a8f3ee9]
>>>  5: (()+0x8064) [0x7f89ecf76064]
>>>  6: (clone()+0x6d) [0x7f89eb07762d]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>> --- begin dump of recent events ---
>>> -10000> 2019-02-27 17:14:50.999276 7f893e811700  1 -- 10.191.175.15:0/192803 <== osd.850 10.191.175.46:6837/190855 6953447 ==== osd_ping(ping_reply e17846 stamp 2019-02-27 17:14:50.995043) v3 ==== 2004+0+0 (3980167553 0 0) 0x55eba12b7400 con 0x55eb96ada600
>>>
>>> Detailed logs: https://drive.google.com/file/d/1fZyhTj06CJlcRjmllaPQMNJknI9gAg6J/view
>>>
>>> When I restart these osd services, they seem to work well. But I do not know whether it will happen on the other osds.
>>> And I cannot find any error log on the system except the following dmesg info:
>>>
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool (fusion) empty!
>>> [Wed Jan 30 08:14:11 2019] Couldn't build MFI pass thru cmd
>>> [Wed Jan 30 08:14:11 2019] Couldn't issue MFI pass thru cmd
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool empty!
>>> [Wed Jan 30 08:14:11 2019] megasas: Failed to get a cmd packet
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool empty!
>>> [Wed Jan 30 08:14:11 2019] megasas: Failed to get a cmd packet
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool empty!
>>> [Wed Jan 30 08:14:11 2019] megasas: Failed to get a cmd packet
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool (fusion) empty!
>>> [Wed Jan 30 08:14:11 2019] megasas: Err returned from build_and_issue_cmd
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool (fusion) empty!
>>>
>>> This cluster is only used as an rbd cluster; ceph status is below:
>>> root@cld-osd5-44:~# ceph -s
>>>     cluster 2bec9425-ea5f-4a48-b56a-fe88e126bced
>>>      health HEALTH_WARN
>>>             noout flag(s) set
>>>      monmap e1: 3 mons at {a=10.191.175.249:6789/0,b=10.191.175.250:6789/0,c=10.191.175.251:6789/0}
>>>             election epoch 26, quorum 0,1,2 a,b,c
>>>      osdmap e17856: 1080 osds: 1080 up, 1080 in
>>>             flags noout,sortbitwise,require_jewel_osds
>>>       pgmap v25160475: 90112 pgs, 3 pools, 43911 GB data, 17618 kobjects
>>>             139 TB used, 1579 TB / 1718 TB avail
>>>                90108 active+clean
>>>                    3 active+clean+scrubbing+deep
>>>                    1 active+clean+scrubbing
>>>   client io 107 MB/s rd, 212 MB/s wr, 1621 op/s rd, 7555 op/s wr
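One more check in the same style as the per-osd fd loop quoted above, in case it helps others: listing the thread count of each ceph-osd process shows where the ~98k threads come from. The ps invocation below is just one way to do it (it sorts the osds by number of threads and prints the top ones):

# ps -o nlwp,pid,cmd -C ceph-osd --sort=-nlwp | head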
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com