Thanks, Greg. I found the limit: it is /proc/sys/kernel/threads-max.
I count the total number of threads using:
# ps -eo nlwp | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }'
97981
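For anyone else who hits this assert: the limit can be raised on the fly with sysctl and made persistent with a sysctl.d drop-in. The value 400000 and the file name 90-threads-max.conf below are only examples for illustration, not settings we have validated; pick numbers that fit your own hosts. Note that kernel.pid_max also bounds the total number of threads (each thread consumes a pid), so it probably needs to be raised together with threads-max:

# sysctl -w kernel.threads-max=400000
# sysctl -w kernel.pid_max=400000
# cat > /etc/sysctl.d/90-threads-max.conf <<EOF
kernel.threads-max = 400000
kernel.pid_max = 400000
EOF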
lin zhou <hnuzhoul...@gmail.com> wrote on Thu, Feb 28, 2019 at 10:33 AM:

> Thanks, Greg. Your reply is always so fast.
>
> I checked these limits on my system.
> # ulimit -a
> core file size          (blocks, -c) unlimited
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 257219
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65535
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 257219
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> # cat /proc/sys/kernel/pid_max
> 196608
> # cat /proc/sys/kernel/threads-max
> 100000
> # cat /proc/sys/fs/file-max
> 6584236
> # sysctl fs.file-nr
> fs.file-nr = 55520  0  6584236
> # sysctl fs.file-max
> fs.file-max = 6584236
>
> I tried to count all open fds:
> # total=0;for pid in `ls /proc/`;do num=`ls /proc/$pid/fd 2>/dev/null|wc -l`;total=$((total+num));done;echo ${total}
> 53727
>
> I checked the open files limit of every osd service; they are all 32768.
> # for pid in `ps aux|grep osd|grep -v grep|awk '{print $2}'`;do cat /proc/$pid/limits |grep open;done
> Max open files            32768                32768                files
> Max open files            32768                32768                files
> Max open files            32768                32768                files
> Max open files            32768                32768                files
>
> # free -g
>              total       used       free     shared    buffers     cached
> Mem:            62         46         16          0          0          5
> -/+ buffers/cache:         41         21
> Swap:            3          0          3
>
> One more observation: so far, 14 osds on five hosts have hit this problem, and all of them are in the same failure domain.
>
> Gregory Farnum <gfar...@redhat.com> wrote on Thu, Feb 28, 2019 at 1:59 AM:
>
>> The OSD tried to create a new thread, and the kernel told it no. You
>> probably need to turn up the limits on threads and/or file descriptors.
>> -Greg
>>
>> On Wed, Feb 27, 2019 at 2:36 AM hnuzhoulin2 <hnuzhoul...@gmail.com> wrote:
>>
>>> Hi, guys
>>>
>>> So far, 10 osd services have exited because of this error.
>>> The error messages are all the same.
>>>
>>> 2019-02-27 17:14:59.757146 7f89925ff700  0 -- 10.191.175.15:6886/192803 >> 10.191.175.49:6833/188731 pipe(0x55ebba819400 sd=741 :6886 s=0 pgs=0 cs=0 l=0 c=0x55ebbb8ba900).accept connect_seq 3912 vs existing 3911 state standby
>>> 2019-02-27 17:15:05.858802 7f89d9856700 -1 common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f89d9856700 time 2019-02-27 17:15:05.806607
>>> common/Thread.cc: 160: FAILED assert(ret == 0)
>>>
>>>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55eb7a849e12]
>>>  2: (Thread::create(char const*, unsigned long)+0xba) [0x55eb7a82c14a]
>>>  3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x55eb7a8203ef]
>>>  4: (Accepter::entry()+0x379) [0x55eb7a8f3ee9]
>>>  5: (()+0x8064) [0x7f89ecf76064]
>>>  6: (clone()+0x6d) [0x7f89eb07762d]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>> --- begin dump of recent events ---
>>> -10000> 2019-02-27 17:14:50.999276 7f893e811700  1 -- 10.191.175.15:0/192803 <== osd.850 10.191.175.46:6837/190855 6953447 ==== osd_ping(ping_reply e17846 stamp 2019-02-27 17:14:50.995043) v3 ==== 2004+0+0 (3980167553 0 0) 0x55eba12b7400 con 0x55eb96ada600
>>>
>>> Detailed logs: https://drive.google.com/file/d/1fZyhTj06CJlcRjmllaPQMNJknI9gAg6J/view
>>>
>>> When I restart these osd services, they seem to work well. But I do not know whether it will happen on the other osds.
>>> And I cannot find any error log on the system except the following dmesg info:
>>>
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool (fusion) empty!
>>> [Wed Jan 30 08:14:11 2019] Couldn't build MFI pass thru cmd
>>> [Wed Jan 30 08:14:11 2019] Couldn't issue MFI pass thru cmd
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool empty!
>>> [Wed Jan 30 08:14:11 2019] megasas: Failed to get a cmd packet
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool empty!
>>> [Wed Jan 30 08:14:11 2019] megasas: Failed to get a cmd packet
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool empty!
>>> [Wed Jan 30 08:14:11 2019] megasas: Failed to get a cmd packet
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool (fusion) empty!
>>> [Wed Jan 30 08:14:11 2019] megasas: Err returned from build_and_issue_cmd
>>> [Wed Jan 30 08:14:11 2019] megasas: Command pool (fusion) empty!
>>>
>>> This cluster is only used as an rbd cluster; ceph status is below:
>>> root@cld-osd5-44:~# ceph -s
>>>     cluster 2bec9425-ea5f-4a48-b56a-fe88e126bced
>>>      health HEALTH_WARN
>>>             noout flag(s) set
>>>      monmap e1: 3 mons at {a=10.191.175.249:6789/0,b=10.191.175.250:6789/0,c=10.191.175.251:6789/0}
>>>             election epoch 26, quorum 0,1,2 a,b,c
>>>      osdmap e17856: 1080 osds: 1080 up, 1080 in
>>>             flags noout,sortbitwise,require_jewel_osds
>>>       pgmap v25160475: 90112 pgs, 3 pools, 43911 GB data, 17618 kobjects
>>>             139 TB used, 1579 TB / 1718 TB avail
>>>                90108 active+clean
>>>                    3 active+clean+scrubbing+deep
>>>                    1 active+clean+scrubbing
>>>   client io 107 MB/s rd, 212 MB/s wr, 1621 op/s rd, 7555 op/s wr
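One more check in the same style as the per-osd fd loop quoted above, in case it helps others: listing the thread count of each ceph-osd process shows where the ~98k threads come from. The ps invocation below is just one way to do it (it sorts the osds by number of threads and prints the top ones):

# ps -o nlwp,pid,cmd -C ceph-osd --sort=-nlwp | head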
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com