Hi all, after several months of smooth operation, we recently started having stability problems with our Lustre installation. Each of our seven OSD servers crashes after some hours with kernel messages like
  NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [ll_ost_io01_027:24071]

The reported duration varies between 22s and 23s, and the PID after the last colon varies as well (e.g. 32281, 30488, 24071).

Our environment:
- CentOS 7.7 (recent kernels 3.10.0-1062.12.1 or 3.10.0-1062.18.1)
- lustre-2.12.4 on zfs-0.7.13
- single-rail Omni-Path network (mixed MPI and Lustre traffic)
- same behaviour with the in-kernel Omni-Path stack and the Intel stack (10.10.1.0.36)

At the time of these ll_ost_io kernel messages, the Omni-Path interface of the failing OSD can no longer ping (outgoing or incoming).

What I have already done: I reduced ost_io.threads_max stepwise from 132 down to 40 (the server has 32 CPU cores):

  lctl set_param ost.OSS.ost_io.threads_max=40

Then I switched between kernels 3.10.0-1062.12.1 and 3.10.0-1062.18.1, and between the in-kernel and the Intel Omni-Path driver. It is not clear to me whether the failing Lustre takes down the Omni-Path interface of the server, or the other way round. The Intel Omni-Path utilities (opatop, fmgui) do not show problems in the network (or we have not found them).

Other parameters for the OSDs are:

# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0)"
options ptlrpc at_min=40 at_max=400 ldlm_enqueue_min=260

# cat /etc/modprobe.d/hfi1.conf
options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70 rcvhdrcnt=4096 cap_mask=0x4c09a01cbba

# lctl get_param '*.*.*.threads_max timeout *.*.*.timeout'
ldlm.services.ldlm_canceld.threads_max=128
ldlm.services.ldlm_cbd.threads_max=128
ost.OSS.ost.threads_max=132
ost.OSS.ost_create.threads_max=24
ost.OSS.ost_io.threads_max=40
ost.OSS.ost_out.threads_max=24
ost.OSS.ost_seq.threads_max=24
timeout=100
osd-zfs.scratch-OST0000.quota_slave.timeout=50
osd-zfs.scratch-OST0000.quota_slave_dt.timeout=50
... (for all six OSTs)

Any hints?
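To narrow down which side fails first, one diagnostic sketch we could try (addresses and NIDs below are placeholders, not from our setup) is to compare an ICMP ping with an LNet-level ping from a healthy peer while a lockup is in progress, and then summarize which service threads the watchdog blamed:

```shell
# Hedged diagnostic sketch; <ib0-address> is a placeholder.
# From a healthy peer while the lockup is in progress, compare:
#   ping -c 3 <ib0-address>            # ICMP over the fabric
#   lctl ping <ib0-address>@o2ib0      # LNet level
# If the LNet ping fails while ICMP still answers, Lustre/LNet is the
# first victim; if both fail, suspect the hfi1/Omni-Path side.

# Afterwards, summarize which service threads the watchdog blamed.
# A saved sample stands in for real `dmesg` output here:
cat > /tmp/dmesg-sample.txt <<'EOF'
NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [ll_ost_io01_027:24071]
NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [ll_ost_io00_011:30488]
EOF

# Extract the blamed thread names (strip the PID after the colon):
grep -o '\[ll_[^]]*\]' /tmp/dmesg-sample.txt | cut -d: -f1 | tr -d '['
```

If all blamed threads are ll_ost_io ones, that at least points at the OST I/O path rather than, say, the ldlm threads.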
Bernd Melchers
--
Archiv- und Backup-Service | [email protected]
Freie Universität Berlin   |
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
