Hi All,
This is not a Lustre problem proper, but others might run into it with a 64-bit
Lustre client on RHEL 7, and I hope to save others the time it took us to nail
it down. We saw it on a node running the "Starfish" policy engine, which reads
through the entire file system tree repeatedly and consumes changelogs.
Starfish itself creates and destroys processes frequently, and the workload
causes Lustre to create and destroy kernel threads as well, for statahead and
changelog processing.
For the impatient, the fix was to increase pid_max. We used:
kernel.pid_max=524288
The symptoms are:
1) console log messages like
LustreError: 10525:0:(statahead.c:970:ll_start_agl()) can't start ll_agl
thread, rc: -12
LustreError: 15881:0:(statahead.c:1614:start_statahead_thread()) can't start
ll_sa thread, rc: -12
LustreError: 15881:0:(statahead.c:1614:start_statahead_thread()) Skipped 45
previous similar messages
LustreError: 15878:0:(statahead.c:1614:start_statahead_thread()) can't start
ll_sa thread, rc: -12
LustreError: 15878:0:(statahead.c:1614:start_statahead_thread()) Skipped 17
previous similar messages
Note the return codes are -12, which is -ENOMEM.
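As a quick sanity check on the mapping, Python's errno module will translate the code for you (a minimal sketch of my own, not from the original report; the message wording assumes Linux/glibc):

```python
import errno
import os

# Lustre reports negative errno values; -12 is -ENOMEM.
rc = -12
name = errno.errorcode[-rc]   # symbolic name for errno 12
msg = os.strerror(-rc)        # human-readable message (glibc wording)

print(f"{rc} -> {name}: {msg}")   # -12 -> ENOMEM: Cannot allocate memory
```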
Attempts to create new user-space processes were also failing intermittently:
sf_lustre.liblustreCmds 10983 'MainThread' : ("can't start new thread",)
[liblustreCmds.py:216]
and
[faaland1@solfish2 lustre]$git fetch llnlstash
Enter passphrase for key '/g/g0/faaland1/.ssh/swdev':
Enter passphrase for key '/g/g0/faaland1/.ssh/swdev':
remote: Enumerating objects: 1377, done.
remote: Counting objects: 100% (1236/1236), done.
remote: Compressing objects: 100% (271/271), done.
error: cannot fork() for index-pack: Cannot allocate memory
fatal: fetch-pack: unable to fork off index-pack
We wasted a lot of time chasing the idea that this was in fact due to
insufficient free memory on the node, but the actual problem was that sysctl
kernel.pid_max was too low.
When a new process or thread must be created via fork(), kthread_create(), or
similar, the kernel has to allocate a PID. It tracks which PIDs are in use in a
bitmap (the pidmap, in RHEL 7 kernels), and there is some delay after a process
is destroyed before its PID may be reused.
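To see how much headroom a node has, you can compare pid_max against the number of live tasks. This is a rough sketch of my own (not from the post); note that it counts threads as well as processes, since every thread consumes a PID:

```python
import os

# Read the kernel's PID ceiling.
with open("/proc/sys/kernel/pid_max") as f:
    pid_max = int(f.read())

# Every process *and* every thread consumes a PID, so count tasks
# (entries under /proc/<pid>/task), not just processes.
tasks = 0
for entry in os.listdir("/proc"):
    if entry.isdigit():
        try:
            tasks += len(os.listdir(os.path.join("/proc", entry, "task")))
        except OSError:
            pass  # process exited while we were scanning

print(f"pid_max={pid_max}, live tasks={tasks}")
```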
We found that on this node, the kernel would occasionally find no PIDs
available when creating a new process. Specifically, copy_process() would
call alloc_pidmap(), which would return -1. This tended to happen when the
system was processing a large number of changes on the file system, so both
Lustre and Starfish were suddenly doing a lot of work and both would have been
creating new threads in response to the load. This node has about 700-800
processes running normally, according to top(1). At the time these errors
occurred, I don't know how many processes were running or how quickly they
were being created and destroyed.
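The churn itself is easy to observe after the fact: the "processes" line in /proc/stat counts every fork()/clone() since boot, so sampling it twice gives the task-creation rate. A small sketch of my own (not from the post), assuming a Linux /proc:

```python
import time

def forks_since_boot():
    # The "processes" line in /proc/stat is a monotonic count of
    # every fork/clone since boot.
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("processes "):
                return int(line.split()[1])
    raise RuntimeError("no 'processes' line in /proc/stat")

before = forks_since_boot()
time.sleep(1)
after = forks_since_boot()
print(f"~{after - before} tasks created in the last second")
```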
Ftrace showed copy_process() failing in alloc_pid() and unwinding:
| copy_namespaces();
| copy_thread();
| alloc_pid() {
| kmem_cache_alloc() {
| __might_sleep();
| _cond_resched();
| }
| kmem_cache_free();
| }
| exit_task_namespaces() {
| switch_task_namespaces() {
On this particular node, a 32-core x86_64 machine running RHEL 7, pid_max was
about 36K. We added
kernel.pid_max=524288
to our sysctl.conf, which resolved the issue.
I don't expect this to be an issue under RHEL 8 (or the clone of your choice),
because in RHEL 8.2 systemd ships a sysctl config file that sets pid_max to
2^22 (4194304).
-Olaf
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org