It turns out that the Slurm job limits are *not* controlled by the normal
/etc/security/limits.conf configuration. Any service running under
systemd (such as slurmd) has its limits defined by systemd instead; see [1] and [2].
The limits of processes started by slurmd are defined by LimitXXX in
/usr/lib/s
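For illustration, a drop-in override along these lines should raise the
open-files limit for slurmd and everything it spawns (the path and value
here are only an example, not our actual config):

# mkdir -p /etc/systemd/system/slurmd.service.d
# cat > /etc/systemd/system/slurmd.service.d/limits.conf <<'EOF'
[Service]
# Raise RLIMIT_NOFILE for slurmd and all job steps it launches
LimitNOFILE=131072
EOF
# systemctl daemon-reload
# systemctl restart slurmd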
I looked at some of our busy 96-core nodes where users are currently
running the STAR-CCM+ CFD software.
One job runs on four 96-core nodes. I'm amazed that each STAR-CCM+ process
has almost 1000 files open, for example:
$ lsof -p 440938 | wc -l
950
and that on this node the user has al
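To get a per-user total rather than going process by process, a quick
loop over /proc works (the username is a placeholder; note that lsof
also lists mmap'ed libraries, cwd, etc., so its counts run higher than
the raw fd counts):

$ for pid in $(pgrep -u someuser); do ls /proc/$pid/fd 2>/dev/null | wc -l; done \
    | awk '{total += $1} END {print total}'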
Jeffrey T Frey via slurm-users writes:
>> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
>> is per user.
>
> The ulimit is a frontend to rlimit resource limits, which are per-process
> restrictions (not per-user).
You are right; I sit corrected. :)
(Except for number of procs an
As a related point, for this reason I mount /var/log separately from /. Ask
me how I learned that lesson...
Jason
On Tue, Apr 16, 2024 at 8:43 AM Jeffrey T Frey via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.
The ulimit is a frontend to rlimit resource limits, which are per-process
restrictions (not per-user).
The fs.file-max is the kernel's limit on how many file descriptors can be open
in aggregate. You'd have to edit
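To make the per-process vs. node-wide distinction concrete (the PID is
a placeholder):

$ prlimit --pid 12345 --nofile   # that process's soft/hard RLIMIT_NOFILE
$ sysctl fs.file-max             # node-wide ceiling
$ cat /proc/sys/fs/file-nr       # allocated, free, and max file handles

Raising the node-wide limit persistently would look something like this
(the file name and value are only an example):

# echo 'fs.file-max = 1048576' > /etc/sysctl.d/90-file-max.conf
# sysctl -p /etc/sysctl.d/90-file-max.conf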
Ole Holm Nielsen writes:
> Hi Bjørn-Helge,
>
> That sounds interesting, but which limit might affect the kernel's
> fs.file-max? For example, a user already has a narrow limit:
>
> ulimit -n
> 1024
AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
is per user.
Now that I t
Hi Bjørn-Helge,
On 4/16/24 12:08, Bjørn-Helge Mevik via slurm-users wrote:
> Ole Holm Nielsen via slurm-users writes:
>> Therefore I believe that the root cause of the present issue is user
>> applications opening a lot of files on our 96-core nodes, and we need
>> to increase fs.file-max.
>
> You could also set a limit per user, for instance in
> /etc/security/limits.d/.
Ole Holm Nielsen via slurm-users writes:
> Therefore I believe that the root cause of the present issue is user
> applications opening a lot of files on our 96-core nodes, and we need
> to increase fs.file-max.
You could also set a limit per user, for instance in
/etc/security/limits.d/. Then u
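A minimal sketch of such a file (the name and the numbers are just
examples):

# /etc/security/limits.d/90-nofile.conf
*       soft    nofile  4096
*       hard    nofile  65536

Note that pam_limits applies this to login sessions; as noted earlier in
the thread, processes started by slurmd get their limits from the
systemd unit instead.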
Hi Jeffrey,
Thanks a lot for the information:
On 4/15/24 15:40, Jeffrey T Frey wrote:
https://github.com/dun/munge/issues/94
I hadn't seen issue #94 before, and it seems to be relevant to our
problem. It's probably a good idea to upgrade munge beyond what's
supplied by EL8/EL9. We can bui
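For instance, building RPMs from the upstream release tarball should be
straightforward (the URL and version here are my assumptions; 0.5.15 is
the release whose NEWS entry mentions the fix):

$ wget https://github.com/dun/munge/releases/download/munge-0.5.15/munge-0.5.15.tar.xz
$ rpmbuild -tb munge-0.5.15.tar.xz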
https://github.com/dun/munge/issues/94
The NEWS file claims this was fixed in 0.5.15. Since your log doesn't show
the additional strerror() output, you're definitely running an older
version, correct?
If you go on one of the affected nodes and do an `lsof -p <pid>` I'm
betting you'll find a long
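Something along these lines would show whether the daemon is close to
its descriptor limit (assuming munged is the process of interest; adjust
to slurmd as needed):

$ ls /proc/$(pidof munged)/fd | wc -l            # fds currently open
$ grep 'Max open files' /proc/$(pidof munged)/limits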