On a related note, this is why I mount /var/log separately from /. Ask me
how I learned that lesson...

Jason

On Tue, Apr 16, 2024 at 8:43 AM Jeffrey T Frey via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.
>
>
> The ulimit is a frontend to the kernel's resource limits (rlimits), which
> are per-process restrictions (not per-user).
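>
> For example (a quick sketch; the PID is just a placeholder for whatever
> process you want to inspect, e.g. slurmd or a user's job step), you can read
> the limits actually applied to a running process rather than to your own
> shell:
>
> $ prlimit --pid <PID> --nofile
> $ grep 'open files' /proc/<PID>/limits
>
> Both report the soft and hard RLIMIT_NOFILE for that one process, which can
> differ from what "ulimit -n" prints in a login shell.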
>
> fs.file-max is the kernel's limit on how many file descriptors can be open
> in aggregate across the entire node.  You'd have to change that with sysctl:
>
>
> $ sysctl fs.file-max
> fs.file-max = 26161449
>
>
>
> Check e.g. /etc/sysctl.conf or /etc/sysctl.d to see whether a limit other
> than the default has been configured.
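>
> As a sketch (the filename and value here are illustrative, not a
> recommendation), a persistent override would go in a sysctl drop-in,
> applied as root:
>
> $ echo 'fs.file-max = 1048576' > /etc/sysctl.d/90-file-max.conf
> $ sysctl -p /etc/sysctl.d/90-file-max.conf
>
> and "grep -r file-max /etc/sysctl.conf /etc/sysctl.d/" will show you where
> any existing override is coming from.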
>
>
>
>
> But if you have ulimit -n == 1024, then no user should be able to hit
> the fs.file-max limit, even if it is 65536.  (Technically, 96 jobs from
> 96 users each opening their full 1024 files would do it, though, since
> 96 × 1024 = 98,304.)
>
>
> Naturally, since the ulimit is per-process, equating the core count with
> the multiplier isn't valid.  It also assumes Slurm isn't set up to
> oversubscribe CPU resources :-)
>
>
>
> I'm not sure how the number 3092846 got set, since it's not defined in
> /etc/security/limits.conf.  The "ulimit -u" value varies quite a bit among
> our compute nodes, so which dynamic service might be affecting the limits?
>
>
> If the 1024 is a soft limit, you may have users who are raising it to
> arbitrary values themselves, especially since 1024 is somewhat low for the
> more naively written data science Python code I see on our systems.  If
> Slurm is configured to propagate submission-shell ulimits to the runtime
> environment and you allow submission from a variety of nodes/systems, you
> could be seeing myriad limits reconstituted on the compute nodes despite
> the /etc/security/limits.conf settings.
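>
> A couple of quick checks (sketched here; PropagateResourceLimits is the
> relevant slurm.conf parameter, but verify against your own configuration):
>
> $ ulimit -Sn ; ulimit -Hn
> $ srun bash -c 'ulimit -Sn'
> $ scontrol show config | grep -i propagate
>
> The first shows the soft vs. hard limit in the submission shell, the second
> what a job step actually ends up with, and the third whether Slurm is
> propagating rlimits at all (PropagateResourceLimits defaults to ALL).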
>
>
> The main question needing an answer is _what_ process(es) are opening all
> the files on your systems that are faltering.  It's very likely the user
> jobs themselves opening them; I was just hoping to also rule out any bug
> in munged.  Since you're upgrading munged, you'll now get the errno
> associated with the backlog and can confirm EMFILE vs. ENFILE vs. ENOMEM.
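>
> One way to answer that on an affected node (a rough sketch; you'll need
> root to see every process's fd directory):
>
> $ for p in /proc/[0-9]*; do
>       printf '%s %s\n' "$(ls $p/fd 2>/dev/null | wc -l)" "$(cat $p/comm 2>/dev/null)"
>   done | sort -rn | head -20
>
> That lists the processes holding the most open file descriptors, which
> should make it obvious whether it's user jobs or a daemon like munged.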
>


-- 
Jason L. Simms, Ph.D., M.P.H.
Instructor, Department of Languages & Literary Studies
Lafayette College
Pardee Hall | One Pardee Dr, 4th Fl | Easton, PA 18042
Office: Pardee 405
