As a related point, for this reason I mount /var/log separately from /. Ask me how I learned that lesson...
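Something like this, to give an idea (device names and filesystem type are just placeholders), so runaway logging fills its own partition instead of /:

    # /etc/fstab excerpt (illustrative; adjust devices and options for your site)
    /dev/mapper/vg0-root     /          xfs    defaults                      0 1
    /dev/mapper/vg0-varlog   /var/log   xfs    defaults,nodev,nosuid,noexec  0 2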
Jason

On Tue, Apr 16, 2024 at 8:43 AM Jeffrey T Frey via slurm-users <slurm-users@lists.schedmd.com> wrote:

>> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
>> is per user.
>
> The ulimit is a frontend to rusage limits, which are per-process
> restrictions (not per-user).
>
> The fs.file-max is the kernel's limit on how many file descriptors can be
> open in aggregate. You'd have to edit that with sysctl:
>
>     $ sysctl fs.file-max
>     fs.file-max = 26161449
>
> Check in e.g. /etc/sysctl.conf or /etc/sysctl.d if you have an alternative
> limit versus the default.
>
>> But if you have ulimit -n == 1024, then no user should be able to hit
>> the fs.file-max limit, even if it is 65536. (Technically, 96 jobs from
>> 96 users each trying to open 1024 files would do it, though.)
>
> Naturally, since the ulimit is per-process, equating the core count with
> the multiplier isn't valid. It also assumes Slurm isn't set up to
> oversubscribe CPU resources :-)
>
>> I'm not sure how the number 3092846 got set, since it's not defined in
>> /etc/security/limits.conf. The "ulimit -u" varies quite a bit among
>> our compute nodes, so which dynamic service might affect the limits?
>
> If the 1024 is a soft limit, you may have users who are raising it to
> arbitrary values themselves, for example. Especially as 1024 is somewhat
> low for the more naively written data science Python code I see on our
> systems. If Slurm is configured to propagate submission-shell ulimits to
> the runtime environment and you allow submission from a variety of
> nodes/systems, you could be seeing myriad limits reconstituted on the
> compute nodes despite the /etc/security/limits.conf settings.
>
> The main question needing an answer is _what_ process(es) are opening all
> the files on your systems that are faltering. It's very likely to be user
> jobs opening all of them; I was just hoping to also rule out any bug in
> munged. Since you're upgrading munged, you'll now get the errno associated
> with the backlog and can confirm EMFILE vs. ENFILE vs. ENOMEM.
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

--
Jason L. Simms, Ph.D., M.P.H.
Instructor, Department of Languages & Literary Studies
Lafayette College
Pardee Hall | One Pardee Dr, 4th Fl | Easton, PA 18042
Office: Pardee 405
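For reference, a minimal sketch (assuming a typical Linux compute node) of how to compare the per-process and node-wide descriptor limits discussed above, and to see which processes are holding the most descriptors (run the last command as root to see every job's processes):

    $ ulimit -Sn                  # per-process soft limit on open files ("ulimit -n")
    $ ulimit -Hn                  # per-process hard limit
    $ sysctl fs.file-max          # node-wide ceiling on open file descriptors
    $ cat /proc/sys/fs/file-nr    # allocated, free, and maximum descriptors node-wide
    $ for p in /proc/[0-9]*/fd; do printf '%s %s\n' "$(ls "$p" 2>/dev/null | wc -l)" "${p%/fd}"; done | sort -rn | head

If propagated submission-shell limits turn out to be the culprit, one option is to exempt NOFILE from propagation in slurm.conf, so jobs start with the slurmd daemon's open-file limit on the compute node rather than whatever the submitting shell had:

    PropagateResourceLimitsExcept=NOFILE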