Re: [slurm-users] SLURM upgrade from 20.11.3 to 20.11.9 misidentification of job steps

2022-05-19 Thread John DeSantis
ately, the problem remained. Long story short, I upgraded to the latest stable version this morning and the issue appears resolved. Thanks! John DeSantis On 5/19/22 05:41, Luke Sudbery wrote: We ran into a similar issue a while ago (not sure what versions were involved though). Can't guarante

Re: [slurm-users] SLURM upgrade from 20.11.3 to 20.11.9 misidentification of job steps

2022-05-18 Thread John DeSantis
JobAcctGatherParams=OverMemoryKill in our environment to monitor and kill jobs when the physical memory limit has been exceeded. Thank you, John DeSantis On 5/18/22 09:45, John DeSantis wrote: Hello, Due to the recent CVE posted by Tim, we did upgrade from SLURM 20.11.3 to 20.11.9. Today, I receiv

[slurm-users] SLURM upgrade from 20.11.3 to 20.11.9 misidentification of job steps

2022-05-18 Thread John DeSantis
one else seen this? Thank you, John DeSantis

Re: [slurm-users] Fwd: Using PreemptExemptTime

2022-02-03 Thread John DeSantis
that a job was preempted in the application's output, or within the slurmctld logs. When we switched to PreemptExemptTime, all application output and SLURM logs stated preempted as the reason. I know you want to suspend preempted jobs, but what happens if you cancel them instead? HTH, Jo

Re: [slurm-users] OpenMPI interactive change in behavior?

2021-04-28 Thread John DeSantis
c-1057-30-2 > mdc-1057-30-6 Thanks for that suggestion! I imagine that this could be a bug then; that is, specifying "--overlap" with `srun` has no effect while manually setting the variable does. John DeSantis On 4/28/21 11:27 AM, Juergen Salk wrote: > Hi John, > >

Re: [slurm-users] OpenMPI interactive change in behavior?

2021-04-28 Thread John DeSantis
20.02.x, but it seems that this behaviour still exists. Are no other sites on fresh installs of >= SLURM 20.11.3 experiencing this problem? I was aware of the changes in 20.11.{0..2} which received a lot of scrutiny, which is why 20.11.3 was selected. Thanks, John DeSantis On 4/26/21 5:12

[slurm-users] OpenMPI interactive change in behavior?

2021-04-26 Thread John DeSantis
is to use `salloc` first, despite version 17.11.9 not needing `salloc` for "interactive" sessions. Before we go further down this rabbit hole, were other sites affected by a transition from SLURM versions 16.x, 17.x, 18.x(?) to versions 20.x? If so, did the methodology for multinode interactive MPI sessions change? Thanks! John DeSantis

Re: [slurm-users] Multiple job constraints

2018-06-21 Thread John DeSantis
't running, despite "idle" resources, and maintenance of the topology.conf file. John DeSantis On Wed, 20 Jun 2018 12:16:59 -0400 Paul Edmon wrote: > You will get whatever cores Slurm can find which will be an > assortment of hosts. > > -Paul Edmon- > > > On 6/20/201

[slurm-users] Repost: Odd sacct behavior?

2018-05-03 Thread John DeSantis
I'm using slurm 16.05.10-2 and slurmdbd 16.05.10-2. Thanks, John DeSantis

[slurm-users] Odd sacct behavior?

2018-05-02 Thread John DeSantis
I'm using slurm 16.05.10-2 and slurmdbd 16.05.10-2. Thanks, John DeSantis

Re: [slurm-users] MaxSubmitJobsPerUser?

2018-04-10 Thread John DeSantis
e avoid this hassle by ensuring that a user has a default qos, e.g. `sacctmgr add user blah defaultqos=blah fairshare=blah` HTH, John DeSantis On Sat, 7 Apr 2018 16:32:40 + Dmitri Chebotarov wrote: > The MaxSubmitJobsPerUser seems to be working when QOS where > MaxSubmitJobsPerUser is defin

Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some point.

2018-02-14 Thread John DeSantis
because their memory usage spiked during a JobAcctGatherFrequency sampling interval (every 30 seconds, adjusted within slurm.conf). John DeSantis On Wed, 14 Feb 2018 13:05:41 +0100 Loris Bennett wrote: > Geert Kapteijns writes: > > > Hi everyone, > > > > I’m running int

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-17 Thread John DeSantis
ross the 6K > nodes. Ok, that makes sense. Looking initially at your partition definitions, I immediately thought of being DRY, especially since the "finer" tuning between the partitions could easily be controlled via the QOS' allowed to access the resources. John DeSantis O

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread John DeSantis
Matthieu, > I would bet on something like LDAP requests taking too much time > because of a missing sssd cache. Good point! It's easy to forget to check something as "simple" as user look-up when something is taking "too long". John DeSantis On Tue, 16 J

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread John DeSantis
d defining a MaxWall via each QOS (since one partition has 04:00:00 and the other 03:00:00). The same could be done for the partitions skl_fua_{prod,bprod,lprod} as well. HTH, John DeSantis On Tue, 16 Jan 2018 11:22:44 +0100 Alessandro Federico wrote: > Hi, > > setting MessageTimeout

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-12 Thread John DeSantis
g pertaining to the server thread count being over its limit. HTH, John DeSantis On Fri, 12 Jan 2018 11:32:57 +0100 Alessandro Federico wrote: > Hi all, > > > we are setting up SLURM 17.11.2 on a small test cluster of about 100 > nodes. Sometimes we get the error in the subj

Re: [slurm-users] Queue size, slow/unresponsive head node

2018-01-12 Thread John DeSantis
Colas, We had a similar experience a long time ago, and we solved it by adding the following SchedulerParameters: max_rpc_cnt=150,defer HTH, John DeSantis On Thu, 11 Jan 2018 16:39:43 -0500 Colas Rivière wrote: > Hello, > > I'm managing a small cluster (one head node, 24 worke

Re: [slurm-users] [slurm-dev] SLURM 16.05.10-2 jobacct_gather/linux inconsistencies?

2017-12-12 Thread John DeSantis
uxproc; sadly this is an artifact from our testing days - and was never changed after our move to production! RTFM, dude! John DeSantis On Tue, 5 Sep 2017 11:40:15 -0600 John DeSantis wrote: > > Hello all, > > We were recently alerted by a user whose long running jobs (>= 6