ately, the
problem remained. Long story short, I upgraded to the latest stable version
this morning and the issue appears resolved.
Thanks!
John DeSantis
On 5/19/22 05:41, Luke Sudbery wrote:
We ran into a similar issue a while ago (not sure what versions were involved
though). Can't guarante
JobAcctGatherParams=OverMemoryKill in our environment to monitor and kill jobs
when the physical memory limit has been exceeded.
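For anyone finding this in the archives later, a minimal sketch of how that is
wired up in slurm.conf (the gather plugin and the 30-second interval below are
assumptions on my part; only OverMemoryKill comes from the setup described
above):

    # slurm.conf -- sketch, not a verbatim production file
    JobAcctGatherType=jobacct_gather/linux    # assumed plugin choice
    JobAcctGatherFrequency=30                 # assumed sampling interval (seconds)
    JobAcctGatherParams=OverMemoryKill        # kill tasks that exceed requested memory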
Thank you,
John DeSantis
On 5/18/22 09:45, John DeSantis wrote:
Hello,
Due to the recent CVE posted by Tim, we did upgrade from SLURM 20.11.3 to
20.11.9.
Today, I receiv
one else seen this?
Thank you,
John DeSantis
that a
job was preempted in the application's output, or within the slurmctld logs.
When we switched to PreemptExemptTime, all application output and SLURM logs
reported "preempted" as the reason.
I know you want to suspend preempted jobs, but what happens if you cancel them
instead?
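If it helps to have something concrete to test, a rough sketch of that change
(the PreemptType and the exempt time below are assumptions, not your settings):

    # slurm.conf -- sketch
    PreemptType=preempt/partition_prio   # assumed; could equally be preempt/qos
    PreemptMode=CANCEL                   # instead of SUSPEND,GANG
    PreemptExemptTime=00:10:00           # minimum run time before preemption is allowed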
HTH,
John DeSantis
c-1057-30-2
> mdc-1057-30-6
Thanks for that suggestion!
I imagine this could be a bug then: specifying "--overlap" with `srun` has no
effect, while manually setting the variable does.
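For completeness, the comparison I'm describing (assuming the variable in
question is SLURM_OVERLAP, srun's input environment variable for --overlap):

    # inside an existing allocation:
    srun --overlap -n1 hostname     # reportedly has no effect here

    export SLURM_OVERLAP=1
    srun -n1 hostname               # honoured when the variable is set by hand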
John DeSantis
On 4/28/21 11:27 AM, Juergen Salk wrote:
> Hi John,
>
>
20.02.x, but it seems that this
behaviour still exists. Are no other sites on fresh installs of >= SLURM
20.11.3 experiencing this problem?
I was aware of the changes in 20.11.{0..2}, which received a lot of scrutiny;
that is why 20.11.3 was selected.
Thanks,
John DeSantis
On 4/26/21 5:12
is to use
`salloc` first, despite version 17.11.9 not needing `salloc` for an
"interactive" session.
Before we go further down this rabbit hole, were other sites affected by the
transition from SLURM versions 16.x, 17.x, or 18.x(?) to 20.x? If so, did
the methodology for multinode interactive MPI sessions change?
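For clarity, the two workflows being compared look roughly like this (node and
task counts and the MPI binary are placeholders):

    # the 17.11.9-era habit: a bare srun for an interactive multinode step
    srun -N2 -n2 --pty bash

    # what now seems to be required: request the allocation first, then launch steps
    salloc -N2 -n2
    srun ./my_mpi_app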
Thanks!
John DeSantis
't running, despite "idle"
resources, and maintenance of the topology.conf file.
John DeSantis
On Wed, 20 Jun 2018 12:16:59 -0400
Paul Edmon wrote:
> You will get whatever cores Slurm can find which will be an
> assortment of hosts.
>
> -Paul Edmon-
>
>
> On 6/20/201
I'm using slurm 16.05.10-2 and slurmdbd 16.05.10-2.
Thanks,
John DeSantis
We avoid this hassle by ensuring that a user has a default qos, e.g.
`sacctmgr add user blah defaultqos=blah fairshare=blah`
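The same can be done after the fact for existing users, and then verified (the
user and qos names are of course placeholders):

    sacctmgr modify user where name=blah set defaultqos=blah
    sacctmgr show assoc where user=blah format=user,account,defaultqos,qos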
HTH,
John DeSantis
On Sat, 7 Apr 2018 16:32:40 +
Dmitri Chebotarov wrote:
> The MaxSubmitJobsPerUser seems to be working when QOS where
> MaxSubmitJobsPerUser is defin
because their memory usage spiked during a JobAcctGatherFrequency
sampling interval (every 30 seconds, adjusted within slurm.conf).
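For reference, the interval is just a slurm.conf knob; a smaller value catches
shorter spikes at the cost of more polling overhead (sketch):

    # slurm.conf
    JobAcctGatherFrequency=task=30   # seconds between task-usage samples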
John DeSantis
On Wed, 14 Feb 2018 13:05:41 +0100
Loris Bennett wrote:
> Geert Kapteijns writes:
>
> > Hi everyone,
> >
> > I’m running int
ross the 6K
> nodes.
Ok, that makes sense. Looking initially at your partition definitions, I
immediately thought of keeping things DRY, especially since the "finer" tuning
between the partitions could easily be controlled via the QOSes allowed to
access the resources.
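To make that concrete, a sketch of the idea (partition, node, and QOS names
below are made up):

    # slurm.conf: one partition definition instead of several near-copies
    PartitionName=compute Nodes=node[0001-6000] AllowQos=short,long State=UP

    # the finer limits then live on the QOS, not on duplicated partitions:
    sacctmgr modify qos short set MaxWall=04:00:00
    sacctmgr modify qos long set MaxWall=3-00:00:00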
John DeSantis
O
Matthieu,
> I would bet on something like LDAP requests taking too much time
> because of a missing sssd cache.
Good point! It's easy to forget to check something as "simple" as user
look-up when something is taking "too long".
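A quick sanity check along those lines (the username is a placeholder):

    time getent passwd someuser   # slow on a cold or missing sssd cache
    time getent passwd someuser   # should return almost instantly once cached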
John DeSantis
On Tue, 16 J
d defining a MaxWall via each QOS
(since one partition has 04:00:00 and the other 03:00:00).
The same could be done for the partitions skl_fua_{prod,bprod,lprod} as
well.
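In sacctmgr terms that is simply (the QOS names here are hypothetical; the two
limits come from the partition walltimes above):

    sacctmgr modify qos fua_prod set MaxWall=04:00:00
    sacctmgr modify qos fua_bprod set MaxWall=03:00:00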
HTH,
John DeSantis
On Tue, 16 Jan 2018 11:22:44 +0100
Alessandro Federico wrote:
> Hi,
>
> setting MessageTimeout
g pertaining to the server thread count being over its limit.
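For anyone hitting the same thing, the controller's live thread count and RPC
backlog are visible with sdiag on the slurmctld host, which is handy while
tuning:

    sdiag | grep -i 'server thread'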
HTH,
John DeSantis
On Fri, 12 Jan 2018 11:32:57 +0100
Alessandro Federico wrote:
> Hi all,
>
>
> we are setting up SLURM 17.11.2 on a small test cluster of about 100
> nodes. Sometimes we get the error in the subj
Colas,
We had a similar experience a long time ago, and we solved it by adding
the following SchedulerParameters:
max_rpc_cnt=150,defer
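For reference, that goes into slurm.conf and is picked up on a reconfigure (a
sketch; any SchedulerParameters you already have would be merged into the same
line):

    # slurm.conf
    SchedulerParameters=max_rpc_cnt=150,defer

    # on the controller:
    scontrol reconfigure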
HTH,
John DeSantis
On Thu, 11 Jan 2018 16:39:43 -0500
Colas Rivière wrote:
> Hello,
>
> I'm managing a small cluster (one head node, 24 worke
uxproc; sadly this
is an artifact from our testing days - and was never changed after our
move to production!
RTFM, dude!
John DeSantis
On Tue, 5 Sep 2017 11:40:15 -0600
John DeSantis wrote:
>
> Hello all,
>
> We were recently alerted by a user whose long running jobs (>= 6