Hello!
We are having an issue with high-priority GPU jobs blocking low-priority
CPU-only jobs.
Our cluster is set up with one partition, "all". All nodes reside in this
cluster. In this "all" partition we have four generations of compute nodes,
including GPU nodes. We do this to make use of those
Hello,
Has anyone observed "sleep 1" processes on their compute nodes? They
seem to be tied to the slurmstepd extern process in slurm:
4 S root 136777      1  0  80   0 - 73218 do_wai 05:48 ?        00:00:01 slurmstepd: [13220317.extern]
0 S root 136782 136777  0  80   0 - 25229
Hi,
We've just recently installed slurm 17.11.9 and noticed an issue with sshare:
sshare: error: plugin_load_from_file:
dlopen(/usr/lib64/slurm/priority_multifactor.so):
/usr/lib64/slurm/priority_multifactor.so: undefined symbol: sort_part_tier
sshare: error: Couldn't load specified plugin name
-1167
On 8/20/18, 1:21 PM, "slurm-users on behalf of Christopher Benjamin Coffey"
wrote:
Hi,
We've just recently installed slurm 17.11.9 and noticed an issue with
sshare:
sshare: error: plugin_load_from_file:
dlopen(/usr/lib64/slurm/priority_multifactor.so
.11.10.
-Paul Edmon-
On 08/24/2018 02:55 PM, Christopher Benjamin Coffey wrote:
> Odd that no one has this issue. Must be a site issue then? If so, can't
think of what that would be. I suppose we may wait for .10 to be released where
it looks like this may
Hi Jodie,
The only thing that I've gotten working so far is this:
sudo -u slurm bash -c "strigger --set -D -n cn15 -p
/common/adm/slurm/triggers/nodestatus"
So, that will run the nodestatus script which emails when the node cn15 gets
set into drain state. What I'd like to do, which I haven't p
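For reference, a minimal sketch of what such a nodestatus trigger script could
look like (path, recipient, and reporting details are assumptions, not the
actual script); strigger passes the name(s) of the triggering node(s) to the
program as its first argument:
===
#!/bin/bash
# Hypothetical /common/adm/slurm/triggers/nodestatus sketch.
# strigger runs this with the affected node name(s) as the first argument.
NODES="$1"
{
  echo "Node(s) drained: ${NODES}"
  sinfo -R -n "${NODES}"   # include the drain reason recorded by slurm
} | mail -s "Slurm drain alert: ${NODES}" hpc-admins@example.edu
===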
Kilian, thank you very much! Never noticed the perm flag!
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 9/19/18, 10:01 AM, "slurm-users on behalf of Kilian Cavalotti"
wrote:
On Wed, Sep 19, 2018 at 9:21 AM Christophe
Hi David,
I'd recommend the following that I've learned from bad experiences upgrading
between the last major version.
1. Consider upgrading to mysql-server 5.5 or greater
2. Purge/archive unneeded jobs/steps before the upgrade, to make the upgrade as
quick as possible:
slurmdbd.conf:
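For illustration, the kind of purge/archive settings meant here (retention
values are assumptions; pick periods that match your site's policy):
===
# slurmdbd.conf (example retention values, not a recommendation)
ArchiveDir=/var/spool/slurm/archive
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=12months
PurgeStepAfter=6months
PurgeEventAfter=12months
PurgeSuspendAfter=1month
===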
Hi,
I have a user trying to setup a heterogeneous job with one MPI_COMM_WORLD with
the following:
==
#!/bin/bash
#SBATCH --job-name=hetero
#SBATCH --output=/scratch/cbc/hetero.txt
#SBATCH --time=2:00
#SBATCH --workdir=/scratch/cbc
#SBATCH --cpu
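A minimal sketch of that kind of heterogeneous job, assuming the 17.11/18.08
"packjob" syntax; the resource requests and the MPI binary name are
placeholders:
===
#!/bin/bash
#SBATCH --job-name=hetero
#SBATCH --output=/scratch/cbc/hetero.txt
#SBATCH --time=2:00
#SBATCH --ntasks=1 --cpus-per-task=4
#SBATCH packjob
#SBATCH --ntasks=8 --cpus-per-task=1

# Launch a single MPI_COMM_WORLD spanning both components
# (requires an MPI built with matching PMI/PMIx support).
srun --pack-group=0,1 ./my_mpi_app
===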
er if it is ill advised to enable it!? Suppose I could try it. Thanks
Chris!
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 10/10/18, 12:11 AM, "slurm-users on behalf of Chris Samuel"
wrote:
On 10/10/18 05:07, Christophe
In addition, fwiw, this login node will have a second network connection of
course for campus with firewall setup to only allow ssh (and other essential)
from campus. Also you may consider having some script developed to prevent
folks from abusing the login node instead of using slurm for their
Hi,
I can't figure out how one would create a reservation to reserve a gres unit,
such as a gpu. The man page doesn't really say that gres is supported for a
reservation, but it does say tres is supported. Yet, I can't seem to figure out
how one could specify a gpu with tres. I've tried:
scon
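For context, this is roughly the kind of command being attempted (reservation
name, node, and user are placeholders); at the time of this thread it was
reported not to work for GPUs:
===
scontrol create reservation ReservationName=gpu_test \
    StartTime=now Duration=120 Users=cbc Nodes=gpu01 \
    TRES=gres/gpu=1
===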
Hi,
My question is in regard to the scheduling parameter: assoc_limit_stop
"If set and a job cannot start due to association limits, then do not attempt
to initiate any lower priority jobs in that partition. Setting this can
decrease system throughput and utilization, but avoid potentially st
you cannot.
On Mon, Oct 22, 2018, 11:51 Christopher Benjamin Coffey
wrote:
Hi,
I can't figure out how one would create a reservation to reserve a gres
unit, such as a gpu. The man page doesn't really say that gres is supported for
a reservation, but
Can anyone else confirm that it is not possible to reserve a GPU? Seems a bit
strange.
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 10/22/18, 10:01 AM, "slurm-users on behalf of Christopher Benjamin Coffey"
wrote:
Hi,
I gave a presentation at SC in the slurm booth on some slurm job efficiency
tools, and web app that we developed. I figured that maybe others in this group
could be interested too. If you'd like to see the short presentation, and the
tools, and links to them, please see this presentation:
Hi Chris,
Are you using the built in slurm x11 support? Or that spank plugin? We haven't
been able to get the right combo of things in place to get the built in x11 to
work.
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 11/15/18, 5
Hi,
We've been noticing an issue with nodes from time to time that become "wedged",
or unusable. This is a state where ps, and w hang. We've been looking into this
for a while when we get time and finally put some more effort into it
yesterday. We came across this blog which describes almost th
group, to see if it had been
touched by this process.
On Fri, 30 Nov 2018 at 09:31, Ole Holm Nielsen
wrote:
On 29-11-2018 19:27, Christopher Benjamin Coffey wrote:
> We've been noticing an
Is this parameter applied to each cgroup? Or just the system itself? Seems like
just the system itself.
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 12/4/18, 10:13 AM, "slurm-users on behalf of Christopher Benjamin Coffey"
Hi Guys,
It appears that slurm currently doesn't support mysql 8.0. After upgrading from
5.7 to 8.0 slurm commands that hit the db result in:
sacct: error: slurmdbd: "Unknown error 1064"
This is at least true for version 17.11.12. I wonder what the plan is for
slurm to support mariadb, and my
So this issue is occurring only with job arrays.
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl Nelson"
wrote:
Hi folks,
calling sacct with the usercpu flag enable
magnitude.
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey"
wrote:
So this issue is occurring only with job arrays.
—
Christopher Coffey
High-Performance Co
hoping that this note in the 18.08.4 NEWS might have been related:
-- Fix jobacct_gather/cgroup to work correctly when more than one task is
started on a node.
Thanks,
Paddy
On Fri, Jan 04, 2019 at 03:19:18PM +, Christopher Benjamin Coffey wrote:
>
ue?
Just to note: there's a big warning in the man page not to adjust the
value of JobAcctGatherType while there are any running job steps. I'm not
sure if that means just on that node, or any jobs. Probably safest to
schedule a downtime to change it.
Paddy
Thanks... looks like the bug should get some attention now that a paying site
is complaining:
https://bugs.schedmd.com/show_bug.cgi?id=6332
Thanks Jurij!
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 1/9/19, 7:24 AM, "slurm-users on
Hi D.J.,
I noticed you have:
PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
I'm pretty sure it does not make sense to have depth oblivious and fair tree
set at the same time. You'll want to choose one of them. That's not going to be
the reason for the issue however, but you are l
We've attempted setting JobAcctGatherFrequency=task=0 and there is no change.
We have settings:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup
Odd ... wonder why we don't see it help.
Here is how we verify:
===
#!/bin/bash
#SBATCH --
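A hedged sketch of that style of verification (the stress-ng call and the exact
requests are assumptions): burn a known amount of CPU per array task, then
compare what sacct reports:
===
#!/bin/bash
#SBATCH --job-name=cputime-check
#SBATCH --array=1-4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=10:00

# Burn roughly 5 minutes of CPU per array task (assumes stress-ng is installed).
srun stress-ng --cpu 1 --timeout 300s

# Afterwards, compare against accounting, e.g.:
#   sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,UserCPU,SystemCPU
===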
Hi David,
You are welcome. I'm surprised that srun does not work for you. We advise our
users to use srun on every type of job, not just MPI. This in our opinion keeps
it simple, and it just works. What is your MpiDefault set to in slurm.conf? Is
your openmpi built with slurm support? I believe
Hi, we upgraded to 18.08.5 this morning and are seeing odd errors in the
slurmctld logs:
[2019-01-31T08:24:13.684] error: select_nodes: calling _get_req_features() for
JobId=16599048 with not NULL job resources
[2019-01-31T08:24:13.685] error: select_nodes: calling _get_req_features() for
JobId
On 1/31/19, 8:30 AM, "slurm-users on behalf of Christopher Benjamin Coffey"
wrote:
Hi, we upgraded to 18.08.5 this morning and are seeing odd errors in the
slurmctld logs:
[2019-01-31T08:24:13.684] error: select_nodes: calling _get_req_features()
for JobId=16599048 wi
hings appear
to work as normal.
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 1/31/19, 9:23 AM, "slurm-users on behalf of Christopher Samuel"
wrote:
On 1/31/19 8:12 AM, Christopher Benjamin Coffey wrote:
>
Nico, yep that’s a very annoying bug as we do the same here with job
efficiency. It was patched in 18.08.05. However the db still needs to be
cleaned up. We are working on a script to fix this. When we are done, we'll
offer it up to the list.
Best,
Chris
—
Christopher Coffey
High-Performance C
Hi Loris,
Odd, we never saw that issue with memory efficiency being out of whack, just
the cpu efficiency. We are running 18.08.5-2 and here is a 512 core job run
last night:
Job ID: 18096693
Array Job ID: 18096693_5
Cluster: monsoon
User/Group: abc123/cluster
State: COMPLETED (exit code 0)
Nod
00 MB/core)
which looks good. I'll see how it goes with a longer-running job.
Thanks for the input,
Loris
Christopher Benjamin Coffey writes:
> Hi Loris,
>
> Odd, we never saw that issue with memory efficiency being out of whack,
jus
Hi Chad,
My memory is a little hazy on how this was setup but ...
man slurm.conf
MailProg
Fully qualified pathname to the program used to send email per user request.
The default value is "/bin/mail" (or "/usr/bin/mail" if "/bin/mail" does not
exist but "/usr/bin/mail" does exist).
Slurm is ca
Chad,
Hah! Just reread the man page.
If you use this:
MailDomain
Domain name to qualify usernames if email address is not explicitly given with
the "--mail-user" option. If unset, the local MTA will need to qualify local
address itself.
Shouldn't need to worry about the .forward stuff if you
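In other words, something along these lines in slurm.conf (the domain is a
placeholder):
===
# slurm.conf
MailDomain=example.edu
===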
Loris,
Glad you've made some progress.
We finally got it working as well, and have two findings:
1. the login node fqdn must be the same as the compute nodes
2. --x11 is not required to be added to srun and actually causes it to fail for
some reason for us. Very odd, anyone have thoughts?
- No
Hi All,
We created a slurm job script archiver which you may find handy. We initially
attempted to do this through slurm with a slurmctld prolog but it really bogged
the scheduler down. This new solution is a custom c++ program that uses inotify
to watch for job scripts and environment files to
Hi, you may want to look into increasing the sssd cache length on the nodes,
and improving the network connectivity to your ldap directory. I recall when
playing with sssd in the past that it wasn't actually caching. Verify with
tcpdump, and "ls -l" through a directory. Once the uid/gid is resol
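A hedged sketch of lengthening the sssd cache; the section name and value are
assumptions, see sssd.conf(5):
===
# /etc/sssd/sssd.conf
[domain/example.edu]
entry_cache_timeout = 86400    # cache user/group entries for 24 hours

# after changing: systemctl restart sssd (and sss_cache -E to flush old entries)
===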
fied version, let me know.
Kind regards,
Lech
> Am 09.05.2019 um 17:37 schrieb Christopher Benjamin Coffey
:
>
> Hi All,
>
> We created a slurm job script archiver which you may find handy. We
initially attempted to do this through
ted in our modified version, let me know.
Kind regards,
Lech
> Am 09.05.2019 um 17:37 schrieb Christopher Benjamin Coffey
:
>
> Hi All,
>
> We created a slurm job script archiver which you may find handy. We
initially attempted to do this through
Thanks Kevin, we'll put a fix in for that.
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 6/17/19, 12:04 AM, "Kevin Buckley" wrote:
On 2019/05/09 23:37, Christopher Benjamin Coffey wrote:
> Feel free
Hi Kevin,
We fixed the issue on github. Thanks!
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 6/17/19, 8:56 AM, "slurm-users on behalf of Christopher Benjamin Coffey"
wrote:
Thanks Kevin, we'll put a
Hi All,
Has anyone had issues with sshare segfaulting? Specifically with "sshare -l"?
Any suggestions on how to figure this one out? Maybe there is something obvious
I'm not seeing. This has been happening for many slurm versions, I can't recall
when it started. For the last couple versions I'v
Hi,
Excuse me if this has been explained somewhere, I did some searching. With
19.05, is there any reason to have gres.conf on the GPU nodes? Is slurm smart
enough to enumerate the /dev/nvidia* devices? We are moving to 19.05 shortly,
any gotchas with GRES and GPUs? Also, I'm guessing now, the
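For what it's worth, the two usual gres.conf styles in 19.05 (node names, GPU
type, and device paths here are placeholders): an explicit device list still
works, and AutoDetect=nvml can enumerate the GPUs if slurmd was built against
NVML:
===
# gres.conf, explicit style
NodeName=cn[40-43] Name=gpu Type=tesla File=/dev/nvidia[0-3]

# or, 19.05+ with slurmd built against NVML
AutoDetect=nvml
===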
University
928-523-1167
On 8/12/19, 10:28 PM, "slurm-users on behalf of Chris Samuel"
wrote:
On Monday, 12 August 2019 11:42:48 AM PDT Christopher Benjamin Coffey wrote:
> Excuse me if this has been explained somewhere, I did some searching. With
> 19.05, is there an
Hi,
It seems that --workdir= is no longer a valid option in batch jobs and srun in
19.05, and has been replaced by --chdir. I didn't see a change log about this,
did I miss it? Going through the man pages it seems it hasn't existed for some
time now actually! Maybe not since before 17.11 series
lf of Christopher Benjamin Coffey"
wrote:
Hi,
It seems that --workdir= is no longer a valid option in batch jobs and srun
in 19.05, and has been replaced by --chdir. I didn't see a change log about
this, did I miss it? Going through the man pages it seems it hasn't existe
rs on behalf of Christopher Benjamin Coffey"
wrote:
Hmm it seems that a job submit plugin fix will not be possible due to the
attribute being removed from the api
Am I missing something here?
—
Christopher Coffey
High-Performance Computing
Northe
Ya, I saw that it was almost removed before 19.05. I didn't know about the NEWS
file! Yep, it's right there, mea culpa; I'll check that in the future!
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 8/15/19, 11:08 AM, "slurm-users on beha
Hi Marcus,
What is the reason to add "--mem-per-cpu" when the job already has exclusive
access to the node? Your job has access to all of the memory, and all of the
cores on the system already. Also note, for non-MPI code like a single-core job
or a shared-memory threaded job, you want to ask for
20/19 4:58 PM, Christopher Benjamin Coffey wrote:
> Hi Marcus,
>
> What is the reason to add "--mem-per-cpu" when the job already has
exclusive access to the node?
The user (normally) does not set --exclusive directly. We have several
accounts, who
Hi,
Can someone help me understand what this error is?
select/cons_res: node cn95 memory is under-allocated (125000-135000) for
JobId=23544043
We get a lot of these from time to time and I don't understand what it's about?
Looking at the code it doesn't make sense for this to be happening on ru
Hi Paul,
I submitted the poll - thanks! For bug #7609, while I'd be happier with a built
in slurm solution, you may find that our jobscript archiver implementation
would work nicely for you. It is very high-performing and has no effect on the
scheduler, or db performance.
The solution is a mu
Hi Tina,
I think you could just have a qos called "override" that has no limits, or
maybe just high limits. Then, just modify the job's qos to be "override" with
scontrol. Based on your setup, you may also have to update the jobs account to
an "override" type account with no limits.
We do this
Hi,
We've been using jobacct_gather/cgroup for quite some time and haven't had any
issues (I think). We do see some lengthy job cleanup times when there are lots
of small jobs completing at once, maybe that is due to the cgroup plugin. At
SLUG19 a slurm dev presented information that the jobacc
alk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471
* Christopher Benjamin Coffey [191022 16:26]:
> Hi,
>
>
n 10/25/19 1:48 AM, Brian Andrus wrote:
> IIRC, the big difference is if you want to use cgroups on the nodes.
> You must use the cgroup plugin.
>
> Brian Andrus
>
> On 10/24/2019 3:54 PM, Christopher Benjamin Coffey wrote:
>> Hi Juergen,
>>
Brian, I've actually just started attempting to build slurm 19 on centos 8
yesterday. As you say, there are packages missing now from repos like:
rpmbuild -ta slurm-19.05.3-2.tar.bz2 --define '%_with_lua 1' --define
'%_with_x11 1'
warning: Macro expanded in comment on line 22: %_prefix path
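If missing build dependencies are the issue, they may simply live in the CentOS
8 PowerTools repository; the repo name and package list below are assumptions
based on a typical build:
===
dnf install -y dnf-plugins-core
dnf config-manager --set-enabled PowerTools
dnf install -y munge-devel pam-devel readline-devel mariadb-devel lua-devel
===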
Yes, I'd be interested too.
Best,
Chris
--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 10/30/19, 3:54 AM, "slurm-users on behalf of Andy Georges"
wrote:
Hi Brian,
On Mon, Oct 28, 2019 at 10:42:59AM -0700, Brian Andrus wrote:
Hi,
I believe I heard recently that you could limit the number of users' jobs that
accrue age priority points. Yet, I cannot find this option in the man pages.
Anyone have an idea? Thank you!
Best,
Chris
--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-116
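If memory serves, the knob being hinted at is one of the accrue limits added in
18.08, along these lines (qos name and value are placeholders):
===
sacctmgr modify qos where name=normal set MaxJobsAccruePerUser=8
===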
Ahh hah! Thanks Kilian!
Best,
Chris
--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 12/12/19, 3:03 PM, "slurm-users on behalf of Kilian Cavalotti"
wrote:
Hi Chris,
On Thu, Dec 12, 2019 at 10:47 AM Christopher Benja
igh-Performance Computing
Northern Arizona University
928-523-1167
On 12/12/19, 3:23 PM, "slurm-users on behalf of Christopher Benjamin Coffey"
wrote:
Ahh hah! Thanks Kilian!
Best,
Chris
--
Christopher Coffey
High-Performance Computing
No
High-Performance Computing
Northern Arizona University
928-523-1167
On 12/12/19, 10:46 PM, "slurm-users on behalf of Chris Samuel"
wrote:
Hi Chris,
On 12/12/19 3:16 pm, Christopher Benjamin Coffey wrote:
> What am I missing?
It's just a sett
Hi All,
I wonder if any of you have seen these errors in slurmdbd.log
error: persistent connection experienced an error
When we see these errors, we also see job errors related to accounting in
slurm, like:
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should
nev
Hi All,
Thought I'd try this one more time. Anyone have the "assoc_limit_stop" option in
use? Care to try explaining what it does exactly? This doesn't really make a
ton of sense as it is described in the man page:
assoc_limit_stop
If set and a job cannot start due to association li
AM, "slurm-users on behalf of Chris Samuel"
wrote:
On Tuesday, 6 November 2018 5:30:31 AM AEDT Christopher Benjamin Coffey
wrote:
> Can anyone else confirm that it is not possible to reserve a GPU? Seems a
> bit strange.
This looks like the bug that was referred
www.bgsu.edu
Message: 1
Date: Tue, 19 May 2020 18:19:26 +
From: Christopher Benjamin Coffey
To: Slurm User Community List
Subject: Re: [slurm-users] Reserving a GPU
Message-ID: <387dee1d-f060-47c3-afb9-0309684c2...@nau.edu>
Con
Hi All,
Anyone know if it's possible yet to reserve a GPU? Maybe in 20.02? Thanks!
Best,
Chris
--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 5/19/20, 3:04 PM, "slurm-users on behalf of Christopher Benjamin Coffey"
wrote:
Hi Niels,
Have you found a solution? I just noticed this recently as well. We've
traditionally told our users to use --gres=gpu:tesla:# for requesting GPUs.
Then, our job submit plugin would detect the gres ask, specifically gpu, and
set a qos and partition accordingly. Unfortunately I start
Hi,
It doesn't appear to be possible to hide a partition from all normal users but
still allow the slurm admins and condo users to see it. While a partition is
hidden, even a slurm admin has to use "sudo" to see the partition. This
behavior is seen while adding the following t
t it suffice to use the "-a" option, e.g. "sinfo -s -a" or "squeue
-a"?
The admins could create an alias for that.
Best
Marcus
Am 21.01.2021 um 19:15 schrieb Christopher Benjamin Coffey:
> Hi,
>
> It doesn't appear to be pos
Howdy,
With the release of 21.08 series of slurm, we now have the ability to archive
batch scripts within slurm. Yeah, thanks! This is very cool and handy, yet
before this feature was added to slurm, we developed another option that may be
of interest to you. In my opinion, it’s a better one as
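For completeness, the 21.08 built-in archiving mentioned above is enabled
roughly like this (a sketch; check your version's slurm.conf man page):
===
# slurm.conf (21.08+)
AccountingStoreFlags=job_script,job_env

# retrieve later with, e.g.:
#   sacct -j <jobid> --batch-script
#   sacct -j <jobid> --env-vars
===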
Hello!
I'm trying to test an upgrade of our production slurm db on a test cluster.
Specifically I'm trying to verify an update from 20.11.7 to 21.08.4. I have a
dump of the production db, imported as normal. Then I fired up slurmdbd to
perform the conversion. I've verified everything I can th
opher Benjamin Coffey"
wrote:
Hello!
I'm trying to test an upgrade of our production slurm db on a test cluster.
Specifically I'm trying to verify an update from 20.11.7 to 21.08.4. I have a
dump of the production db, imported as normal. Then I fired up slurmdbd to
perform
Hello!
The job_submit plugin doesn't appear to have a way to detect whether a user
requested "--exclusive". Can someone confirm this? Going through the code:
src/plugins/job_submit/lua/job_submit_lua.c I don't see anything related.
Potentially "shared" could be possible in some way. But trials
* or NO_VAL to accept the system default.
* SHARED_FORCE to eliminate user control.
*/
If there’s a case where using “.shared” isn’t working please let us know.
-Greg
From: slurm-users on behalf of
Christopher Benjamin Coffe
* SHARED_FORCE to eliminate user
control. */
If there’s a case where using “.shared” isn’t working please let us
know.
-Greg
From: slurm-users on behalf of
Christopher Benjamin Coffey
Date: Saturday, 19 February 2022 at 3:17 am
To:
Hi Miguel,
This is intriguing, as I didn't know about this possibility of dealing with
fairshare and a limited priority-minutes qos at the same time. How can you
verify how many minutes have been used of this qos that has been set up with
grptresmins? Is that possible? Thanks.
Best,
Chris
--
' format=login,used’
If you are willing to accept some rounding errors!
With slight variations, and some oddities, this can also be used to limit
GPU utilisation, as is in our case as you can deduce from the previous command.
Best,
Miguel Afonso Oliveira
On
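One way I'm aware of to inspect GrpTRESMins consumption, though the exact
syntax may vary by version (the qos name is a placeholder):
===
scontrol show assoc_mgr qos=gpu_minutes flags=qos
===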
Hello,
We have been trying to upgrade slurm on our cluster from 16.05.6 to 17.11.3.
I'm thinking this should be doable? Past upgrades have been a breeze, and I
believe during the last one, the db upgrade took like 25 minutes. Well now, the
db upgrade process is taking far too long. We previousl
g.
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 2/21/18, 5:08 PM, "slurm-users on behalf of Kurt H Maier"
wrote:
On Wed, Feb 21, 2018 at 11:56:38PM +0000, Christopher Benjamin Coffey wrote:
> Hello,
>
>
Loris,
It’s simple: tell folks to use -n only for MPI jobs, and -c otherwise (the
default).
It’s a big deal if folks use -n when it’s not an MPI program. This is because
the non-MPI program is launched n times (instead of once with internal threads)
and will stomp over logs and output files (unco
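A tiny illustration of the difference (binary names are placeholders):
===
#!/bin/bash
#SBATCH -n 8              # MPI program: 8 ranks, launched by srun
srun ./mpi_app
===
#!/bin/bash
#SBATCH -n 1              # non-MPI threaded program: one task...
#SBATCH -c 8              # ...with 8 cores for its threads
srun ./threaded_app
===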
Y chance if future upgrades
> will cause the same problems or if this will become better?
>
> Regards,
> Malte
>
>
>
>
>
>
> Am 22.02.2018 um 01:30 schrieb Christopher Benjamin Coffey:
>> This is great t
we tell everyone to use srun to launch every type of task.
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 2/22/18, 8:25 AM, "slurm-users on behalf of Loris Bennett"
wrote:
Hi, Other Chris,
Chris
l"
wrote:
On Friday, 23 February 2018 7:57:54 AM AEDT Christopher Benjamin Coffey
wrote:
> Yes, maybe that’s true about what you say when not using srun. I'm not
sure,
> as we tell everyone to use srun to launch every type of task.
I've not done that out
Good thought Chris. Yet in our case our system does not have the
spectre/meltdown kernel fix.
Just to update everyone, we performed the upgrade successfully after we first
purged more job/step data. We did the following to ensure the purge happened
right away per Hendryk's recommendation:
Ar
We tell our users to do this:
squeue -h -t R -O gres | grep gpu|wc -l
The command above reports the number of running jobs that have a GPU allocated
(which equals the number of GPUs in use when each job takes a single GPU). If
the number is 16, then all of the GPUs are currently being used. If nothing is
displayed, then all of the GPUs are available.
In our case we have 16 GPUs. Probably
Hi, we have an issue currently where we have a bunch (56K) of runaway jobs, but
we cannot clear them:
sacctmgr show runaway|wc -l
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
58588
Has anyone run
ter storage for the db but does it seem reasonable
for slurm to crash under the circumstances that I mentioned?
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 4/24/18, 10:20 AM, "slurm-users on behalf of Christopher Benjamin Coffey
Hi,
I have a user trying to use %t to split the mpi rank outputs into different
files and it's not working. I verified this too. Any idea why this might be?
This is the first that I've heard of a user trying to do this. Here is an
example job script file:
-
#!/bin/bash
#SBATCH --job-name=m
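If I recall correctly, %t is only expanded by srun's own --output pattern (per
task), not by the #SBATCH --output directive, so a sketch of the working form
(binary name is a placeholder):
===
#!/bin/bash
#SBATCH --job-name=mpi-split
#SBATCH --ntasks=4
#SBATCH --output=mpi-split.log

# request per-rank files at the step level
srun --output=rank-%t.out ./my_mpi_app
===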
Hi, we have an issue currently where we have a bunch of runaway jobs, but we
cannot clear them:
sacctmgr show runaway|wc -l
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
58588
Has anyone run into t
Hi,
We noticed recently that the --uid and --gid functionality changed: previously
a user in the slurm administrators group could launch jobs successfully with
--uid and --gid, allowing them to submit jobs as another user. Now, in order to
use --uid and --gid, you have to be the root user
Thanks Chris! :)
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 5/10/18, 12:42 AM, "slurm-users on behalf of Chris Samuel"
wrote:
On Thursday, 10 May 2018 2:25:49 AM AEST Christopher Benjamin Coffey wrote:
> I have a u
Hi,
I've compiled slurm 17.11.7 with x11 support. We can ssh to a node from the
login node and get xeyes to work, etc. However, srun --x11 xeyes results in:
[cbc@wind ~ ]$ srun --x11 --reservation=root_58 xeyes
X11 connection rejected because of wrong authentication.
Error: Can't open display: l
he x11 connects just fine.
Hadrian
On Thu, Jun 7, 2018 at 6:26 PM, Christopher Benjamin Coffey
wrote:
Hi,
I've compiled slurm 17.11.7 with x11 support. We can ssh to a node from the
login node and get xeyes to work, etc. However, srun --x11 xeyes