[slurm-users] Jobs blocking scheduling progress

2018-07-03 Thread Christopher Benjamin Coffey
Hello! We are having an issue with high priority gpu jobs blocking low priority cpu-only jobs. Our cluster is set up with one partition, "all". All nodes reside in this partition. In this all partition we have four generations of compute nodes, including gpu nodes. We do this to make use of those

[slurm-users] Slurmstepd sleep processes

2018-08-03 Thread Christopher Benjamin Coffey
Hello, Has anyone observed "sleep 1" processes on their compute nodes? They seem to be tied to the slurmstepd extern process in slurm: 4 S root 136777 1 0 80 0 - 73218 do_wai 05:48 ?00:00:01 slurmstepd: [13220317.extern] 0 S root 136782 136777 0 80 0 - 25229

[slurm-users] Slurm 17.11.9, sshare undefined symbol

2018-08-20 Thread Christopher Benjamin Coffey
Hi, We've just recently installed slurm 17.11.9 and noticed an issue with sshare: sshare: error: plugin_load_from_file: dlopen(/usr/lib64/slurm/priority_multifactor.so): /usr/lib64/slurm/priority_multifactor.so: undefined symbol: sort_part_tier sshare: error: Couldn't load specified plugin name

Re: [slurm-users] Slurm 17.11.9, sshare undefined symbol

2018-08-24 Thread Christopher Benjamin Coffey
On 8/20/18, 1:21 PM, "slurm-users on behalf of Christopher Benjamin Coffey" wrote: Hi, We've just recently installed slurm 17.11.9 and noticed an issue with sshare: sshare: error: plugin_load_from_file: dlopen(/usr/lib64/slurm/priority_multifactor.so

Re: [slurm-users] Slurm 17.11.9, sshare undefined symbol

2018-08-24 Thread Christopher Benjamin Coffey
17.11.10. -Paul Edmon- On 08/24/2018 02:55 PM, Christopher Benjamin Coffey wrote: > Odd that no one has this issue. Must be a site issue then? If so, can't think of what that would be. I suppose we may wait for .10 to be released where it looks like this may

Re: [slurm-users] Slurm strigger configuration

2018-09-19 Thread Christopher Benjamin Coffey
Hi Jodie, The only thing that I've gotten working so far is this: sudo -u slurm bash -c "strigger --set -D -n cn15 -p /common/adm/slurm/triggers/nodestatus" So, that will run the nodestatus script which emails when the node cn15 gets set into drain state. What I'd like to do, which I haven't p

Re: [slurm-users] Slurm strigger configuration

2018-09-19 Thread Christopher Benjamin Coffey
Kilian, thank you very much! Never noticed the perm flag! Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 9/19/18, 10:01 AM, "slurm-users on behalf of Kilian Cavalotti" wrote: On Wed, Sep 19, 2018 at 9:21 AM Christophe
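A minimal sketch of the permanent form of that trigger, reusing the node and script names from the earlier post (--flags=PERM keeps the trigger armed after it fires instead of it being purged):

===
sudo -u slurm strigger --set -D -n cn15 --flags=PERM \
    -p /common/adm/slurm/triggers/nodestatus
===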

Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08

2018-09-27 Thread Christopher Benjamin Coffey
Hi David, I'd recommend the following that I've learned from bad experiences upgrading between the last major version. 1. Consider upgrading to mysql-server 5.5 or greater 2. Purge/archive unneeded jobs/steps before the upgrade, to make the upgrade as quick as possible: slurmdbd.conf:
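A hedged slurmdbd.conf sketch of the purge/archive settings point 2 refers to (retention windows are hypothetical; tune them to your site):

===
ArchiveDir=/var/spool/slurm/archive
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=12months
PurgeStepAfter=4months
PurgeEventAfter=6months
===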

[slurm-users] Heterogeneous job one MPI_COMM_WORLD

2018-10-09 Thread Christopher Benjamin Coffey
Hi, I have a user trying to setup a heterogeneous job with one MPI_COMM_WORLD with the following: == #!/bin/bash #SBATCH --job-name=hetero #SBATCH --output=/scratch/cbc/hetero.txt #SBATCH --time=2:00 #SBATCH --workdir=/scratch/cbc #SBATCH --cpu
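The poster's script is truncated above; a minimal sketch of a heterogeneous batch job in the 18.08-era syntax (the packjob separator splits the components, and srun --pack-group runs a single step, hence one MPI_COMM_WORLD, across both; the application name is hypothetical):

===
#!/bin/bash
#SBATCH --job-name=hetero
#SBATCH --time=2:00
#SBATCH --ntasks=1 --cpus-per-task=4    # component 0
#SBATCH packjob
#SBATCH --ntasks=16                     # component 1
srun --pack-group=0,1 ./mpi_app
===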

Re: [slurm-users] Heterogeneous job one MPI_COMM_WORLD

2018-10-10 Thread Christopher Benjamin Coffey
er if it is ill advised to enable it!? Suppose I could try it. Thanks Chris! Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 10/10/18, 12:11 AM, "slurm-users on behalf of Chris Samuel" wrote: On 10/10/18 05:07, Christophe

Re: [slurm-users] Documentation for creating a login node for a SLURM cluster

2018-10-12 Thread Christopher Benjamin Coffey
In addition, fwiw, this login node will of course have a second network connection for campus, with a firewall set up to only allow ssh (and other essentials) from campus. Also you may consider having some script developed to prevent folks from abusing the login node instead of using slurm for their

[slurm-users] Reserving a GPU

2018-10-22 Thread Christopher Benjamin Coffey
Hi, I can't figure out how one would create a reservation to reserve a gres unit, such as a gpu. The man page doesn't really say that gres is supported for a reservation, but it does say tres is supported. Yet, I can't seem to figure out how one could specify a gpu with tres. I've tried: scon

[slurm-users] Meaning of assoc_limit_stop

2018-10-22 Thread Christopher Benjamin Coffey
Hi, My question is in regard to the scheduling parameter: assoc_limit_stop "If set and a job cannot start due to association limits, then do not attempt to initiate any lower priority jobs in that partition. Setting this can decrease system throughput and utilization, but avoid potentially st
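For context, assoc_limit_stop is one of the comma-separated values of SchedulerParameters in slurm.conf:

===
SchedulerParameters=assoc_limit_stop
===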

Re: [slurm-users] Reserving a GPU

2018-10-22 Thread Christopher Benjamin Coffey
you cannot. On Mon, Oct 22, 2018, 11:51 Christopher Benjamin Coffey wrote: Hi, I can't figure out how one would create a reservation to reserve a gres unit, such as a gpu. The man page doesn't really say that gres is supported for a reservation, but

Re: [slurm-users] Reserving a GPU

2018-11-05 Thread Christopher Benjamin Coffey
Can anyone else confirm that it is not possible to reserve a GPU? Seems a bit strange. Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 10/22/18, 10:01 AM, "slurm-users on behalf of Christopher Benjamin Coffey" wrote:

[slurm-users] Slurm Job Efficiency Tools

2018-11-19 Thread Christopher Benjamin Coffey
Hi, I gave a presentation at SC in the slurm booth on some slurm job efficiency tools, and a web app that we developed. I figured that maybe others in this group could be interested too. If you'd like to see the short presentation, and the tools, and links to them, please see this presentation:

Re: [slurm-users] About x11 support

2018-11-20 Thread Christopher Benjamin Coffey
Hi Chris, Are you using the built-in slurm x11 support? Or that spank plugin? We haven't been able to get the right combo of things in place to get the built-in x11 to work. Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 11/15/18, 5

[slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-11-29 Thread Christopher Benjamin Coffey
Hi, We've been noticing an issue with nodes from time to time that become "wedged", or unusable. This is a state where ps and w hang. We've been looking into this for a while when we get time and finally put some more effort into it yesterday. We came across this blog which describes almost th

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-12-04 Thread Christopher Benjamin Coffey
cgroup, to see if it had been touched by this process. On Fri, 30 Nov 2018 at 09:31, Ole Holm Nielsen wrote: On 29-11-2018 19:27, Christopher Benjamin Coffey wrote: > We've been noticing an

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-12-07 Thread Christopher Benjamin Coffey
Is this parameter applied to each cgroup? Or just the system itself? Seems like just the system itself. — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 12/4/18, 10:13 AM, "slurm-users on behalf of Christopher Benjamin Coffey"

[slurm-users] Slurm mysql 8.0

2018-12-14 Thread Christopher Benjamin Coffey
Hi Guys, It appears that slurm currently doesn't support mysql 8.0. After upgrading from 5.7 to 8.0 slurm commands that hit the db result in: sacct: error: slurmdbd: "Unknown error 1064" This is at least true for version 17.11.12. I wonder what the plan is for slurm to support mariadb, and my

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2018-12-21 Thread Christopher Benjamin Coffey
So this issue is occurring only with job arrays. — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl Nelson" wrote: Hi folks, calling sacct with the usercpu flag enable

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2019-01-04 Thread Christopher Benjamin Coffey
magnitude. Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey" wrote: So this issue is occurring only with job arrays. — Christopher Coffey High-Performance Co

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2019-01-04 Thread Christopher Benjamin Coffey
hoping that this note in the 18.08.4 NEWS might have been related: -- Fix jobacct_gather/cgroup to work correctly when more than one task is started on a node. Thanks, Paddy On Fri, Jan 04, 2019 at 03:19:18PM +0000, Christopher Benjamin Coffey wrote:

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2019-01-08 Thread Christopher Benjamin Coffey
ue? Just to note: there's a big warning in the man page not to adjust the value of JobAcctGatherType while there are any running job steps. I'm not sure if that means just on that node, or any jobs. Probably safest to schedule a downtime to change it. Paddy

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu values

2019-01-09 Thread Christopher Benjamin Coffey
Thanks... looks like the bug should get some attention now that a paying site is complaining: https://bugs.schedmd.com/show_bug.cgi?id=6332 Thanks Jurij! Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 1/9/19, 7:24 AM, "slurm-users on

Re: [slurm-users] Larger jobs tend to get starved out on our cluster

2019-01-10 Thread Christopher Benjamin Coffey
Hi D.J., I noticed you have: PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,FAIR_TREE I'm pretty sure it does not make sense to have depth oblivious and fair tree set at the same time. You'll want to choose one of them. That’s not going to be the reason for the issue however, but you are l

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2019-01-10 Thread Christopher Benjamin Coffey
We've attempted setting JobAcctGatherFrequency=task=0 and there is no change. We have settings: ProctrackType=proctrack/cgroup TaskPlugin=task/cgroup,task/affinity JobAcctGatherType=jobacct_gather/cgroup Odd ... wonder why we don't see it help. Here is how we verify: === #!/bin/bash #SBATCH --
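Not the poster's exact verification script (truncated above), but a hedged sketch of the same idea: burn a known amount of CPU in each array task, then compare sacct's UserCPU against Elapsed:

===
#!/bin/bash
#SBATCH --job-name=cputime-check
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --array=1-4
timeout 60 sh -c 'while :; do :; done' || true   # ~60 s of user CPU per task
===

Then: sacct -j <jobid> --format=JobID,Elapsed,UserCPU,TotalCPU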

Re: [slurm-users] Larger jobs tend to get starved out on our cluster

2019-01-14 Thread Christopher Benjamin Coffey
Hi David, You are welcome. I'm surprised that srun does not work for you. We advise our users to use srun on every type of job, not just MPI. This in our opinion keeps it simple, and it just works. What is your MpiDefault set to in slurm.conf? Is your openmpi built with slurm support? I believe

[slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Benjamin Coffey
Hi, we upgraded to 18.08.5 this morning and are seeing odd errors in the slurmctld logs: [2019-01-31T08:24:13.684] error: select_nodes: calling _get_req_features() for JobId=16599048 with not NULL job resources [2019-01-31T08:24:13.685] error: select_nodes: calling _get_req_features() for JobId

Re: [slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Benjamin Coffey
On 1/31/19, 8:30 AM, "slurm-users on behalf of Christopher Benjamin Coffey" wrote: Hi, we upgraded to 18.08.5 this morning and are seeing odd errors in the slurmctld logs: [2019-01-31T08:24:13.684] error: select_nodes: calling _get_req_features() for JobId=16599048 wi

Re: [slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Benjamin Coffey
Things appear to work as normal. Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 1/31/19, 9:23 AM, "slurm-users on behalf of Christopher Samuel" wrote: On 1/31/19 8:12 AM, Christopher Benjamin Coffey wrote: >

Re: [slurm-users] TotalCPU: sacct reporting inexplicable high values

2019-02-01 Thread Christopher Benjamin Coffey
Nico, yep that’s a very annoying bug as we do the same here with job efficiency. It was patched in 18.08.05. However the db still needs to be cleaned up. We are working on a script to fix this. When we are done, we'll offer it up to the list. Best, Chris — Christopher Coffey High-Performance C

Re: [slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-02-26 Thread Christopher Benjamin Coffey
Hi Loris, Odd, we never saw that issue with memory efficiency being out of whack, just the cpu efficiency. We are running 18.08.5-2 and here is a 512 core job run last night: Job ID: 18096693 Array Job ID: 18096693_5 Cluster: monsoon User/Group: abc123/cluster State: COMPLETED (exit code 0) Nod

Re: [slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-03-04 Thread Christopher Benjamin Coffey
00 MB/core) which looks good. I'll see how it goes with longer running job. Thanks for the input, Loris Christopher Benjamin Coffey writes: > Hi Loris, > > Odd, we never saw that issue with memory efficiency being out of whack, jus

Re: [slurm-users] Slurm --mail-to setup help

2019-03-11 Thread Christopher Benjamin Coffey
Hi Chad, My memory is a little hazy on how this was setup but ... man slurm.conf MailProg Fully qualified pathname to the program used to send email per user request. The default value is "/bin/mail" (or "/usr/bin/mail" if "/bin/mail" does not exist but "/usr/bin/mail" does exist). Slurm is ca

Re: [slurm-users] Slurm --mail-to setup help

2019-03-11 Thread Christopher Benjamin Coffey
Chad, Hah! Just reread the man page. If you use this: MailDomain Domain name to qualify usernames if email address is not explicitly given with the "--mail-user" option. If unset, the local MTA will need to qualify local address itself. Shouldn't need to worry about the .forward stuff if you
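Putting the two man-page entries together, a minimal slurm.conf sketch (the domain is hypothetical):

===
MailProg=/usr/bin/mail
MailDomain=example.edu    # qualifies bare usernames given to --mail-user
===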

Re: [slurm-users] X11 forwarding and VNC?

2019-03-22 Thread Christopher Benjamin Coffey
Loris, Glad you've made some progress. We finally got it working as well, and have two findings: 1. the login node fqdn must be the same as the compute nodes 2. --x11 is not required to be added to srun and actually causes it to fail for some reason for us. Very odd, anyone have thoughts? - No

[slurm-users] Slurm Jobscript Archiver

2019-05-09 Thread Christopher Benjamin Coffey
Hi All, We created a slurm job script archiver which you may find handy. We initially attempted to do this through slurm with a slurmctld prolog but it really bogged the scheduler down. This new solution is a custom c++ program that uses inotify to watch for job scripts and environment files to

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Christopher Benjamin Coffey
Hi, you may want to look into increasing the sssd cache length on the nodes, and improving the network connectivity to your ldap directory. I recall when playing with sssd in the past that it wasn't actually caching. Verify with tcpdump, and "ls -l" through a directory. Once the uid/gid is resol
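A hedged sssd.conf sketch of the caching knobs in question (domain name hypothetical, values examples rather than recommendations):

===
[domain/example.com]
cache_credentials = true
entry_cache_timeout = 86400    # seconds to keep cached entries
===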

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-14 Thread Christopher Benjamin Coffey
fied version, let me know. Kind regards, Lech > Am 09.05.2019 um 17:37 schrieb Christopher Benjamin Coffey : > > Hi All, > > We created a slurm job script archiver which you may find handy. We initially attempted to do this through

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-17 Thread Christopher Benjamin Coffey
ted in our modified version, let me know. Kind regards, Lech > Am 09.05.2019 um 17:37 schrieb Christopher Benjamin Coffey : > > Hi All, > > We created a slurm job script archiver which you may find handy. We initially attempted to do this through

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-17 Thread Christopher Benjamin Coffey
Thanks Kevin, we'll put a fix in for that. Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 6/17/19, 12:04 AM, "Kevin Buckley" wrote: On 2019/05/09 23:37, Christopher Benjamin Coffey wrote: > Feel free

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-20 Thread Christopher Benjamin Coffey
Hi Kevin, We fixed the issue on github. Thanks! Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 6/17/19, 8:56 AM, "slurm-users on behalf of Christopher Benjamin Coffey" wrote: Thanks Kevin, we'll put a

[slurm-users] Sshare -l segfaults

2019-07-12 Thread Christopher Benjamin Coffey
Hi All, Has anyone had issues with sshare segfaulting? Specifically with "sshare -l"? Any suggestions on how to figure this one out? Maybe there is something obvious I'm not seeing. This has been happening for many slurm versions, I can't recall when it started. For the last couple versions I'v
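One way to get a first clue, assuming debuginfo for slurm is installed:

===
gdb --args sshare -l
(gdb) run
(gdb) bt        # backtrace at the segfault
===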

[slurm-users] 19.05 and GPUs vs GRES

2019-08-12 Thread Christopher Benjamin Coffey
Hi, Excuse me if this has been explained somewhere, I did some searching. With 19.05, is there any reason to have gres.conf on the GPU nodes? Is slurm smart enough to enumerate the /dev/nvidia* devices? We are moving to 19.05 shortly, any gotchas with GRES and GPUs? Also, I'm guessing now, the
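For what it's worth, 19.05 does gain NVML-based autodetection, but it is still configured in gres.conf; a minimal sketch (requires slurm built against NVML):

===
# gres.conf on the GPU nodes
AutoDetect=nvml
===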

Re: [slurm-users] 19.05 and GPUs vs GRES

2019-08-13 Thread Christopher Benjamin Coffey
University 928-523-1167 On 8/12/19, 10:28 PM, "slurm-users on behalf of Chris Samuel" wrote: On Monday, 12 August 2019 11:42:48 AM PDT Christopher Benjamin Coffey wrote: > Excuse me if this has been explained somewhere, I did some searching. With > 19.05, is there an

[slurm-users] Slurm 19.05 --workdir non existent?

2019-08-14 Thread Christopher Benjamin Coffey
Hi, It seems that --workdir= is no longer a valid option for batch jobs and srun in 19.05, and has been replaced by --chdir. I didn't see a change log entry about this, did I miss it? Going through the man pages it seems it hasn't existed for some time now actually! Maybe not since before the 17.11 series
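The one-line migration, for anyone grepping old job scripts:

===
#SBATCH --workdir=/scratch/cbc    # no longer accepted in 19.05
#SBATCH --chdir=/scratch/cbc      # the replacement
===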

Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-14 Thread Christopher Benjamin Coffey
lf of Christopher Benjamin Coffey" wrote: Hi, It seems that --workdir= is no longer a valid option in batch jobs and srun in 19.05, and has been replaced by --chdir. I didn't see a change log about this, did I miss it? Going through the man pages it seems it hasn't existe

Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-15 Thread Christopher Benjamin Coffey
rs on behalf of Christopher Benjamin Coffey" wrote: Hmm it seems that a job submit plugin fix will not be possible due to the attribute being removed from the api Am I missing something here? — Christopher Coffey High-Performance Computing Northe

Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-15 Thread Christopher Benjamin Coffey
Ya, I saw that it was almost removed before 19.05. I didn't know about the NEWS file! Yep, it's right there, mea culpa; I'll check that in the future! Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 8/15/19, 11:08 AM, "slurm-users on beha

Re: [slurm-users] exclusive or not exclusive, that is the question

2019-08-20 Thread Christopher Benjamin Coffey
Hi Marcus, What is the reason to add "--mem-per-cpu" when the job already has exclusive access to the node? Your job has access to all of the memory and all of the cores on the system already. Also note, for non-MPI code like a single-core job or a shared-memory threaded job, you want to ask for

Re: [slurm-users] exclusive or not exclusive, that is the question

2019-08-21 Thread Christopher Benjamin Coffey
On 8/20/19 4:58 PM, Christopher Benjamin Coffey wrote: > Hi Marcus, > > What is the reason to add "--mem-per-cpu" when the job already has exclusive access to the node? The user (normally) does not set --exclusive directly. We have several accounts, who

[slurm-users] Node resource is under-allocated

2019-08-27 Thread Christopher Benjamin Coffey
Hi, Can someone help me understand what this error is? select/cons_res: node cn95 memory is under-allocated (125000-135000) for JobId=23544043 We get a lot of these from time to time and I don't understand what it's about. Looking at the code it doesn't make sense for this to be happening on ru

Re: [slurm-users] Slurm Feature Poll

2019-08-28 Thread Christopher Benjamin Coffey
Hi Paul, I submitted the poll - thanks! For bug #7609, while I'd be happier with a built in slurm solution, you may find that our jobscript archiver implementation would work nicely for you. It is very high-performing and has no effect on the scheduler, or db performance. The solution is a mu

Re: [slurm-users] One time override to force run job

2019-09-04 Thread Christopher Benjamin Coffey
Hi Tina, I think you could just have a qos called "override" that has no limits, or maybe just high limits. Then, just modify the job's qos to be "override" with scontrol. Based on your setup, you may also have to update the job's account to an "override" type account with no limits. We do this
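A hedged sketch of that setup (names and limits hypothetical):

===
sacctmgr add qos override
sacctmgr modify qos override set Priority=100000
sacctmgr modify user someuser set qos+=override
scontrol update jobid=12345 qos=override
===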

[slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-22 Thread Christopher Benjamin Coffey
Hi, We've been using jobacct_gather/cgroup for quite some time and haven't had any issues (I think). We do see some lengthy job cleanup times when there are lots of small jobs completing at once, maybe that is due to the cgroup plugin. At SLUG19 a slurm dev presented information that the jobacc

Re: [slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-24 Thread Christopher Benjamin Coffey
alk Scientific Software & Compute Services (SSCS) Kommunikations- und Informationszentrum (kiz) Universität Ulm Telefon: +49 (0)731 50-22478 Telefax: +49 (0)731 50-22471 * Christopher Benjamin Coffey [191022 16:26]: > Hi, > >

Re: [slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-29 Thread Christopher Benjamin Coffey
On 10/25/19 1:48 AM, Brian Andrus wrote: > IIRC, the big difference is if you want to use cgroups on the nodes. > You must use the cgroup plugin. > > Brian Andrus > > On 10/24/2019 3:54 PM, Christopher Benjamin Coffey wrote: >> Hi Juergen, >>

Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2019-10-29 Thread Christopher Benjamin Coffey
Brian, I've actually just started attempting to build slurm 19 on centos 8 yesterday. As you say, there are packages missing now from repos like: rpmbuild -ta slurm-19.05.3-2.tar.bz2 --define '%_with_lua 1' --define '%_with_x11 1' warning: Macro expanded in comment on line 22: %_prefix path

Re: [slurm-users] RHEL8 support

2019-10-30 Thread Christopher Benjamin Coffey
Yes, I'd be interested too. Best, Chris -- Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 10/30/19, 3:54 AM, "slurm-users on behalf of Andy Georges" wrote: Hi Brian, On Mon, Oct 28, 2019 at 10:42:59AM -0700, Brian Andrus wrote:

[slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Christopher Benjamin Coffey
Hi, I believe I heard recently that you could limit the number of a user's jobs that accrue age priority points. Yet, I cannot find this option in the man pages. Anyone have an idea? Thank you! Best, Chris -- Christopher Coffey High-Performance Computing Northern Arizona University 928-523-116

Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Christopher Benjamin Coffey
Ahh hah! Thanks Kilian! Best, Chris -- Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 12/12/19, 3:03 PM, "slurm-users on behalf of Kilian Cavalotti" wrote: Hi Chris, On Thu, Dec 12, 2019 at 10:47 AM Christopher Benja
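For the archive: presumably the knob in question is the QOS-level accrue limit added in 18.08 (the value here is hypothetical):

===
sacctmgr modify qos normal set MaxJobsAccruePerUser=8
===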

Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Christopher Benjamin Coffey
High-Performance Computing Northern Arizona University 928-523-1167 On 12/12/19, 3:23 PM, "slurm-users on behalf of Christopher Benjamin Coffey" wrote: Ahh hah! Thanks Kilian! Best, Chris -- Christopher Coffey High-Performance Computing No

Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-13 Thread Christopher Benjamin Coffey
High-Performance Computing Northern Arizona University 928-523-1167 On 12/12/19, 10:46 PM, "slurm-users on behalf of Chris Samuel" wrote: Hi Chris, On 12/12/19 3:16 pm, Christopher Benjamin Coffey wrote: > What am I missing? It's just a sett

[slurm-users] error: persistent connection experienced an error

2019-12-13 Thread Christopher Benjamin Coffey
Hi All, I wonder if any of you have seen these errors in slurmdbd.log error: persistent connection experienced an error When we see these errors, we are seeing job errors with some kind of accounting in slurm like: slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should nev

Re: [slurm-users] Meaning of assoc_limit_stop

2020-01-09 Thread Christopher Benjamin Coffey
Hi All, Thought I'd try this one more time. Anyone have the "assoc_limit_stop" option in use? Care to try explaining what it does exactly? This doesn't really make a ton of sense as it is said in the man page: assoc_limit_stop If set and a job cannot start due to association li

Re: [slurm-users] Reserving a GPU

2020-05-19 Thread Christopher Benjamin Coffey
AM, "slurm-users on behalf of Chris Samuel" wrote: On Tuesday, 6 November 2018 5:30:31 AM AEDT Christopher Benjamin Coffey wrote: > Can anyone else confirm that it is not possible to reserve a GPU? Seems a > bit strange. This looks like the bug that was referred

Re: [slurm-users] Reserving a GPU (Christopher Benjamin Coffey)

2020-05-19 Thread Christopher Benjamin Coffey
www.bgsu.edu Message: 1 Date: Tue, 19 May 2020 18:19:26 +0000 From: Christopher Benjamin Coffey To: Slurm User Community List Subject: Re: [slurm-users] Reserving a GPU Message-ID: <387dee1d-f060-47c3-afb9-0309684c2...@nau.edu> Con

Re: [slurm-users] Reserving a GPU (Christopher Benjamin Coffey)

2020-11-02 Thread Christopher Benjamin Coffey
Hi All, Anyone know if it's possible yet to reserve a gpu? Maybe in 20.02? Thanks! Best, Chris -- Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 5/19/20, 3:04 PM, "slurm-users on behalf of Christopher Benjamin Coffey" wrote:

Re: [slurm-users] Getting --gpus -request in job_submit.lua

2020-11-25 Thread Christopher Benjamin Coffey
Hi Niels, Have you found a solution? I just noticed this recently as well. We've traditionally told our users to use --gres=gpu:tesla:# for requesting gpus. Then, our job submit plugin would detect the gres ask, specifically gpu, and set a qos and partition accordingly. Unfortunately I start
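A hedged job_submit.lua sketch of where the GPU request can land depending on which flag the user chose (field names as exposed in recent releases; the routing target is hypothetical):

===
-- inside slurm_job_submit(job_desc, part_list, submit_uid)
local ask = job_desc.gres             -- --gres=gpu:tesla:1 (older field)
         or job_desc.tres_per_node    -- --gres on newer releases
         or job_desc.tres_per_job     -- --gpus=N
if ask and string.match(ask, "gpu") then
   job_desc.partition = "gpu"
end
===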

[slurm-users] Hidden partition visibility issue

2021-01-21 Thread Christopher Benjamin Coffey
Hi, It doesn't appear to be possible to hide a partition from all normal users, but allow for the slurm admins and condo users to still see. While a partition is hidden, it is required to use "sudo" to see the partition even from a slurm admin. This behavior is seen while adding the following t
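The kind of stanza under discussion, sketched with hypothetical names (the poster's actual config is truncated above):

===
PartitionName=condo Nodes=cn[100-110] Hidden=YES AllowGroups=condo State=UP
===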

Re: [slurm-users] Hidden partition visibility issue

2021-02-03 Thread Christopher Benjamin Coffey
t it suffice to use the "-a" option, e.g. "sinfo -s -a" or "squeue -a"? The admins could create an alias for that. Best Marcus On 21.01.2021 at 19:15, Christopher Benjamin Coffey wrote: > Hi, > > It doesn't appear to be pos

[slurm-users] Another batch script archiver solution

2021-10-05 Thread Christopher Benjamin Coffey
Howdy, With the release of the 21.08 series of slurm, we now have the ability to archive batch scripts within slurm. Yeah, thanks! This is very cool and handy, yet before this feature was added to slurm, we developed another option that may be of interest to you. In my opinion, it’s a better one as
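For comparison, the built-in 21.08 route looks roughly like this (the job id is hypothetical):

===
# slurm.conf
AccountingStoreFlags=job_script,job_env

# retrieval
sacct -j 12345678 --batch-script
sacct -j 12345678 --env-vars
===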

[slurm-users] Issues upgrading db from 20.11.7 -> 21.08.4

2022-02-04 Thread Christopher Benjamin Coffey
Hello! I'm trying to test an upgrade of our production slurm db on a test cluster. Specifically I'm trying to verify an update from 20.11.7 to 21.08.4. I have a dump of the production db, and imported as normal. Then firing up slurmdbd to perform the conversion. I've verified everything I can th

Re: [slurm-users] Issues upgrading db from 20.11.7 -> 21.08.4

2022-02-04 Thread Christopher Benjamin Coffey
opher Benjamin Coffey" wrote: Hello! I'm trying to test an upgrade of our production slurm db on a test cluster. Specifically I'm trying to verify an update from 20.11.7 to 21.08.4. I have a dump of the production db, and imported as normal. Then firing up slurmdbd to perform

[slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-18 Thread Christopher Benjamin Coffey
Hello! The job_submit plugin doesn't appear to have a way to detect whether a user requested "--exclusive". Can someone confirm this? Going through the code: src/plugins/job_submit/lua/job_submit_lua.c I don't see anything related. Potentially "shared" could be possible in some way. But trials

Re: [slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-22 Thread Christopher Benjamin Coffey
* or NO_VAL to accept the system default. * SHARED_FORCE to eliminate user control. */ If there’s a case where using “.shared” isn’t working please let us know. -Greg From: slurm-users on behalf of Christopher Benjamin Coffe
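Distilling the quoted header comment into a job_submit.lua sketch (the constant meanings follow the header; the log text is hypothetical):

===
-- inside slurm_job_submit(job_desc, part_list, submit_uid)
-- job_desc.shared: 0 = --exclusive, 1 = --oversubscribe,
--                  NO_VAL to accept the system default
if job_desc.shared == 0 then
   slurm.log_info("job requested --exclusive")
end
===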

Re: [slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-22 Thread Christopher Benjamin Coffey
* SHARED_FORCE to eliminate user control. */ If there’s a case where using “.shared” isn’t working please let us know. -Greg From: slurm-users on behalf of Christopher Benjamin Coffey Date: Saturday, 19 February 2022 at 3:17 am To:

Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-23 Thread Christopher Benjamin Coffey
Hi Miguel, This is intriguing as I didn't know about this possibility, in dealing with fairshare, and limited priority minutes qos at the same time. How can you verify how many minutes have been used of this qos that has been setup with grptresmins ? Is that possible? Thanks. Best, Chris --

Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-23 Thread Christopher Benjamin Coffey
' format=login,used' If you are willing to accept some rounding errors! With slight variations, and some oddities, this can also be used to limit GPU utilisation, as is in our case as you can deduce from the previous command. Best, Miguel Afonso Oliveira On
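Two hedged ways to read the usage counters back (the account name is hypothetical):

===
scontrol show assoc_mgr accounts=myacct flags=assoc    # live usage held by slurmctld
sshare -A myacct -o Account,User,GrpTRESMins,GrpTRESRaw
===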

[slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-21 Thread Christopher Benjamin Coffey
Hello, We have been trying to upgrade slurm on our cluster from 16.05.6 to 17.11.3. I'm thinking this should be doable? Past upgrades have been a breeze, and I believe during the last one, the db upgrade took like 25 minutes. Well now, the db upgrade process is taking far too long. We previousl

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-21 Thread Christopher Benjamin Coffey
g. Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 2/21/18, 5:08 PM, "slurm-users on behalf of Kurt H Maier" wrote: On Wed, Feb 21, 2018 at 11:56:38PM +0000, Christopher Benjamin Coffey wrote: > Hello, > &

Re: [slurm-users] ntasks and cpus-per-task

2018-02-22 Thread Christopher Benjamin Coffey
Loris, It’s simple: tell folks only to use -n for MPI jobs, and -c otherwise (the default). It’s a big deal if folks use -n when it’s not an MPI program. This is because the non-MPI program is launched n times (instead of once with internal threads) and will stomp over logs and output files (unco
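A minimal pair illustrating the distinction (application names hypothetical):

===
# MPI: srun launches 16 ranks
#SBATCH --ntasks=16
srun ./mpi_app

# threaded: launched once, with 16 cores for its threads
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./threaded_app
===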

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-22 Thread Christopher Benjamin Coffey
Any chance future upgrades > will cause the same problems, or will this become better? > > Regards, > Malte > > > > > > > On 22.02.2018 at 01:30, Christopher Benjamin Coffey wrote: >> This is great t

Re: [slurm-users] ntasks and cpus-per-task

2018-02-22 Thread Christopher Benjamin Coffey
we tell everyone to use srun to launch every type of task. Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 2/22/18, 8:25 AM, "slurm-users on behalf of Loris Bennett" wrote: Hi, Other Chris, Chris

Re: [slurm-users] ntasks and cpus-per-task

2018-02-26 Thread Christopher Benjamin Coffey
l" wrote: On Friday, 23 February 2018 7:57:54 AM AEDT Christopher Benjamin Coffey wrote: > Yes, maybe that’s true about what you say when not using srun. I'm not sure, > as we tell everyone to use srun to launch every type of task. I've not done that out

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-26 Thread Christopher Benjamin Coffey
Good thought Chris. Yet in our case our system does not have the spectre/meltdown kernel fix. Just to update everyone, we performed the upgrade successfully after we purged more data jobs/steps first. We did the following to ensure the purge happened right away per Hendryk's recommendation: Ar

Re: [slurm-users] Available gpus ?

2018-03-16 Thread Christopher Benjamin Coffey
We tell our users to do this: squeue -h -t R -O gres | grep gpu|wc -l The command above will report the number of GPUs in use. If the number is 16, then all of the GPUs are currently being used. If nothing is displayed, then all of the GPUs are available. In our case we have 16 GPU's. Probably

[slurm-users] Runaway jobs issue: Resource temporarily unavailable, slurm 17.11.3

2018-04-24 Thread Christopher Benjamin Coffey
Hi, we have an issue currently where we have a bunch (56K) of runaway jobs, but we cannot clear them: sacctmgr show runaway|wc -l sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable 58588 Has anyone run

Re: [slurm-users] Runaway jobs issue: Resource temporarily unavailable, slurm 17.11.3

2018-04-24 Thread Christopher Benjamin Coffey
ter storage for the db but does it seem reasonable for slurm to crash under the circumstances that I mentioned? Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 4/24/18, 10:20 AM, "slurm-users on behalf of Christopher Benjamin Coffey

[slurm-users] Splitting mpi rank output

2018-05-09 Thread Christopher Benjamin Coffey
Hi, I have a user trying to use %t to split the mpi rank outputs into different files and it's not working. I verified this too. Any idea why this might be? This is the first that I've heard of a user trying to do this. Here is an example job script file: - #!/bin/bash #SBATCH --job-name=m
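For the record, %t is a per-task filename pattern, so it needs to reach srun's --output rather than sbatch's; a minimal sketch (application name hypothetical):

===
#!/bin/bash
#SBATCH --job-name=mpi
#SBATCH --ntasks=4
srun --output=rank-%t.out ./mpi_app
===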

[slurm-users] Runaway jobs issue, slurm 17.11.3

2018-05-09 Thread Christopher Benjamin Coffey
Hi, we have an issue currently where we have a bunch of runaway jobs, but we cannot clear them: sacctmgr show runaway|wc -l sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable 58588 Has anyone run into t

[slurm-users] --uid, --gid option is root only now :'(

2018-05-10 Thread Christopher Benjamin Coffey
Hi, We noticed that recently --uid and --gid functionality changed: previously a user in the slurm administrators group could launch jobs successfully with --uid and --gid, allowing them to submit jobs as another user. Now, in order to use --uid, --gid, you have to be the root user

Re: [slurm-users] Splitting mpi rank output

2018-05-14 Thread Christopher Benjamin Coffey
Thanks Chris! :) — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 5/10/18, 12:42 AM, "slurm-users on behalf of Chris Samuel" wrote: On Thursday, 10 May 2018 2:25:49 AM AEST Christopher Benjamin Coffey wrote: > I have a u

[slurm-users] srun --x11 connection rejected because of wrong authentication

2018-06-07 Thread Christopher Benjamin Coffey
Hi, I've compiled slurm 17.11.7 with x11 support. We can ssh to a node from the login node and get xeyes to work, etc. However, srun --x11 xeyes results in: [cbc@wind ~ ]$ srun --x11 --reservation=root_58 xeyes X11 connection rejected because of wrong authentication. Error: Can't open display: l
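One ingredient that is easy to miss with the built-in support (17.11+): it must be switched on in slurm.conf, and, if memory serves, the 17.11 implementation rode over libssh2 and so also needed working passphrase-less user SSH keys on the nodes:

===
# slurm.conf
PrologFlags=x11
===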

Re: [slurm-users] srun --x11 connection rejected because of wrong authentication

2018-06-11 Thread Christopher Benjamin Coffey
the x11 connects just fine. Hadrian On Thu, Jun 7, 2018 at 6:26 PM, Christopher Benjamin Coffey wrote: Hi, I've compiled slurm 17.11.7 with x11 support. We can ssh to a node from the login node and get xeyes to work, etc. However, srun --x11 xeyes