Re: [slurm-users] cpu limit issue

2018-07-10 Thread Renfro, Michael
Gaussian? Look for NProc=8 or similar lines (NProcShared; there could be other options, too) in their input files. There could also be some system-wide parallel settings for Gaussian, but that wouldn’t be the default. > On Jul 10, 2018, at 2:04 PM, Mahmood Naderan wrote: > > Hi, > I see that althoug
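[Editor's note: a minimal sketch for auditing Gaussian inputs for their own thread settings, so they can be compared against the CPUs the job requested from Slurm. Paths and file extensions are assumptions.]
=====
# Find Gaussian Link 0 lines that set thread counts; inputs are assumed
# to live under /home and end in .com or .gjf.
grep -riE '^%nproc(shared)?=' /home/*/*.com /home/*/*.gjf
=====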

Re: [slurm-users] cpu limit issue

2018-07-11 Thread Renfro, Michael
Looking at your script, there’s a chance that by only specifying ntasks instead of ntasks-per-node or a similar parameter, you might have allocated 8 CPUs on one node, and the remaining 4 on another. Regardless, I’ve dug into my Gaussian documentation, and here’s my test case for you to see wha
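[Editor's note: a hedged sketch of the distinction being made; the binary name is a placeholder.]
=====
#!/bin/bash
# --ntasks=12 alone lets Slurm split the allocation across nodes
# (e.g., 8 CPUs on one node and 4 on another). Pinning the layout:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
srun ./my_mpi_program   # placeholder binary
=====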

Re: [slurm-users] siesta jobs with slurm, an issue

2018-07-22 Thread Renfro, Michael
You’re getting the same fundamental error in both the interactive and batch version, though. The ‘reinit: Reading from standard input’ line seemed off, since you were providing an argument for the input file. But all the references I find to running Siesta in their manual (section 3 and section

Re: [slurm-users] [External] Re: serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7

2018-08-30 Thread Renfro, Michael
Chris’ method will set CUDA_VISIBLE_DEVICES like you’re used to, and it will help keep you or your users from picking conflicting devices. My cgroup/GPU settings from slurm.conf: = [renfro@login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#' ProctrackType=proctrack/cgroup
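[Editor's note: the quoted egrep output is truncated above; a sketch of typical settings in the same spirit, not necessarily this site's exact files.]
=====
# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
GresTypes=gpu

# cgroup.conf -- ConstrainDevices is what hides unallocated GPUs, so
# each job's CUDA_VISIBLE_DEVICES only covers what it was granted.
ConstrainDevices=yes
=====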

[slurm-users] Recovering from network failures in Slurm (without killing or restarting active jobs)

2018-08-31 Thread Renfro, Michael
Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs provided by Bright Computing, if it matters) with both gigabit Ethernet and Infiniband interfaces. Twice in the last year, I’ve had a failure inside the stacked Ethernet switches that’s caused Slurm to lose track of node and job state. Jobs kept r

Re: [slurm-users] Defining constraints for job dispatching

2018-09-01 Thread Renfro, Michael
Depending on the scale (what percent are Fluent users, how many nodes you have), you could use exclusive mode on either a per-partition or per-job basis. Here, my (currently few) Fluent users do all their GUI work off the cluster, and just submit batch jobs using the generated case and data file
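[Editor's note: a sketch of the two exclusive-mode options mentioned; partition and node names are assumptions.]
=====
# Per-job: ask for whole nodes at submission time.
sbatch --exclusive fluent_case.sh   # fluent_case.sh is hypothetical

# Per-partition (slurm.conf): every job in the partition gets whole
# nodes, whether it asks or not.
PartitionName=fluent Nodes=node[001-004] OverSubscribe=EXCLUSIVE
=====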

Re: [slurm-users] How to set priorities of actual jobs

2018-09-14 Thread Renfro, Michael
A 'nice -n 19' process will still consume 100% of the CPU if nothing else is going on. ‘top’ output from a dual-core system with 3 ‘dd’ processes -- 2 with default nice value of 0, and 1 with a nice value of 19: = PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

Re: [slurm-users] Setting up a separate timeout for interactive jobs

2018-09-19 Thread Renfro, Michael
We have multiple partitions using the same nodes. The interactive partition is high priority and limited on time and resources. The batch partition is low priority and has looser time and resource restrictions. And we have a shell function that calls srun --partition=interactive --pty $SHELL to m
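[Editor's note: a reconstruction of the shell function described; the name matches later posts in this archive, but the exact body is an assumption.]
=====
hpcshell () {
  # Land the user in an interactive shell on the high-priority,
  # time- and resource-limited interactive partition.
  srun --partition=interactive "$@" --pty "$SHELL" -i
}
=====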

Re: [slurm-users] Setting up a separate timeout for interactive jobs

2018-09-19 Thread Renfro, Michael
…wrote: Thanks for your response Mike. I have a follow-up question for this approach. How do you restrict someone from starting an interactive session on the "batch" partition? On Wed, Sep 19, 2018 at 12:50 PM Renfro, Michael mailto:ren...@tntech.edu>> wrote: We ha

Re: [slurm-users] Defining constraints for job dispatching

2018-09-20 Thread Renfro, Michael
…partitions? Currently > they use > > srun -n 1 -c 6 --x11 -A monthly -p CAT --mem=32GB ./fluent.sh > > where fluent.sh is > > #!/bin/bash > unset SLURM_GTIDS > /state/partition1/ansys_inc/v140/fluent/bin/fluent > > > Regards, > Mahmood > > > >

Re: [slurm-users] swap size

2018-09-22 Thread Renfro, Michael
If your workflows are primarily CPU-bound rather than memory-bound, and since you’re the only user, you could ensure all your Slurm scripts ‘nice’ their Python commands, or use the -n flag for slurmd and the PropagatePrioProcess configuration parameter. Both of these are in the thread at https:

Re: [slurm-users] Job allocating more CPUs than requested

2018-09-22 Thread Renfro, Michael
Anecdotally, I’ve had a user cause load averages of 10x the node’s core count. The user caught it and cancelled the job before I noticed it myself. Where I’ve seen it happen live on less severe cases, I’ve never noticed anything other than the excessive load average. Viewed from ‘top’, the offen

Re: [slurm-users] maintenance partitions?

2018-10-05 Thread Renfro, Michael
A reservation overlapping with times you have the node in drain? Drain and reserve: # scontrol update nodename=node[037] state=drain reason="testing" # scontrol create reservation users=renfro reservationname='drain_test' nodes=node[037] starttime=2018-10-05T08:17:00 endtime=2018-10-05T09:00:00
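[Editor's note: for completeness, the corresponding cleanup once testing is done — a sketch reusing the names from the commands above.]
=====
# scontrol delete reservationname=drain_test
# scontrol update nodename=node[037] state=resume
=====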

[slurm-users] job_submit.lua example (routes to alternative partitions based off GPU reservations and core requirements)

2018-10-15 Thread Renfro, Michael
Hey, folks. Been working on a job submit filter to let us use otherwise idle cores in our GPU nodes. We’ve got 40 non-GPU nodes and 4 GPU nodes deployed, each has 28 cores. We’ve had a set of partitions for the non-GPU nodes (batch, interactive, and debug), and another set of partitions for the

Re: [slurm-users] Accounting: set default account with no access

2018-11-05 Thread Renfro, Michael
From https://stackoverflow.com/a/46176694: >> I had the same requirement to force users to specify accounts and, after >> finding several ways to fulfill it with slurm, I decided to revive this post >> with the shortest/easiest solution. >> >> The slurm lua submit plugin sees the job descripti

Re: [slurm-users] Account not permitted to use this partition

2018-12-03 Thread Renfro, Michael
What does scontrol show partition EMERALD give you? I’m assuming its AllowAccounts output won’t match your /etc/slurm/parts settings. > On Dec 2, 2018, at 12:34 AM, Mahmood Naderan wrote: > > Hi > Although I have created an account and associated that to a partition, but > the submitted job re

Re: [slurm-users] CPU & memory usage summary for a job

2018-12-09 Thread Renfro, Michael
For the simpler questions (for the overall job step, not real-time), you can 'sacct --format=all’ to get data on completed jobs, and then: - compare the MaxRSS column to the ReqMem column to see how far off their memory request was - compare the TotalCPU column to the product of the NCPUS and El
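[Editor's note: a hedged example of pulling just those columns for one finished job; the job ID is a placeholder.]
=====
sacct -j 123456 --format=JobID,ReqMem,MaxRSS,NCPUS,Elapsed,TotalCPU
# MaxRSS vs ReqMem            -> how accurate the memory request was
# TotalCPU vs NCPUS * Elapsed -> how busy the reserved CPUs were
=====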

Re: [slurm-users] requesting resources and afterwards launch an array of calculations

2018-12-19 Thread Renfro, Michael
Literal job arrays are built into Slurm: https://slurm.schedmd.com/job_array.html Alternatively, if you wanted to allocate a set of CPUs for a parallel task, and then run a set of single-CPU tasks in the same job, something like: #!/bin/bash #SBATCH --ntasks=30 srun --ntasks=${SLURM_NTASK
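[Editor's note: a sketch completing the truncated script above; program names are placeholders, and step-level srun --exclusive here requests dedicated CPUs per step, not whole nodes.]
=====
#!/bin/bash
#SBATCH --ntasks=30
# One 30-way parallel step...
srun --ntasks=${SLURM_NTASKS} ./parallel_setup
# ...then single-CPU steps packed into the same allocation.
for i in $(seq 1 ${SLURM_NTASKS}); do
  srun --ntasks=1 --exclusive ./serial_task "$i" &
done
wait
=====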

Re: [slurm-users] salloc with bash scripts problem

2019-01-02 Thread Renfro, Michael
Not sure what the reasons are behind “have to manually ssh to a node”, but salloc and srun can be used to allocate resources and run commands on the allocated resources: Before allocation, regular commands run locally, and no Slurm-related variables are present: = [renfro@login ~]$ hostname l
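[Editor's note: an illustrative session in the same style; the job ID and hostnames are made up.]
=====
[renfro@login ~]$ salloc --ntasks=2
salloc: Granted job allocation 12345
[renfro@login ~]$ srun hostname          # runs on the allocated node(s)
node001
node002
[renfro@login ~]$ exit                   # releases the allocation
=====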

Re: [slurm-users] salloc with bash scripts problem

2019-01-03 Thread Renfro, Michael
Those errors appear to pop up when qemu can’t find enough RAM to run. If the #SBATCH lines are only applicable for ‘sbatch’ and not ‘srun’ or ‘salloc’, the ‘--mem=8G’ setting there doesn’t affect anything. - Does the srun version of the command work if you specify 'qemu-system-x86_64 -m 2048' o

[slurm-users] DenyOnLimit flag ignored for QOS, always rejects?

2019-01-25 Thread Renfro, Michael
Hey, folks. Running 17.02.10 with Bright Cluster Manager 8.0. I wanted to limit queue-stuffing on my GPU nodes, similar to what AssocGrpCPURunMinutesLimit does. The current goal is to restrict a user to having 8 active or queued jobs in the production GPU partition, and block (not reject) other

Re: [slurm-users] DenyOnLimit flag ignored for QOS, always rejects?

2019-01-25 Thread Renfro, Michael
) 150677 gpu omp_hw.sh renfro R 0:06 11 4000M gpunode001 (null) $ scancel -u $USER -p gpu > On Jan 25, 2019, at 10:35 AM, Renfro, Michael wrote: > > Hey, folks. Running 17.02.10 with Bright Cluster Manager 8.0. > > I wanted to limit queue-stu

Re: [slurm-users] Assigning a QOS to a partition?

2019-01-30 Thread Renfro, Michael
In case you haven’t already done something similar, I reduced some of the cumbersome-ness of my job_submit.lua by breaking it out into subsidiary functions, and adding some logic to detect if I was in test mode or not. Basic structure, with subsidiary functions defined ahead of slurm_job_submit(

Re: [slurm-users] Slurm configuration on multi computers with ldap and dedicated resources

2019-02-11 Thread Renfro, Michael
I’m assuming you have LDAP and Slurm already working on all your nodes, and want to restrict access to two of the nodes based off of Unix group membership, while letting all users access the rest of the nodes. If that’s the case, you should be able to put the two towers into a separate partitio

Re: [slurm-users] Define variables within slurm script

2019-02-18 Thread Renfro, Michael
If you’re literally putting spaces around the ‘=’ character, I don’t think that’s valid shell syntax, and it should throw errors into your slurm-JOBID.out file when you try it. See if it works with A=1.0 instead of A = 1.0 > On Feb 18, 2019, at 7:55 AM, Castellana Michele > wrote: > > External

Re: [slurm-users] How to deal with jobs that need to be restarted several time

2019-03-12 Thread Renfro, Michael
If the failures happen right after the job starts (or close enough), I’d use an interactive session with srun (or some other wrapper that calls srun, such as fisbatch). Our hpcshell wrapper for srun is just a bash function: = hpcshell () { srun --partition=interactive $@ --pty bash -i

Re: [slurm-users] GPUs as resources which SLURM can control

2019-03-20 Thread Renfro, Michael
I think all you’re looking for is Generic Resource (GRES) scheduling, starting at https://slurm.schedmd.com/gres.html — if you’ve already seen that, then more details would be helpful. If it all works correctly, then ‘sbatch --gres=gpu scriptname’ should run up to 4 of those jobs and leave the

Re: [slurm-users] Not able to allocate all 24 ntasks-per-node; slurm.conf appears correct

2019-03-27 Thread Renfro, Michael
Can a second user allocate anything on node fl01 after the first user requests their 12 tasks per node? If not, then it looks like tasks are being tied to physical cores, and not a hyperthreaded version of a core. -- Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services

Re: [slurm-users] Limit concurrent gpu resources

2019-04-24 Thread Renfro, Michael
We put a ‘gpu’ QOS on all our GPU partitions, and limit jobs per user to 8 (our GPU capacity) via MaxJobsPerUser. Extra jobs get blocked, allowing other users to queue jobs ahead of the extras. # sacctmgr show qos gpu format=name,maxjobspu Name MaxJobsPU ---------- --------- gpu
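[Editor's note: a sketch of setting this up; the node list is an assumption, and the limit shows up in sacctmgr output as MaxJobsPU.]
=====
sacctmgr add qos gpu
sacctmgr modify qos gpu set MaxJobsPerUser=8

# slurm.conf: attach the QOS to the GPU partition(s)
PartitionName=gpu Nodes=gpunode[001-004] QOS=gpu
=====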

Re: [slurm-users] Where to adjust the memory limit from sinfo vs free command?

2019-05-16 Thread Renfro, Michael
Should be set on your NodeName lines in slurm.conf. For a 256 GB node, I’ve got: NodeName=node038 CoresPerSocket=14 RealMemory=254000 Sockets=2 ThreadsPerCore=1 so that users can’t reserve every bit of physical memory, leaving a small amount for OS operation. > On May 16, 2019, at 3:47 PM,
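[Editor's note: to pick a safe RealMemory value, slurmd can report what it detects on the node itself; the output below is abbreviated and illustrative.]
=====
[root@node038 ~]# slurmd -C
NodeName=node038 CPUs=28 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 RealMemory=257842
# Set slurm.conf's RealMemory a few GB below the reported figure.
=====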

Re: [slurm-users] spawning a new terminal for each srun

2019-06-29 Thread Renfro, Michael
Is this output file being written to a central file server that can be accessed from your submit host? If so, start another ssh session from your local computer to the submit host. Is the output file being written to a location only accessible from the compute node running your job? You might b

Re: [slurm-users] Specify number of cores only?

2019-07-10 Thread Renfro, Michael
ntasks=N as an argument to sbatch or srun? Should work as long as you don’t have exclusive node settings. From our setup: [renfro@login ~]$ hpcshell --ntasks=16 # hpcshell is a shell function for 'srun --partition=interactive $@ --pty bash -i' [renfro@gpunode001(job 202002) ~]$ srun hostname | s

Re: [slurm-users] Heterogeneous HPC

2019-09-19 Thread Renfro, Michael
MATLAB container at NVIDIA’s NGC: https://ngc.nvidia.com/catalog/containers/partners:matlab Should be compatible with Docker and Singularity, but read the fine print on licensing. > On Sep 19, 2019, at 8:22 AM, Thomas M. Payerle wrote: > > While I agree containers can be quite useful in HPC e

Re: [slurm-users] Heterogeneous HPC

2019-09-19 Thread Renfro, Michael
Never used Rocks, but as far as Slurm or anything else is concerned, Singularity is just another program. It will need to be accessible from any compute nodes you want to use it on (whether that’s from OS-installed packages, from a shared NFS area, or whatever shouldn’t matter). So your user wi

Re: [slurm-users] Status of BLCR?

2019-10-04 Thread Renfro, Michael
DMTCP might be an option? Pretty sure there are RPMs for it in RHEL/CentOS 7. Don’t recall it being any trouble to install. http://dmtcp.sourceforge.net/ On Oct 4, 2019, at 9:47 PM, Eliot Moss mailto:m...@cs.umass.edu>> wrote: Dear slurm users -- I'm new to slurm (somewhat experienced with Gr

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-07 Thread Renfro, Michael
Our cgroup settings are quite a bit different, and we don’t allow jobs to swap, but the following works to limit memory here (I know, because I get frequent emails from users who don’t change their jobs from the default 2 GB per CPU that we use): CgroupMountpoint="/sys/fs/cgroup" CgroupA
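[Editor's note: a sketch of cgroup.conf settings in the same spirit as the truncated quote above; values are examples, not necessarily this site's exact file.]
=====
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
ConstrainRAMSpace=yes
# No swapping allowed, so jobs hit their memory limit promptly:
ConstrainSwapSpace=yes
AllowedSwapSpace=0
=====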

Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

2019-11-13 Thread Renfro, Michael
Pretty sure you don’t need to explicitly specify GPU IDs on a Gromacs job running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you have reserved for that job. Here’s a verification program you can run to confirm that two different GPU jobs see different GPU devices (compile with

Re: [slurm-users] slurm reporting

2019-11-26 Thread Renfro, Michael
> • Total number of jobs submitted by user (daily/weekly/monthly) > • Average queue time per user (daily/weekly/monthly) > • Average job run time per user (daily/weekly/monthly) Open XDMoD for these three. https://github.com/ubccr/xdmod , plus https://xdmod.ccr.buffalo.edu (unfo

Re: [slurm-users] slurm reporting

2019-11-26 Thread Renfro, Michael
…XDMoD this morning while "searching" for further info... Would Grafana do a similar job as XDMoD? -Original Message- From: slurm-users mailto:slurm-users-boun...@lists.schedmd.com>> On Behalf Of Renfro, Michael Sent: 26 November 2019 16:14 To: Slurm User Community Li

Re: [slurm-users] Slurm configuration, Weight Parameter

2019-11-30 Thread Renfro, Michael
We’ve been using that weighting scheme for a year or so, and it works as expected. Not sure how Slurm would react to multiple NodeName=DEFAULT lines like you have, but here’s our node settings and a subset of our partition settings. In our environment, we’d often have lots of idle cores on GPU

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Renfro, Michael
What do you get from systemctl status slurmdbd systemctl status slurmctld I’m assuming at least slurmdbd isn’t running. > On Dec 10, 2019, at 3:05 PM, Dean Schulze wrote: > > External Email Warning > This email originated from outside the university. Please use caution when > opening attachme

Re: [slurm-users] Lua jobsubmit plugin for cons_tres ?

2019-12-11 Thread Renfro, Michael
Snapshot of a job_submit.lua we use to automatically to route jobs to a GPU partition if the user asks for a GPU: https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5 All our users just use srun or sbatch with a default queue, and the plugin handles it from there. There’s more de

[slurm-users] Upgraded Slurm 17.02 to 19.05, now GRPTRESRunMin limits are applied incorrectly

2019-12-16 Thread Renfro, Michael
Hey, folks. I’ve just upgraded from Slurm 17.02 (way behind schedule, I know) to 19.05. The only thing I’ve noticed going wrong is that my user resource limits aren’t being applied correctly. My typical user has a GrpTRESRunMin limit of cpu=1440000 (1000 CPU-days), and after the upgrade, it app

Re: [slurm-users] Upgraded Slurm 17.02 to 19.05, now GRPTRESRunMin limits are applied incorrectly

2019-12-16 Thread Renfro, Michael
tool prints nicely user limits from the Slurm database: > https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits > > Maybe this can give you further insights into the source of problems. > > /Ole > > On 16-12-2019 17:27, Renfro, Michael wrote: >> Hey,

Re: [slurm-users] Upgraded Slurm 17.02 to 19.05, now GRPTRESRunMin limits are applied incorrectly

2019-12-16 Thread Renfro, Michael
Resolved now. On older versions of Slurm, I could have queues without default times specified (just an upper limit, in my case). As of Slurm 18 or 19, I had to add a default time to all my queues to avoid the AssocGrpCPURunMinutesLimit flag. > On Dec 16, 2019, at 2:00 PM, Renfro, Mich

Re: [slurm-users] Partition question

2019-12-19 Thread Renfro, Michael
My current batch queues have a 30-day limit, and I’ll likely be reducing that to maybe 7 days for most users in the near future, as it will make priority and fairshare mechanisms more responsive (even if a high-priority job gets bumped to the top of the queue, it may still have to wait a few day

[slurm-users] Useful script: estimating how long until the next blocked job starts

2020-01-23 Thread Renfro, Michael
Hey, folks. Some of my users submit job after job with no recognition of our 1000 CPU-day TRES limit, and thus their later jobs get blocked with the reason AssocGrpCPURunMinutesLimit. I’ve written up a script [1] using Ole Holm Nielsen’s showuserlimits script [2] that will identify a user’s sm

Re: [slurm-users] Question about slurm source code and libraries

2020-01-24 Thread Renfro, Michael
The slurm-web project [1] has a REST API [2]. Never used it myself, just used the regular web frontend for viewing queue and node state. [1] https://edf-hpc.github.io/slurm-web/index.html [2] https://edf-hpc.github.io/slurm-web/api.html > On Jan 24, 2020, at 1:22 PM, Dean Schulze wrote: > > Ex

Re: [slurm-users] MaxJobs-limits

2020-01-28 Thread Renfro, Michael
For the first question: you should be able to define each node’s core count, hyperthreading, or other details in slurm.conf. That would allow Slurm to schedule (well-behaved) tasks to each node without anything getting overloaded. For the second question about jobs that aren’t well-behaved (a jo

Re: [slurm-users] Virtual memory size requested by slurm

2020-01-28 Thread Renfro, Michael
On this part, I don’t think that’s always the case. On a node with 384 GB (with 2 GB reserved for the OS), we’ve got several jobs running under mem=32000: = $ grep 'NodeName=gpunode\[00' /etc/slurm/slurm.conf NodeName=gpunode[001-003] CoresPerSocket=14 RealMemory=382000 Sockets=2 ThreadsPe

Re: [slurm-users] MaxJobs-limits

2020-01-29 Thread Renfro, Michael
…cgroups is the solution I suppose. > > On Tue, Jan 28, 2020 at 7:42 PM Renfro, Michael wrote: > For the first question: you should be able to define each node’s core count, > hyperthreading, or other details in slurm.conf. That would allow Slurm to > schedule (well-behaved) tas

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
Greetings, fellow general university resource administrator. Couple things come to mind from my experience: 1) does your serial partition share nodes with the other non-serial partitions? 2) what’s your maximum job time allowed, for serial (if the previous answer was “yes”) and non-serial parti

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
…“game” the system. The larger jobs at the > expense of the small fry for example, however that is a difficult decision > that means that someone has got to wait longer for results.. > > Best regards, > David > From: slurm-users on behalf of > Renfro, Michael > Sent

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
early > release of v18. > > Best regards, > David > > From: slurm-users on behalf of > Renfro, Michael > Sent: 31 January 2020 17:23:05 > To: Slurm User Community List > Subject: Re: [slurm-users] Longer queuing times for larger jobs > > I missed reading w

Re: [slurm-users] Limits to partitions for users groups

2020-02-05 Thread Renfro, Michael
If you want to rigidly define which 20 nodes are available to the one group of users, you could define a 20-node partition for them, and a 35-node partition for the priority group, and restrict access by Unix group membership: PartitionName=restricted Nodes=node0[01-20] AllowGroups=ALL Partition
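[Editor's note: the second PartitionName line is cut off above; a hedged completion, with the group name as an assumption.]
=====
PartitionName=restricted Nodes=node0[01-20] AllowGroups=ALL
PartitionName=priority   Nodes=node0[01-35] AllowGroups=prioritygrp
=====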

Re: [slurm-users] Using "Nodes" on script - file ????

2020-02-12 Thread Renfro, Michael
Hey, Matthias. I’m having to translate a bit, so if I get a meaning wrong, please correct me. You should be able to set the minimum and maximum number of nodes used for jobs on a per-partition basis, or to set a default for all partitions. My most commonly used partition has: PartitionName=b

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Renfro, Michael
If that 32 GB is main system RAM, and not GPU RAM, then yes. Since our GPU nodes are over-provisioned in terms of both RAM and CPU, we end up using the excess resources for non-GPU jobs. If that 32 GB is GPU RAM, then I have no experience with that, but I suspect MPS would be required. > On Fe

Re: [slurm-users] Problem with configuration CPU/GPU partitions

2020-02-28 Thread Renfro, Michael
When I made similar queues, and only wanted my GPU jobs to use up to 8 cores per GPU, I set Cores=0-7 and 8-15 for each of the two GPU devices in gres.conf. Have you tried reducing those values to Cores=0 and Cores=20? > On Feb 27, 2020, at 9:51 PM, Pavel Vashchenkov wrote: > > External Email

Re: [slurm-users] Should there be a different gres.conf for each node?

2020-03-05 Thread Renfro, Michael
We have a shared gres.conf that includes node names, which should have the flexibility to specify node-specific settings for GPUs: = NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia0 COREs=0-7 NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia1 COREs=8-15 = See the th

Re: [slurm-users] Issue with "hetjob" directive with heterogeneous job submission script

2020-03-05 Thread Renfro, Michael
I’m going to guess the job directive changed between earlier releases and 20.02. A version of the page from last year [1] has no mention of hetjob, and uses packjob instead. On a related note, is there a canonical location for older versions of Slurm documentation? My local man pages are alway

Re: [slurm-users] Upgrade paths

2020-03-11 Thread Renfro, Michael
The release notes at https://slurm.schedmd.com/archive/slurm-19.05.5/news.html indicate you can upgrade from 17.11 or 18.08 to 19.05. I didn’t find equivalent release notes for 17.11.7, but upgrades over one major release should work. > On Mar 11, 2020, at 2:01 PM, Will Dennis wrote: > > Exter

Re: [slurm-users] Limit Number of Jobs per User in Queue?

2020-03-18 Thread Renfro, Michael
In addition to Sean’s recommendation, your user might want to use job arrays [1]. That’s less stress on the scheduler, and throughput should be equivalent to independent jobs. [1] https://slurm.schedmd.com/job_array.html -- Mike Renfro, PhD / HPC Systems Administrator, Information Technology S

Re: [slurm-users] Can slurm be configured to only run one job at a time?

2020-03-23 Thread Renfro, Michael
Rather than configure it to only run one job at a time, you can use job dependencies to make sure only one job of a particular type at a time. A singleton dependency [1, 2] should work for this. From [1]: #SBATCH --dependency=singleton --job-name=big-youtube-upload in any job script would ens

Re: [slurm-users] Running an MPI job across two partitions

2020-03-23 Thread Renfro, Michael
Others might have more ideas, but anything I can think of would require a lot of manual steps to avoid mutual interference with jobs in the other partitions (allocating resources for a dummy job in the other partition, modifying the MPI host list to include nodes in the other partition, etc.).

Re: [slurm-users] Job are pending when plenty of resources available

2020-03-30 Thread Renfro, Michael
All of this is subject to scheduler configuration, but: what has job 409978 requested, in terms of resources and time? It looks like it's the highest priority pending job in the interactive partition, and I’d expect the interactive partition has a higher priority than the regress partition. As

Re: [slurm-users] Need to calculate total runtime/walltime for one year

2020-04-11 Thread Renfro, Michael
Unless I’m misreading it, you have a wall time limit of 2 days, and jobs that use up to 32 CPUs. So a total CPU time of up to 64 CPU-days would be possible for a single job. So if you want total wall time for jobs instead of CPU time, then you’ll want to use the Elapsed attribute, not CPUTime.

Re: [slurm-users] [EXTERNAL] Follow-up-slurm-users Digest, Vol 30, Issue 32

2020-04-17 Thread Renfro, Michael
Can’t speak for everyone, but I went to Slurm 19.05 some months back, and haven't had any problems with CUDA 10.0 or 10.1 (or 8.0, 9.0, or 9.1). > On Apr 17, 2020, at 8:46 AM, Lisa Kay Weihl wrote: > > External Email Warning > > This email originated from outside the university. Please use cau

Re: [slurm-users] One node is not used by slurm

2020-04-19 Thread Renfro, Michael
Someone else might see more than I do, but from what you’ve posted, it’s clear that compute-0-0 will be used only after other lower-weighted nodes are too full to accept a particular job. I assume you’ve already submitted a set of jobs requesting enough resources to fill up all the nodes, and t

Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Renfro, Michael
That’s a *really* old version, but https://slurm.schedmd.com/archive/slurm-15.08.13/sbatch.html indicates there’s an exclusive flag you can set. On Apr 29, 2020, at 1:54 PM, Rutger Vos wrote: Hi, for a smallish machine that has been having degraded performance we want to implement a pol

Re: [slurm-users] one job at a time - how to set?

2020-04-30 Thread Renfro, Michael
…I’d have to specify this when submitting, right? I.e. 'sbatch > --exclusive myjob.sh', if I understand correctly. Would there be a way to > simply enforce this, i.e. at the slurm.conf level or something? > > Thanks again! > > Rutger > > On Wed, Apr 29, 2020 at

Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-04 Thread Renfro, Michael
Assuming you need a scheduler for whatever size your user population is: do they need literal JupyterHub, or would they all be satisfied running regular Jupyter notebooks? On May 4, 2020, at 7:25 PM, Lisa Kay Weihl wrote:  External Email Warning This email originated from outside the univer

Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
Have you seen https://slurm.schedmd.com/licenses.html already? If the software is just for use inside the cluster, one Licenses= line in slurm.conf plus users submitting with the -L flag should suffice. Should be able to set that license value to 4 if it’s licensed per node and you can run up to
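[Editor's note: a sketch of the license mechanics described; the license name is a placeholder.]
=====
# slurm.conf: four cluster-wide licenses for the package
Licenses=mysoft:4

# Submission: each job holds one license until it ends
sbatch -L mysoft:1 job.sh
=====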

Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
…automatically updated the value based on usage? > > > Regards > Navin. > > > On Tue, May 5, 2020 at 7:00 PM Renfro, Michael wrote: > Have you seen https://slurm.schedmd.com/licenses.html already? If the > software is just for use inside the cluster, one Licenses= line in s

Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Renfro, Michael
Aside from any Slurm configuration, I’d recommend setting up a modules [1 or 2] folder structure for CUDA and other third-party software. That handles LD_LIBRARY_PATH and other similar variables, reduces the chances for library conflicts, and lets users decide their environment on a per-job basi

Re: [slurm-users] how to restrict jobs

2020-05-06 Thread Renfro, Michael
specific > nodes? > i do not want to create a separate partition. > > is there any way to achieve this by any other method? > > Regards > Navin. > > > Regards > Navin. > > On Tue, May 5, 2020 at 7:46 PM Renfro, Michael wrote: > Haven’t done it yet

Re: [slurm-users] how to restrict jobs

2020-05-06 Thread Renfro, Michael
in this case] > > Regards > Navin. > > > On Wed, May 6, 2020 at 7:47 PM Renfro, Michael wrote: > To make sure I’m reading this correctly, you have a software license that > lets you run jobs on up to 4 nodes at once, regardless of how many CPUs you > use? That is, y

Re: [slurm-users] Defining a default --nodes=1

2020-05-08 Thread Renfro, Michael
There are MinNodes and MaxNodes settings that can be defined for each partition listed in slurm.conf [1]. Set both to 1 and you should end up with the non-MPI partitions you want. [1] https://slurm.schedmd.com/slurm.conf.html From: slurm-users on behalf of Ho
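[Editor's note: for example — partition and node names are assumptions.]
=====
# slurm.conf: jobs in this partition always get exactly one node
PartitionName=serial Nodes=node[001-040] MinNodes=1 MaxNodes=1
=====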

[slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Hey, folks. I've had a 1000 CPU-day (1,440,000 CPU-minute) GrpTRESMins limit applied to each user for years. It generally works as intended, but I've noticed one user whose usage is highly inflated from reality, causing the GrpTRESMins limit to be enforced much earlier than necessary: squ

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
…The user's limits are printed in detail by showuserlimits. These tools are available from https://github.com/OleHolmNielsen/Slurm_tools /Ole On 08-05-2020 15:34, Renfro, Michael wrote: > Hey, folks. I've had a 1000 CPU-day (1,440,000 CPU-minute) GrpTRESMins > limit applied to each

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
…jobs that already completed, but still get counted against the user's current requests. From: Ole Holm Nielsen Sent: Friday, May 8, 2020 9:27 AM To: slurm-users@lists.schedmd.com Cc: Renfro, Michael Subject: Re: [slurm-users] scontrol show assoc_mgr showing m

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
f,to,pr" > # Get Slurm individual job accounting records using the "sacct" command > sacct $partitionselect -n -X -a -S $start_time -E $end_time -o $FORMAT > -s $STATE > > There are numerous output fields which you can inquire, see "sacct -e". > > /Ole >

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-09 Thread Renfro, Michael
restart. Thanks. > On May 8, 2020, at 11:47 AM, Renfro, Michael wrote: > > Working on something like that now. From an SQL export, I see 16 jobs from > my user that have a state of 7. Both states 3 and 7 show up as COMPLETED in > sacct, and may also have some duplicate job en

Re: [slurm-users] Ubuntu Cluster with Slurm

2020-05-13 Thread Renfro, Michael
I’d compare the RealMemory part of ‘scontrol show node abhi-HP-EliteBook-840-G2’ to the RealMemory part of your slurm.conf: > Nodes which register to the system with less than the configured resources > (e.g. too little memory), will be placed in the "DOWN" state to avoid > scheduling jobs on t

Re: [slurm-users] Slurm Job Count Credit system

2020-06-01 Thread Renfro, Michael
Even without the slurm-bank system, you can enforce a limit on resources with a QOS applied to those users. Something like: = sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit sacctmgr modify qos bank1 set grptresmins=cpu=1000 sacctmgr add account bank1 sacctmgr modify account name=bank1 set
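[Editor's note: the last sacctmgr command is truncated above; a guess at how the sequence likely continues, with the user name as a placeholder.]
=====
sacctmgr modify account name=bank1 set qos=bank1
sacctmgr add user someuser account=bank1
=====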

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
That’s close to what we’re doing, but without dedicated nodes. We have three back-end partitions (interactive, any-interactive, and gpu-interactive), but the users typically don’t have to consider that, due to our job_submit.lua plugin. All three partitions have a default of 2 hours, 1 core, 2

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
node with oversubscribe should be sufficient. > If you can't spare a single node then a VM would do the job. > > -Paul Edmon- > > On 6/11/2020 9:28 AM, Renfro, Michael wrote: >> That’s close to what we’re doing, but without dedicated nodes. We have three >> back-

Re: [slurm-users] Fairshare per-partition?

2020-06-12 Thread Renfro, Michael
I think that’s correct. From notes I’ve got for how we want to handle our fairshare in the future: Setting up a funded account (which can be assigned a fairshare): sacctmgr add account member1 Description="Member1 Description" FairShare=N Adding/removing a user to/from the funded accoun

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-13 Thread Renfro, Michael
Will probably need more information to find a solution. To start, do you have separate partitions for GPU and non-GPU jobs? Do you have nodes without GPUs? On Jun 13, 2020, at 12:28 AM, navin srivastava wrote: Hi All, In our environment we have GPU. so what i found is if the user having high

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-15 Thread Renfro, Michael
…Regards Navin On Sat, Jun 13, 2020, 20:37 Renfro, Michael mailto:ren...@tntech.edu>> wrote: Will probably need more information to find a solution. To start, do you have separate partitions for GPU and non-GPU jobs? Do you have nodes without GPUs? On Jun 13, 2020, at 12:28 AM, navin srivas

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-16 Thread Renfro, Michael
Not trying to argue unnecessarily, but what you describe is not a universal rule, regardless of QOS. Our GPU nodes are members of 3 GPU-related partitions, 2 more resource-limited non-GPU partitions, and one of two larger-memory partitions. It’s set up this way to minimize idle resources (due t

Re: [slurm-users] runtime priority

2020-06-30 Thread Renfro, Michael
There’s a --nice flag to sbatch and srun, at least. Documentation indicates it decreases priority by 100 by default. And untested, but it may be possible to use a job_submit.lua [1] to adjust nice values automatically. At least I can see a nice property in [2], which I assume means it'd be acce

Re: [slurm-users] CPU allocation for the GPU jobs.

2020-07-13 Thread Renfro, Michael
“The SchedulerType configuration parameter specifies the scheduler plugin to use. Options are sched/backfill, which performs backfill scheduling, and sched/builtin, which attempts to schedule jobs in a strict priority order within each partition/queue.” https://slurm.schedmd.com/sched_config.ht

Re: [slurm-users] slurm array with non-numeric index values

2020-07-15 Thread Renfro, Michael
If the 500 parameters happened to be filenames, you could adapt something like this (appropriated from somewhere else, but I can’t find the reference quickly): = #!/bin/bash # get count of files in this directory NUMFILES=$(ls -1 *.inp | wc -l) # subtract 1 as we have to use zero-based indexing (first e
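[Editor's note: a self-contained sketch along those lines; the solver name is a placeholder, and the array size is computed at submission time since #SBATCH lines can't run shell code.]
=====
#!/bin/bash
# Submit with:  sbatch --array=0-$(( $(ls -1 *.inp | wc -l) - 1 )) thisscript.sh
# Map the numeric array index back to a filename:
FILES=(*.inp)
INPUT=${FILES[$SLURM_ARRAY_TASK_ID]}
./my_solver "$INPUT"   # placeholder program
=====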

Re: [slurm-users] Internet connection loss with srun to a node

2020-08-02 Thread Renfro, Michael
Probably unrelated to slurm entirely, and most likely has to do with lower-level network diagnostics. I can guarantee that it’s possible to access Internet resources from a compute node. Notes and things to check: 1. Both ping and http/https are IP protocols, but are very different (ping isn’t

Re: [slurm-users] Correct way to give srun and sbatch different MaxTime values?

2020-08-04 Thread Renfro, Michael
Untested, but you should be able to use a job_submit.lua file to detect if the job was started with srun or sbatch: * Check with (job_desc.script == nil or job_desc.script == '') * Adjust job_desc.time_limit accordingly Here, I just gave people a shell function "hpcshell", which automati

Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Renfro, Michael
I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or COREs= settings. Currently, they’re: NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15 and I’ve got 2 jobs currently running on each node that’s available. So maybe: NodeName=c0005

Re: [slurm-users] scheduling issue

2020-08-14 Thread Renfro, Michael
We’ve run a similar setup since I moved to Slurm 3 years ago, with no issues. Could you share partition definitions from your slurm.conf? When you see a bunch of jobs pending, which ones have a reason of “Resources”? Those should be the next ones to run, and ones with a reason of “Priority” are

Re: [slurm-users] Adding Users to Slurm's Database

2020-08-18 Thread Renfro, Michael
The PowerShell script I use to provision new users adds them to an Active Directory group for HPC, ssh-es to the management node to do the sacctmgr changes, and emails the user. Never had it fail, and I've looped over entire class sections in PowerShell. Granted, there are some inherent delays d

Re: [slurm-users] Jobs getting StartTime 3 days in the future?

2020-08-31 Thread Renfro, Michael
One pending job in this partition should have a reason of “Resources”. That job has the highest priority, and if your job below would delay the highest-priority job’s start, it’ll get pushed back like you see here. On Aug 31, 2020, at 12:13 PM, Holtgrewe, Manuel wrote: Dear all, I'm seeing s

Re: [slurm-users] Question/Clarification: Batch array multiple tasks on nodes

2020-09-01 Thread Renfro, Michael
We set DefMemPerCPU in each partition to approximately the amount of RAM in a node divided by the number of cores in the node. For heterogeneous partitions, we use a lower limit, and we always reserve a bit of RAM for the OS, too. So for a 64 GB node with 28 cores, we default to 2000 M per CPU,
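[Editor's note: the arithmetic, concretely; partition and node names are assumptions.]
=====
# 64 GB node, 28 cores: keep some RAM back for the OS, then
# 56000 MB / 28 cores = 2000 MB per CPU.
PartitionName=batch Nodes=node[001-040] DefMemPerCPU=2000
=====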
