Re: [slurm-users] Slurm Configless error

2023-08-29 Thread Paul Brunk
useful complaint in one of those, whatever the cause. -- Paul Brunk, system administrator Advanced Computing Resource Center Enterprise IT Svcs, the University of Georgia

Re: [slurm-users] OpenMPI and Slurm clarification?

2023-04-27 Thread Paul Brunk
related libraries. These are a fantastic resource! -- Paul Brunk, system administrator Advanced Computing Resource Center Enterprise IT Svcs, the University of Georgia

[slurm-users] slurmd-used libs in an NFS share?

2023-03-23 Thread Paul Brunk
starting a Slurm cluster" walkthrough threads online lately, but haven't seen this particular thing addressed. I'm aware it might be a non-issue. -- Paul Brunk, system administrator Advanced Computing Resource Center Enterprise IT Svcs, the University of Georgia

Re: [slurm-users] Restarting jobs

2022-08-19 Thread Paul Brunk
requeued unless explicitly enabled by the user. Use the sbatch --no-requeue or --requeue option to change the default behavior for individual jobs. The default value is 1. -- Paul Brunk, system administrator Advanced Computing Resource Center Enterprise IT Svcs, the University of Georgia
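A minimal sketch of the two knobs referenced there, assuming a script named job.sh (the slurm.conf value shown is the documented default):

    # slurm.conf: cluster-wide default (1 = batch jobs may be requeued)
    JobRequeue=1

    # per-job overrides at submission time
    sbatch --requeue job.sh      # allow this job to be requeued
    sbatch --no-requeue job.sh   # never requeue this job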

Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-14 Thread Paul Brunk
Hi: Thanks for your feedback guys :). We continue to find srun behaving properly re: core placement. BTW, we've further established that only MVAPICH (and therefore also Intel MPI) jobs are encountering the OOM issue. == Paul Brunk, system administrator Georgia Advanced Resource Computing Center

Re: [slurm-users] Possible to have a node in two partitions with N cores in one partition and M cores in the other?

2022-02-11 Thread Paul Brunk
You could make bigger_qos and smaller_qos, and define those as 'QOS' in the matching PartitionName entries. Then add whatever ACL or limits you want to those QOSes. Or use the PartitionName entries if the available options suffice. -- Paul Brunk, system administrator Georgia Advanced Resource Computing Center
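Roughly what that partition-QOS arrangement could look like, with invented node ranges, QOS names, and core counts:

    # slurm.conf: the same nodes exposed through two partitions, each tied to a QOS
    PartitionName=bigger  Nodes=node[01-04] QOS=bigger_qos  State=UP
    PartitionName=smaller Nodes=node[01-04] QOS=smaller_qos State=UP

    # create the QOSes and cap the cores each partition's jobs may hold in aggregate
    sacctmgr add qos bigger_qos
    sacctmgr add qos smaller_qos
    sacctmgr modify qos bigger_qos  set GrpTRES=cpu=96
    sacctmgr modify qos smaller_qos set GrpTRES=cpu=32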

[slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Paul Brunk
Libraries: intel/2019b Observations: - Works correctly when using: 1 node x 64 cores (64 MPI processes), 1x128 (128 MPI processes) (other QE parameters -nk 1 -nt 4, mem-per-cpu=1500mb) - A few processes get OOM killed after a while when using: 4 nodes x 32 cores (128 MPI processes), 4 nodes x

Re: [slurm-users] slurmctld/slurmdbd filesystem/usermap requirements

2022-02-10 Thread Paul Brunk
them to slurmds at dispatch time, which store them on each node in the slurm.conf 'SlurmdSpoolDir', as Steffen noted. All this to say that the slurmctld host doesn't need to see the users' home dirs and/or job script dirs. == Paul Brunk, system administrator Georgia Advanced Resource Computing Center
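For reference, the two directories in play look like this in slurm.conf (paths shown are the common defaults, not necessarily this site's):

    SlurmdSpoolDir=/var/spool/slurmd        # per node; holds the job scripts dispatched by slurmctld
    StateSaveLocation=/var/spool/slurmctld  # slurmctld host only; must be writable by SlurmUser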

Re: [slurm-users] ActiveFeatures job submission

2022-02-09 Thread Paul Brunk
And no submissions would be rejected based on filesystem availability (since the license stuff can't affect job submission, only dispatch). I'm sure there could be other solutions. I've not thought further on this since I've been happily using NHC for a long time. == Paul Brunk
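For anyone taking the NHC route, a minimal sketch (mount point and interval are made-up examples; consult the slurm.conf man page and the NHC docs for the exact syntax):

    # slurm.conf: run LBNL NHC on the nodes periodically
    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300
    HealthCheckNodeState=ANY

    # /etc/nhc/nhc.conf: drain the node if a filesystem isn't mounted read-write
    * || check_fs_mount_rw /scratch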

Re: [slurm-users] Creating groups of nodes with exclusive access to a resources within a partition.

2022-02-09 Thread Paul Brunk
f of them in lua filter, or adding them there) might help too. -- Paul Brunk, system administrator Georgia Advanced Resource Computing Center Enterprise IT Svcs, the University of Georgia

Re: [slurm-users] What is the 'Root/Cluster association' level in Resource Limits document mean?

2022-02-09 Thread Paul Brunk
Hi: You can use e.g. 'sacctmgr show -s users', and you'll see each user's cluster association as one of the output columns. If the name were 'yourcluster', then you could do: sacctmgr modify cluster name=yourcluster set grpTres="node=8". == Paul Brunk
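A quick way to confirm where the limit landed, with illustrative format fields:

    sacctmgr show -s users format=User,Cluster,Account,GrpTRES
    sacctmgr show assoc where cluster=yourcluster format=Cluster,Account,User,GrpTRES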

Re: [slurm-users] JobComp file not rotating

2022-02-09 Thread Paul Brunk
can't infer from the log file names (date stamps) which completed job log a given day's jobs will appear in. == Paul Brunk, system administrator Georgia Advanced Resource Computing Center Enterprise IT Svcs, the University of Georgia

Re: [slurm-users] Add new compute node without interruption

2021-12-13 Thread Paul Brunk
Hi: Normally, adding a new node requires altering slurm.conf, restarting slurmctld, and restarting slurmd on each node. Restarting these daemons should not harm jobs and can be done while existing jobs are running. Wishing that I'd just listened this time, Paul Brunk, system administrator
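A sketch of that sequence; node names are placeholders, and 'clush' is just one convenient way to fan out the slurmd restarts:

    # on the controller, after adding the new NodeName/PartitionName lines to slurm.conf
    systemctl restart slurmctld

    # distribute the updated slurm.conf to the nodes (unless running configless), then
    clush -w node[01-64] systemctl restart slurmd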

Re: [slurm-users] AcctGatherProfileType

2021-11-22 Thread Paul Brunk
ment on space consumption. Good luck! -- Wishing that I'd just listened this time, Paul Brunk, system administrator Georgia Advanced Computing Resource Center UGA EITS (formerly UCNS)
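For context, a rough sketch of the usual HDF5 profiling setup (the directory is a placeholder; on-disk growth scales with how much --profile detail jobs request, which is presumably the space consumption being discussed):

    # slurm.conf
    AcctGatherProfileType=acct_gather_profile/hdf5

    # acct_gather.conf
    ProfileHDF5Dir=/var/spool/slurm/profile

    # per job, opt in to task-level sampling
    sbatch --profile=task job.sh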

Re: [slurm-users] enable_configless, srun and DNS vs. hosts file

2021-11-12 Thread Paul Brunk
arted everywhere). Could this be what you're seeing (as opposed to /etc/hosts vs DNS)? -- Wishing that I'd just listened this time, Paul Brunk, system administrator, Workstation Support Group GACRC (formerly RCC) UGA EITS (formerly UCNS)
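For reference, the configless pieces involved (controller name and port are placeholders; clients can also locate the controller through a DNS SRV record instead of --conf-server):

    # slurm.conf on the controller
    SlurmctldParameters=enable_configless

    # on each node, point slurmd at the controller instead of a local slurm.conf
    slurmd --conf-server ctl-host:6817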

Re: [slurm-users] "Low RealMem" after upgrade

2021-10-01 Thread Paul Brunk
Hi: If you mean "why are the nodes still Drained, now that I fixed the slurm.conf and restarted (never mind whether the RealMem parameter is correct)?", try 'scontrol update nodename=str957-bl0-0[1-2] State=RESUME'. -- Paul Brunk, system administrator Georgia Advanced Computing Resource Center
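A small check-then-resume sketch using the node names from the thread:

    sinfo -R                                      # drained/downed nodes and the recorded reason
    scontrol show node str957-bl0-01 | grep -i reason
    scontrol update nodename=str957-bl0-0[1-2] State=RESUME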

Re: [slurm-users] job stuck as pending - reason "PartitionConfig"

2021-09-29 Thread Paul Brunk
Hello Byron: I'm guessing that your job is asking for more HW than the highmem_p has in it, or more cores or RAM within a node than any of the nodes have, or something like that. 'scontrol show job 10860160' might help. You can also look in slurmctld.log for that jobid. -- Paul Brunk
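One way to line the job request up against the partition, using the jobid and partition name from the thread (the grep patterns are just a convenience):

    scontrol show job 10860160 | grep -E 'Partition|NumNodes|NumCPUs|TRES'
    scontrol show partition highmem_p | grep -E 'Nodes|MaxNodes|MaxMemPerNode|TRES'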

[slurm-users] changing JobAcctGatherType w/running jobs

2021-09-07 Thread Paul Brunk
being inaccurate, and don't yet have e.g. a MaxTresPerX with some RAM value). With our 'cgroup' ProcTrackType, and requiring a mem spec on all jobs, I think we don't need to worry if a given slurmd is sending slurmctld wrong or incomprehensible information about a given
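The combination described there looks roughly like this in slurm.conf (the memory default is an invented value; the point is that every job ends up with a memory spec):

    ProctrackType=proctrack/cgroup
    JobAcctGatherType=jobacct_gather/cgroup
    DefMemPerCPU=2048    # MB; gives jobs a memory request even if they omit one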

[slurm-users] SLUG '21

2021-07-01 Thread Paul Brunk
Hi: It's that time again...we're doing travel budget planning. Do we have a sense of whether or how there will be a user group meeting this year? I saw the April poll. Thanks! -- Grinning like an idiot, Paul Brunk, system administrator Georgia Advanced Computing Resource Center

Re: [slurm-users] [External] Different max number of jobs in individual and array jobs

2021-06-17 Thread Paul Brunk
job_submit lua to add a request for a license of the relevant type to each submission? -- Flailing wildly at the keyboard, Paul Brunk, system administrator Georgia Advanced Computing Resource Center Enterprise IT Svcs, the University of Georgia
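The license half of that idea, sketched with an invented license name (a job_submit filter would simply inject the same -L request automatically):

    # slurm.conf: a local, countable license used purely as a throttle
    Licenses=arrayslots:100

    # what the filter would effectively add to every submission
    sbatch -L arrayslots:1 job.sh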

Re: [slurm-users] Specify a gpu ID

2021-06-03 Thread Paul Brunk
Hi: I've not tried to do that. But the below discussion might help: https://bugs.schedmd.com/show_bug.cgi?id=2626
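Not a way to pin a particular ID, but a quick check of which device Slurm actually assigned (assumes a gpu gres and the NVIDIA tools are present):

    srun --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'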

Re: [slurm-users] [External] Re: exempting a node from Gres Autodetect

2021-03-04 Thread Paul Brunk
feature on a single node, > where it looks like that node isn't using RPMs with NVML support. Indeed, this was a PEBCAK problem--I was not heeding the classic "read the right fine version of the fine manual" (RTRFVOTFM?) advice. Thanks all for your replies. -- Jesting grimly, Paul Brunk

[slurm-users] exempting a node from Gres Autodetect

2021-02-19 Thread Paul Brunk
ng/reading /var/lib/slurmd/conf-cache/gres.conf Reverting to the original, one-line gres.conf returned the cluster to production state. -- Paul Brunk, system administrator Georgia Advanced Computing Resource Center Enterprise IT Svcs, the University of Georgia
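A sketch of the exemption ultimately wanted here; per-node AutoDetect overrides need a sufficiently recent Slurm, and the node and device names are placeholders:

    # gres.conf
    AutoDetect=nvml                                               # cluster-wide default
    NodeName=oddnode01 AutoDetect=off Name=gpu File=/dev/nvidia0  # this node described explicitly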

[slurm-users] floating condo partition, no pre-emption, guarantee a max pend time?

2020-04-22 Thread Paul Brunk
and also the management of the reservation's node membership. I don't assume that a good answer resembles that at all. Thanks for any insights! -- Paul Brunk, system administrator Georgia Advanced Computing Resource Center (GACRC) Enterprise IT Svcs, the University of Georgia
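One shape such a 'floating' guarantee sometimes takes, offered only as a hedged sketch (account, count, and name are invented; FLAGS=REPLACE keeps backfilling the reservation with idle nodes as its members get allocated):

    scontrol create reservation ReservationName=condo_float \
        Accounts=condo_acct NodeCnt=4 StartTime=now Duration=infinite Flags=REPLACE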