“sinfo” can expand compressed hostnames too:
$ sinfo -n lm602-[08,10] -O NodeHost -h
lm602-08
lm602-10
$
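As an aside, "scontrol" can do the same expansion without querying node information at all, e.g.:
$ scontrol show hostnames lm602-[08,10]
lm602-08
lm602-10
$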
-Greg
From: slurm-users on behalf of Alain O' Miniussi
Date: Thursday, 17 August 2023 at 4:53 pm
To: Slurm User Community List
Subject: [EXTERNAL] Re: [slurm-users] extended list of n
Yup – Slurm is specifically tied to MySQL/MariaDB.
To get around this I wrote a C++ application that will extract job records
from Slurm using “sacct” and write them into a PostgreSQL database.
https://gitlab.com/greg.wickham/sminer
The schema used in PostgreSQL is more conduci
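sminer does the heavy lifting, but the general idea can be sketched with stock tools - the database name "slurm_mirror", the table "jobs_raw" and the date range below are only placeholders:
$ sacct -a -X -n -P -S 2023-08-01 -E 2023-08-31 \
    --format=jobid,user,account,partition,state,elapsed,alloctres > jobs.psv
$ psql slurm_mirror -c "\copy jobs_raw from 'jobs.psv' with (format csv, delimiter '|')"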
Following on from what Michael said, the default Slurm configuration is to
allocate only one job per node. If GRES a100_1g.10gb is on the same node, make
sure “SelectType=select/cons_res” is enabled (info at
https://slurm.schedmd.com/cons_res.html) to permit multiple jobs to use the
same node.
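A minimal slurm.conf sketch along those lines - the SelectTypeParameters value is just one common choice, not necessarily the right one for your site:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu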
Also
entation on those. Are you just creating those files and then including
them in slurm.conf?
Rob
From: slurm-users on behalf of Greg Wickham
Sent: Wednesday, January 18, 2023 1:38 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Maintaining slur
Hi Rob,
Slurm doesn’t have a “validate” parameter, so one must know ahead of time
whether the configuration will work or not.
In answer to your question – yes – on our site the Slurm configuration is
altered outside of a maintenance window.
Depending upon the potential impact of the change,
rmdbd? Not sure.
I have intentionally run slurmdbd + MariaDB on the second node because I
didn't want to overload the primary slurmctld.
I hope you all are getting a picture of how my setup looks.
Thanks,
RC
On 11/1/2022 10:40 AM, Greg Wickham wrote:
Hi Richard,
Slurmctld caches th
Hi Richard,
Slurmctld caches the updates until slurmdbd comes back online.
You can see how many records are pending for the database by using the “sdiag”
command and looking for “DBD Agent queue size”.
If this number grows significantly it means that slurmdbd isn’t available.
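For example (a quick sketch - the label text may vary slightly between Slurm versions):
$ sdiag | grep -i "dbd agent"
DBD Agent queue size: 0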
-Greg
On 01/1
Hi Richard,
We have just over 400 nodes and the StateSaveLocation directory has ~600MB of
data.
The share for SlurmdSpoolDir is about 17GB used across the nodes, but this also
includes logs for each node (without log files it’s < 1GB).
-Greg
On 24/10/2022, 07:19, "slurm-users" wrote:
Hi
Hi Purvesh,
With some caveats, you can do:
$ sacct -N <nodelist> -X -S <starttime> -E <endtime> -P --format=jobid,alloctres
And then post-process the results with a scripting language.
The caveats? . . The -X above returns the job allocation, which in your
case appears to be everything you need. However for a job or
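For instance, a throwaway awk one-liner to pull the GPU count out of AllocTRES (the dates below are just an example):
$ sacct -n -X -P -S 2023-08-01 -E 2023-08-31 --format=jobid,alloctres | \
    awk -F'|' '{ n=0; if (match($2, /gres\/gpu=[0-9]+/)) n=substr($2, RSTART+9, RLENGTH-9); print $1, n }'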
Hi John, Mark,
We use a spank plugin
https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir (this was derived
from other authors but modified for functionality required on site).
It can bind tmpfs mount points to the user's cgroup allocation; additionally,
bind options can be provided (ie: l
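For reference, SPANK plugins are enabled via plugstack.conf with a line of the form "required <plugin.so> [args]" - the path and arguments below are illustrative only, the plugin's README has the real ones:
required /usr/lib64/slurm/private-tmpdir.so base=/local/scratch mount=/tmp mount=/var/tmp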
If it’s possible to see other GPUs within a job, then that means cgroups
aren’t being used.
Look at Slurm's cgroup documentation
(https://slurm.schedmd.com/cgroup.conf.html).
With cgroups activated, an `nvidia-smi` will only show the GPU allocated to the
job.
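The settings involved, as a rough sketch (standard parameter names, but treat the combination as a starting point rather than a drop-in config):
cgroup.conf:
ConstrainDevices=yes
slurm.conf:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup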
-greg
From: slurm-user
Hi Chris,
You mentioned “But trials using this do not seem to be fruitful so far.” . .
why?
In our job_submit.lua there is:
if job_desc.shared == 0 then
    slurm.user_msg("exclusive access is not permitted with GPU jobs.")
    slurm.user_msg("Remove '--exclusive' from your job submission.")
    return slurm.ERROR
end
As others have commented, some information is lost when it is stored in the
database.
To keep historically accurate data on the job, run a script (refer to
PrologSlurmctld in slurm.conf) that runs an "scontrol show -d job <jobid>" and
drops the output into a local file.
Using " PrologSlurmctld" is neat, as it
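A minimal sketch of such a PrologSlurmctld script - the output path is only an example, and SLURM_JOB_ID is available in the prolog's environment:
#!/bin/bash
# Save a point-in-time copy of the full job record before the job starts.
scontrol show -d job "$SLURM_JOB_ID" > "/var/spool/slurm/job_records/${SLURM_JOB_ID}.txt"
exit 0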
Hi Diego,
Disclaimer: A little bit of shameless self-promotion.
We're using an application I wrote to inject slurm accounting records into a
PostgreSQL database. The
data is extracted from Slurm using "sacct".
From there it's possible to use SQL queries to mine the raw slurm data.
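For example, once the records are loaded (the database and table names here are placeholders):
$ psql slurm_mirror -c "select account, count(*) from jobs_raw group by account order by 2 desc"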
https://
Hi Erik,
We use a private fork of https://github.com/hpc2n/spank-private-tmp
It has worked quite well for us - jobs (or steps) don’t share a /tmp and during
the prolog all files created for the job/step are deleted.
Users absolutely cannot see each others temporary files so there’s no issue
ev
Something to try . .
If you restart “slurmctld” does the new QOS apply?
We had a situation where slurmdbd was running as a different user than
slurmctld and hence sacctmgr changes weren’t being reflected in slurmctld.
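A quick way to rule that out (paths are the typical defaults):
$ scontrol show config | grep -i slurmuser
$ grep -i slurmuser /etc/slurm/slurmdbd.conf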
-greg
On 27 Apr 2020, at 12:57, Simon Andrews
mailto:simon.andr...@babr
-GPU nodes and a plethora of 1 GPU jobs - during heavy use the user may not have access to the GPU they require).
Has anyone any experience with changing GPU permissions during prolog /
epilogue?
thanks,
-greg
--
Dr. Greg Wickham
Advanced Computing Infrastructure Team Lead
Advanced Computing
hael Di Domenico
mailto:mdidomeni...@gmail.com>> wrote:
I've seen the same error, I don't think it's you. But I don't know
what the cause is either; I didn't have time to look into it, so I
backed up to pmix 2.2.1 which seems to work fine.
On Tue, Jan 22, 2019 at 1
Hi All,
I’m trying to build PMIx 3.1.1 against Slurm 18.08.4; however, in the Slurm
PMIx plugin I get a fatal error:
pmixp_client.c:147:28: error: ‘flag’ undeclared (first use in this
function)
PMIX_VAL_SET(&kvp->value, flag, 0);
Is there something wrong with my build environme
de in the prod partition to drain without affecting the
> node status in the maint partition. I don't believe I can do this
> though. I believe i have to change the slurm.conf and reconfigure to
> add/remove nodes from one partition or the other
>
> if anyone has a better solut
there a recommended “kernel overhead” memory (either % or absolute value)
that we should deduct from the total physical memory?
thanks,
-greg
--
Dr. Greg Wickham
Advanced Computing Infrastructure Team Lead
Advanced Computing Core Laboratory
King Abdullah University of Science and Technology