[slurm-users] Potential Side Effects of larger MessageTimeout value

2024-01-18 Thread Herc Silverstein
Hi, What are potential bad side effects of using a large/larger MessageTimeout? And is there a value at which this setting is too large (long)? Thanks, Herc
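
For context, MessageTimeout is the slurm.conf round-trip RPC timeout (default 10 seconds). A minimal sketch of raising it, with the value chosen here purely illustrative:

    # slurm.conf (fragment)
    # MessageTimeout is the time, in seconds, allowed for a round-trip
    # slurmctld <-> slurmd communication; the default is 10.
    MessageTimeout=60

The usual trade-off is that a longer timeout tolerates slow networks or briefly unresponsive daemons, but failures (dead nodes, hung slurmds) also take correspondingly longer to be noticed.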

[slurm-users] Job fails while running with Reason AssocMaxJobsLimit

2023-05-31 Thread Herc Silverstein
Hi, We have a job that ran for 8 seconds, then failed with the Reason showing as AssocMaxJobsLimit. In our case we have MaxJobs for each user set to 5000.  My understanding was that if the user submitted > 5000 jobs, slurm would only run 5000.  The other jobs would just wait. If that's corre
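
For reference, MaxJobs on an association caps the number of concurrently running jobs; additional jobs are expected to stay pending rather than fail. A minimal sketch of setting and inspecting the limit, with the user and account names as placeholders:

    # Cap the number of concurrently running jobs on a user's association
    sacctmgr modify user where name=alice account=research set MaxJobs=5000
    # Jobs over the cap should sit pending with Reason=AssocMaxJobsLimit
    squeue -u alice --state=PD -o "%.10i %.20r"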

Re: [slurm-users] do oversubscription with algorithm other than least-loaded?

2022-03-07 Thread Herc Silverstein
ing to something else entirely, could you elaborate on the least-loaded configuration in your setup? On 24/02/2022 23:35:30, Herc Silverstein wrote: Hi,

[slurm-users] do oversubscription with algorithm other than least-loaded?

2022-02-24 Thread Herc Silverstein
Hi, We would like to do over-subscription on a cluster that's running in the cloud.  The cluster dynamically spins up and down cpu nodes as needed.  What we see is that the least-loaded algorithm causes the maximum number of nodes specified in the partition to be spun up and each loaded with N
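
For readers following the thread, these are the slurm.conf knobs normally involved in node selection and oversubscription; whether any of them changes the spin-up behaviour described above for cloud nodes is exactly what is being asked, so treat this only as a hedged sketch (partition and node names are placeholders):

    # slurm.conf (fragment), illustrative values only
    SelectType=select/cons_tres
    # CR_LLN would make least-loaded-node selection explicit; CR_Pack_Nodes
    # instead packs a multi-node allocation onto as few nodes as possible.
    SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes
    # LLN=NO avoids the per-partition least-loaded policy;
    # OverSubscribe=YES:4 allows up to 4 jobs to share the same resources.
    PartitionName=cloud Nodes=cloud-[001-100] OverSubscribe=YES:4 LLN=NO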

[slurm-users] Use Task affinity on a per partition basis?

2021-06-07 Thread Herc Silverstein
Hi, Is there a way to use task affinity on a per-partition basis?  We couldn't find anything in the docs that described doing this.  And our attempts to specify this on a per partition basis failed. Thanks, Herc
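
As far as I am aware, TaskPlugin is a cluster-wide slurm.conf setting rather than a per-partition one, which matches the difficulty described above. A hedged sketch of the global setting plus the per-job binding flag that can approximate finer-grained control:

    # slurm.conf (fragment) -- TaskPlugin applies to the whole cluster
    TaskPlugin=task/affinity,task/cgroup
    # Binding can still be requested explicitly per job at run time, e.g.:
    #   srun --cpu-bind=cores ./my_app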

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-06-04 Thread Herc Silverstein
Hi, The slurmctld.log shows (for this node): ... [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729 NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand [2021-05-25T00:12:27.482] sched: Allocate JobId=3402730 NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondem

[slurm-users] nodes going to down* and getting stuck in that state

2021-05-19 Thread Herc Silverstein
Hi, We have a cluster (in Google gcp) which has a few partitions set up to auto-scale, but one partition is set up to not autoscale. The desired state is for all of the nodes in this non-autoscaled partition (SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted.  However, we
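
For context, SuspendExcParts is one of the slurm.conf power-save settings; a minimal sketch of the group of parameters involved, with the times and script paths being hypothetical placeholders:

    # slurm.conf (fragment), illustrative values only
    SuspendProgram=/opt/slurm/bin/suspend_node.sh    # hypothetical path
    ResumeProgram=/opt/slurm/bin/resume_node.sh      # hypothetical path
    SuspendTime=300          # idle seconds before a node is suspended
    ResumeTimeout=600        # seconds allowed for a node to come back up
    # Partitions listed here are never suspended by the power-save logic
    SuspendExcParts=gpu-t4-4x-ondemand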

Re: [slurm-users] prolog not passing env var to job

2021-02-12 Thread Herc Silverstein
:27 PM, mercan wrote: Hi; Prolog and TaskProlog are different parameters and scripts. You should use the TaskProlog script to set env. variables. Regards; Ahmet M. On 13.02.2021 00:12, Herc Silverstein wrote: Hi, I have a prolog script that is being run via the slurm.conf Prolog= setting.
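
As the reply notes, the documented mechanism for injecting environment variables into job tasks is TaskProlog (set via TaskProlog= in slurm.conf): the script's standard output is parsed, and lines of the form "export NAME=value" are added to the task environment. A minimal sketch, with the variable name and script path purely illustrative:

    #!/bin/bash
    # TaskProlog script (referenced as TaskProlog=/etc/slurm/task_prolog.sh)
    # Lines printed as "export NAME=value" become environment variables in
    # the user's tasks; plain "print ..." lines go to the task's stdout.
    echo "export MY_SCRATCH_DIR=/scratch/${SLURM_JOB_ID}"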

[slurm-users] prolog not passing env var to job

2021-02-12 Thread Herc Silverstein
Hi, I have a prolog script that is being run via the slurm.conf Prolog= setting.  I've verified that it's being executed on the compute node.  My problem is that I cannot get environment variables that I set in this prolog to be set/seen in the job. For example the prolog: #!/bin/bash ...