[slurm-users] Change a job from --exclusive to --exclusive=user

2024-09-18 Thread Gerhard Strangar via slurm-users
Hello, is it possible to change a pending job from --exclusive to --exclusive=user? I tried scontrol update jobid=... oversubscribe=user, but it seems to only accept yes or no. Gerhard
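For reference, a minimal sketch of what was tried (12345 is a placeholder job id):

    # show the current sharing mode of the pending job (look for OverSubscribe=)
    scontrol show job 12345 | grep -i oversubscribe
    # the update that was attempted; as reported above, it accepts only YES or NO
    scontrol update jobid=12345 oversubscribe=user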

[slurm-users] Re: Limit GPU depending on type

2024-06-13 Thread Gerhard Strangar via slurm-users
Gestió Servidors via slurm-users wrote: > What I want is that users could use all of them, but simultaneously a user could only > use one of the RTX3080. How about two partitions: one contains only the RTX3080, using the QoS MaxTRESPerUser=gres/gpu=1, and another one with all the other GPUs not having ...
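A rough sketch of that two-partition layout, assuming hypothetical node names (RTX3080 cards in gpu[01-02], the rest in gpu[03-08]):

    # QoS that caps each user at one GPU, attached only to the RTX3080 partition
    sacctmgr add qos rtx3080 MaxTRESPerUser=gres/gpu=1

    # slurm.conf
    PartitionName=rtx3080 Nodes=gpu[01-02] QOS=rtx3080 State=UP
    PartitionName=gpu     Nodes=gpu[03-08] State=UP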

[slurm-users] Avoiding fragmentation

2024-04-08 Thread Gerhard Strangar via slurm-users
Hi, I'm trying to figure out how to deal with a mix of few- and many-cpu jobs. By that I mean most jobs use 128 cpus, but sometimes there are jobs with only 16. As soon as that job with only 16 is running, the scheduler splits the next 128-cpu jobs into 96+16 each, instead of assigning a full 128 ...
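One hedged workaround on the submission side (not a scheduler fix) is to make the wide jobs ask for whole nodes so they cannot be split, assuming the nodes have 128 cores:

    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=128   # the 128-task job must land on a single node
    # or, independent of the core count:
    #SBATCH --exclusive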

[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread Gerhard Strangar via slurm-users
thomas.hartmann--- via slurm-users wrote: > My idea was to basically have three partitions: > > 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99] > PriorityTier=100 > 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] > PriorityTier=100 > 3. PartitionNam

[slurm-users] Re: "Optimal" slurm configuration

2024-02-26 Thread Gerhard Strangar via slurm-users
Max Grönke via slurm-users wrote: > (b) introduce a "small" partition for the <4h jobs with higher priority, but > we're unsure if this will block all the larger jobs from running... Just limit the number of cpus in that partition. Gerhard
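A sketch of one way to cap that partition, via a partition QoS (the QoS name and the CPU count are placeholders):

    sacctmgr add qos smallcap GrpTRES=cpu=256   # at most 256 CPUs in use at once in this partition
    # slurm.conf
    PartitionName=small Nodes=node[01-99] MaxTime=04:00:00 PriorityTier=200 QOS=smallcap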

[slurm-users] Memory used per node

2024-02-09 Thread Gerhard Strangar via slurm-users
Hello, I'm wondering if there's a way to tell how much memory my job is using per node. I'm doing #SBATCH -n 256 srun solver inputfile When I run sacct -o maxvmsize, the result apparently is the maximum VSZ of the largest solver process, not the maximum of the sum of them all (unlike when calling ...
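Not exactly a per-node sum, but two hedged ways to get closer than MaxVMSize (12345 is a placeholder job id; the second needs TRES usage gathering in the accounting database):

    # running job: peak RSS of the largest task per step, and the node it ran on
    sstat -j 12345 --format=JobID,MaxRSS,MaxRSSNode,MaxRSSTask
    # finished job: usage summed over all tasks of each step
    sacct -j 12345 --format=JobID,TRESUsageInTot%60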

Re: [slurm-users] How to run one maintenance job on each node in the cluster

2023-12-23 Thread Gerhard Strangar
Jeffrey Tunison wrote: > Is there a straightforward way to create a batch job that runs once on every > node in the cluster? A wrapper around reboot configured as RebootProgram in slurm.conf?
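A minimal sketch of that idea; the wrapper path is hypothetical, and scontrol reboot is the usual alternative to submitting batch jobs at all:

    # slurm.conf
    RebootProgram=/usr/local/sbin/slurm_reboot.sh   # hypothetical wrapper around /sbin/reboot

    # reboot every node as soon as it becomes idle, then return it to service
    scontrol reboot ASAP nextstate=resume node[01-99]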

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Gerhard Strangar
Laurence Marks wrote: > After some (irreproducible) time, often one of the three slow tasks hangs. > A symptom is that if I try and ssh into the main node of the subtask (which > is running 128 mpi on the 4 nodes) I get "Authentication failed". How about asking an admin to check why it hangs?

Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-13 Thread Gerhard Strangar
Tim Wickberg wrote: > A number of race conditions have been identified within the > slurmd/slurmstepd processes that can lead to the user taking ownership > of an arbitrary file on the system. Is it any different than the CVE-2023-41915 in PMIx or does it just have an additional number but it's

Re: [slurm-users] Aborting a job from inside the prolog

2023-06-20 Thread Gerhard Strangar
Alexander Grund wrote: > Although it may be better to not drain it, I'm a bit nervous with "exit > 0" as it is very important that the job does not start/continue, i.e. > the user code (sbatch script/srun) is never executed in that case. > So I want to be sure that an `scancel` on the job in its

Re: [slurm-users] Aborting a job from inside the prolog

2023-06-19 Thread Gerhard Strangar
Alexander Grund wrote: > Our first approach with `scancel $SLURM_JOB_ID; exit 1` doesn't seem to > work as the (sbatch) job still gets re-queued. Try to exit with 0, because it's not your prolog that failed.
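A minimal prolog sketch along those lines (the health check is a placeholder); whether the batch step can still start before the cancel takes effect is exactly the concern raised in the follow-up above:

    #!/bin/bash
    # Prolog: if the node is not healthy, cancel the job instead of failing,
    # so the node is not drained and the job is not requeued.
    if ! /usr/local/sbin/node_ok.sh; then   # hypothetical health check
        scancel "$SLURM_JOB_ID"
    fi
    exit 0   # the prolog itself did not fail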

Re: [slurm-users] Does Slurm have any equivalent to LSF elim for generating dynamic node resources

2023-03-03 Thread Gerhard Strangar
Amir Ben Avi wrote: > I have looked at the Slurm documentation, but didn't find any way to create > resources dynamically (in a script) on the node level Well, basically you could do something like scontrol update nodename=$HOSTNAME Gres=myres:367. What you don't have is decaying resource reser...
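A hedged sketch of that approach, run periodically on each node (myres, the probe script and the count are placeholders; the GRES still has to be declared in slurm.conf/gres.conf first):

    # e.g. from cron or a systemd timer on the compute node
    COUNT=$(/usr/local/bin/measure_myres.sh)   # hypothetical probe
    scontrol update nodename="$(hostname -s)" Gres="myres:$COUNT"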

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Gerhard Strangar
Phil Chiu wrote: > - Individual slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node. But I'm not sure how to limit this so at most N jobs are running simultaneously. With a fake license called reboot?
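A sketch of the fake-license idea, assuming at most 4 nodes rebooting at once and that the jobs are allowed to reboot their node:

    # slurm.conf
    Licenses=reboot:4

    # one reboot job per node, each holding one of the four licenses
    for n in $(scontrol show hostnames 'node[01-99]'); do
        sbatch -w "$n" -L reboot:1 --job-name="reboot-$n" --wrap 'sudo /sbin/reboot'
    done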

Re: [slurm-users] Jobs fail on specific nodes.

2022-05-25 Thread Gerhard Strangar
Roger Mason wrote: > I would appreciate any suggestions on what might be causing this problem > or what I can do to diagnose it. Run getent hosts node012 on all hosts to see which one can't resolve it.
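For example, as a quick loop over all nodes (plain ssh here just as a sketch; pdsh/clush would do the same in one command):

    # check name resolution of node012 from every node in the cluster
    for h in $(sinfo -h -N -o '%N' | sort -u); do
        ssh "$h" 'getent hosts node012 || echo "lookup failed on $(hostname)"'
    done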

Re: [slurm-users] Sharing a GPU

2022-04-03 Thread Gerhard Strangar
Eric F. Alemany wrote: > Another solution would be the vNVIDIA GPU > (Virtual GPU manager software). > You can share GPU among VM’s You can really *share* one, not just delegate one GPU to one VM?

Re: [slurm-users] Limit partition to 1 job at a time

2022-03-23 Thread Gerhard Strangar
Russell Jones wrote: > I suppose I am confused about how GrpJobs works. The manual shows: > > The total number of jobs able to run at any given time from an association > and its children QOS > > > It is my understanding an association is cluster + account + user. Would > this not just limit it

Re: [slurm-users] Limit partition to 1 job at a time

2022-03-22 Thread Gerhard Strangar
Russell Jones wrote: > I am struggling to figure out how to do this. Any tips? Create a QoS with GrpJobs=1 and assign it to the partition?
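A sketch of that suggestion (the QoS and partition names are placeholders):

    sacctmgr add qos onejob GrpJobs=1   # at most one job running at a time
    # slurm.conf
    PartitionName=serial Nodes=node[01-10] QOS=onejob State=UP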

Re: [slurm-users] How to checkout a slurm node?

2021-11-12 Thread Gerhard Strangar
Joe Teumer wrote: > However, if the user needs to reboot the node, set BIOS settings, etc then > `salloc` automatically terminates the allocation when the new shell is What kind of BIOS settings would a user need to change?

Re: [slurm-users] how to check what slurm is doing when job pending with reason=none?

2021-06-16 Thread Gerhard Strangar
taleinterve...@sjtu.edu.cn wrote: > But after submission, this job still stays at the PENDING state for about 30-60s, and > during the pending time sacct shows the REASON is "None". It's the default sched_interval=60 in your slurm.conf. Gerhard
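That interval is a SchedulerParameters option; a sketch of lowering it, at the cost of more frequent scheduling passes:

    # slurm.conf
    SchedulerParameters=sched_interval=10   # main scheduling loop every 10s instead of the default 60s
    # apply with: scontrol reconfigure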

[slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Gerhard Strangar
Hello, how do you implement something like "drain host after 10 consecutive failed jobs"? Unlike a host check script that checks for known errors, I'd like to stop killing jobs just because one node is faulty. Gerhard
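One hedged way to build this is a node Epilog that counts consecutive non-zero exit codes and drains the node at a threshold; a sketch with placeholder paths, and with the caveat that the exit code may not yet be in the accounting database when the epilog runs:

    #!/bin/bash
    # Node Epilog sketch: drain after 10 consecutive failed jobs on this node.
    COUNTER=/var/spool/slurmd/failcount   # hypothetical per-node state file
    EC=$(sacct -n -X -j "$SLURM_JOB_ID" -o ExitCode | awk -F: 'NR==1 {gsub(/ /,""); print $1}')
    if [ "${EC:-0}" -eq 0 ]; then
        echo 0 > "$COUNTER"
    else
        n=$(( $(cat "$COUNTER" 2>/dev/null || echo 0) + 1 ))
        echo "$n" > "$COUNTER"
        [ "$n" -ge 10 ] && scontrol update nodename="$(hostname -s)" \
            state=drain reason="10 consecutive failed jobs"
    fi
    exit 0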

[slurm-users] scancel the solver, not MPI

2020-10-02 Thread Gerhard Strangar
Hi, I'm wondering if it's possible to gracefully terminate a solver that is using MPI. If srun starts the MPI for me, can it tell the solver to terminate and then wait n seconds before it tells MPI to terminate? Or is the only way of handling this using scancel -b and trapping the signal?
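A sketch of the trap variant mentioned at the end; the solver name, the USR1 signal and the 60 s grace period are placeholders for whatever the solver actually understands:

    #!/bin/bash
    #SBATCH -n 128
    # On SIGTERM (e.g. from "scancel --batch --signal=TERM <jobid>"), ask the
    # solver to wind down, wait, then terminate the MPI step.
    stop_gracefully() {
        kill -USR1 "$STEP_PID"   # srun forwards the signal to the solver tasks
        sleep 60                 # grace period for checkpointing/cleanup
        kill -TERM "$STEP_PID"
    }
    trap stop_gracefully TERM
    srun solver inputfile &
    STEP_PID=$!
    wait "$STEP_PID"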

Re: [slurm-users] Limit nodes of a partition without managing users

2020-08-17 Thread Gerhard Strangar
Brian Andrus wrote: > Most likely, but the specific approach depends on how you define what > you want. My idea was "a high-prio job is next unless there are too many of them". > For example, what if there are no jobs in the high-pri queue but many in > low? Should all the low ones run? Yes. > What s...

Re: [slurm-users] [External] Limit nodes of a partition without managing users

2020-08-17 Thread Gerhard Strangar
Prentice Bisbal wrote: >> I'm wondering if it's possible to have slurm 19 run two partitions (low >> and high prio) that share all the nodes and limit the high-prio >> partition in the number of nodes used simultaneously without having to >> manage the users in the database. > Yes, you can do this using ...
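One way that avoids per-user bookkeeping is a partition QoS with a group limit; a sketch, assuming the high-priority partition may hold at most 10 nodes at a time (this still needs slurmdbd for the QoS itself, but no per-user entries):

    sacctmgr add qos highcap GrpTRES=node=10
    # slurm.conf
    PartitionName=high Nodes=node[01-99] PriorityTier=200 QOS=highcap State=UP
    PartitionName=low  Nodes=node[01-99] PriorityTier=100 State=UP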

[slurm-users] Limit nodes of a partition without managing users

2020-08-17 Thread Gerhard Strangar
Hello, I'm wondering if it's possible to have slurm 19 run two partitions (low and high prio) that share all the nodes and limit the high-prio partition in the number of nodes used simultaneously without having to manage the users in the database. Any ideas? Regards, Gerhard

Re: [slurm-users] Debugging communication problems

2020-08-06 Thread Gerhard Strangar
Gerhard Strangar wrote: > I'm experiencing a connectivity problem and I'm out of ideas why this > is happening. I'm running a slurmctld on a multihomed host. > > (10.9.8.0/8) - master - (10.11.12.0/8) > There is no routing between these two subnets. My topolog...

[slurm-users] Debugging communication problems

2020-08-04 Thread Gerhard Strangar
Hi, I'm experiencing a connectivity problem and I'm out of ideas why this is happening. I'm running a slurmctld on a multihomed host. (10.9.8.0/8) - master - (10.11.12.0/8) There is no routing between these two subnets. So far, all slurmds resided in the first subnet and worked fine. I added so...
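A few hedged first checks from a slurmd in the new subnet (master is the placeholder controller name):

    # which controller name/port is the node configured to use?
    scontrol show config | grep -iE 'SlurmctldHost|SlurmctldPort'
    # does that name resolve to an address reachable from this subnet?
    getent hosts master
    # is the controller answering at all?
    scontrol ping
    # run slurmd in the foreground with verbose logging to see which address it tries
    slurmd -D -vvvv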