[slurm-users] Change a job from --exclusive to --exclusive=user

2024-09-18 Thread Gerhard Strangar via slurm-users
Hello, is it possible to change a pending job from --exclusive to --exclusive=user? I tried scontrol update jobid=... oversubscribe=user, but it seems to only accept yes or no. Gerhard
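For reference, a minimal sketch of what was tried (12345 is a placeholder job id):

    # show the current sharing mode of the pending job (look for OverSubscribe=)
    scontrol show job 12345 | grep -i oversubscribe
    # the update that was attempted; as reported above, it accepts only YES or NO
    scontrol update jobid=12345 oversubscribe=user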

[slurm-users] Re: Limit GPU depending on type

2024-06-13 Thread Gerhard Strangar via slurm-users
Gestió Servidors via slurm-users wrote: > What I want is that users could use all of them, but simultaneously a user could only > use one of the RTX3080. How about two partitions: one contains only the RTX3080, using the QoS MaxTRESPerUser=gres/gpu=1, and another one with all the other GPUs not having ...
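A rough sketch of that two-partition layout, assuming hypothetical node names (RTX3080 cards in gpu[01-02], the rest in gpu[03-08]):

    # QoS that caps each user at one GPU, attached only to the RTX3080 partition
    sacctmgr add qos rtx3080 MaxTRESPerUser=gres/gpu=1

    # slurm.conf
    PartitionName=rtx3080 Nodes=gpu[01-02] QOS=rtx3080 State=UP
    PartitionName=gpu     Nodes=gpu[03-08] State=UP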

[slurm-users] Avoiding fragmentation

2024-04-08 Thread Gerhard Strangar via slurm-users
Hi, I'm trying to figure out how to deal with a mix of few- and many-cpu jobs. By that I mean most jobs use 128 cpus, but sometimes there are jobs with only 16. As soon as that job with only 16 is running, the scheduler splits the next 128-cpu jobs into 96+16 each, instead of assigning a full 128 ...
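One hedged workaround on the submission side (not a scheduler fix) is to make the wide jobs ask for whole nodes so they cannot be split, assuming the nodes have 128 cores:

    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=128   # the 128-task job must land on a single node
    # or, independent of the core count:
    #SBATCH --exclusive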

[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread Gerhard Strangar via slurm-users
thomas.hartmann--- via slurm-users wrote: > My idea was to basically have three partitions: > > 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99] > PriorityTier=100 > 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] > PriorityTier=100 > 3. PartitionNam

[slurm-users] Re: "Optimal" slurm configuration

2024-02-26 Thread Gerhard Strangar via slurm-users
Max Grönke via slurm-users wrote: > (b) introduce a "small" partition for the <4h jobs with higher priority, but > we're unsure if this will block all the larger jobs from running... Just limit the number of cpus in that partition. Gerhard
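A sketch of one way to cap that partition, via a partition QoS (the QoS name and the CPU count are placeholders):

    sacctmgr add qos smallcap GrpTRES=cpu=256   # at most 256 CPUs in use at once in this partition
    # slurm.conf
    PartitionName=small Nodes=node[01-99] MaxTime=04:00:00 PriorityTier=200 QOS=smallcap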

[slurm-users] Memory used per node

2024-02-09 Thread Gerhard Strangar via slurm-users
Hello, I'm wondering if there's a way to tell how much memory my job is using per node. I'm doing #SBATCH -n 256 srun solver inputfile When I run sacct -o maxvmsize, the result apparently is the maximum VSZ of the largest solver process, not the maximum of the sum of them all (unlike when calling ...
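Not exactly a per-node sum, but two hedged ways to get closer than MaxVMSize (12345 is a placeholder job id; the second needs TRES usage gathering in the accounting database):

    # running job: peak RSS of the largest task per step, and the node it ran on
    sstat -j 12345 --format=JobID,MaxRSS,MaxRSSNode,MaxRSSTask
    # finished job: usage summed over all tasks of each step
    sacct -j 12345 --format=JobID,TRESUsageInTot%60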

Re: [slurm-users] How to run one maintenance job on each node in the cluster

2023-12-23 Thread Gerhard Strangar
Jeffrey Tunison wrote: > Is there a straightforward way to create a batch job that runs once on every > node in the cluster? A wrapper around reboot configured as RebootProgram in slurm.conf?
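A minimal sketch of that idea; the wrapper path is hypothetical, and scontrol reboot is the usual alternative to submitting batch jobs at all:

    # slurm.conf
    RebootProgram=/usr/local/sbin/slurm_reboot.sh   # hypothetical wrapper around /sbin/reboot

    # reboot every node as soon as it becomes idle, then return it to service
    scontrol reboot ASAP nextstate=resume node[01-99]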

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Gerhard Strangar
Laurence Marks wrote: > After some (irreproducible) time, often one of the three slow tasks hangs. > A symptom is that if I try and ssh into the main node of the subtask (which > is running 128 mpi on the 4 nodes) I get "Authentication failed". How about asking an admin to check why it hangs?

Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-13 Thread Gerhard Strangar
Tim Wickberg wrote: > A number of race conditions have been identified within the > slurmd/slurmstepd processes that can lead to the user taking ownership > of an arbitrary file on the system. Is it any different than the CVE-2023-41915 in PMIx or does it just have an additional number but it's

Re: [slurm-users] Aborting a job from inside the prolog

2023-06-20 Thread Gerhard Strangar
Alexander Grund wrote: > Although it may be better to not drain it, I'm a bit nervous with "exit > 0" as it is very important that the job does not start/continue, i.e. > the user code (sbatch script/srun) is never executed in that case. > So I want to be sure that an `scancel` on the job in its

Re: [slurm-users] Aborting a job from inside the prolog

2023-06-19 Thread Gerhard Strangar
Alexander Grund wrote: > Our first approach with `scancel $SLURM_JOB_ID; exit 1` doesn't seem to > work as the (sbatch) job still gets re-queued. Try to exit with 0, because it's not your prolog that failed.
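A minimal prolog sketch along those lines (the health check is a placeholder); whether the batch step can still start before the cancel takes effect is exactly the concern raised in the follow-up above:

    #!/bin/bash
    # Prolog: if the node is not healthy, cancel the job instead of failing,
    # so the node is not drained and the job is not requeued.
    if ! /usr/local/sbin/node_ok.sh; then   # hypothetical health check
        scancel "$SLURM_JOB_ID"
    fi
    exit 0   # the prolog itself did not fail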

Re: [slurm-users] Does Slurm have any equivalent to LSF elim for generating dynamic node resources

2023-03-03 Thread Gerhard Strangar
Amir Ben Avi wrote: > I have looked at the Slurm documentation, but didn't find any way to create > resources dynamically (in a script) on the node level Well, basically you could do something like scontrol update nodename=$HOSTNAME Gres=myres:367. What you don't have is decaying resource reser...
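A hedged sketch of that approach, run periodically on each node (myres, the probe script and the count are placeholders; the GRES still has to be declared in slurm.conf/gres.conf first):

    # e.g. from cron or a systemd timer on the compute node
    COUNT=$(/usr/local/bin/measure_myres.sh)   # hypothetical probe
    scontrol update nodename="$(hostname -s)" Gres="myres:$COUNT"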

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Gerhard Strangar
Phil Chiu wrote: > - Individual slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node. But I'm not sure how to limit this so at most N jobs are running simultaneously. With a fake license called reboot?
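A sketch of the fake-license idea, assuming at most 4 nodes rebooting at once and that the jobs are allowed to reboot their node:

    # slurm.conf
    Licenses=reboot:4

    # one reboot job per node, each holding one of the four licenses
    for n in $(scontrol show hostnames 'node[01-99]'); do
        sbatch -w "$n" -L reboot:1 --job-name="reboot-$n" --wrap 'sudo /sbin/reboot'
    done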

Re: [slurm-users] Jobs fail on specific nodes.

2022-05-25 Thread Gerhard Strangar
Roger Mason wrote: > I would appreciate any suggestions on what might be causing this problem > or what I can do to diagnose it. Run getent hosts node012 on all hosts to see which one can't resolve it.
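For example, as a quick loop over all nodes (plain ssh here just as a sketch; pdsh/clush would do the same in one command):

    # check name resolution of node012 from every node in the cluster
    for h in $(sinfo -h -N -o '%N' | sort -u); do
        ssh "$h" 'getent hosts node012 || echo "lookup failed on $(hostname)"'
    done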

Re: [slurm-users] Sharing a GPU

2022-04-03 Thread Gerhard Strangar
Eric F. Alemany wrote: > Another solution would be the vNVIDIA GPU > (Virtual GPU manager software). > You can share GPU among VM’s You can really *share* one, not just delegate one GPU to one VM?

Re: [slurm-users] Limit partition to 1 job at a time

2022-03-23 Thread Gerhard Strangar
Russell Jones wrote: > I suppose I am confused about how GrpJobs works. The manual shows: > > The total number of jobs able to run at any given time from an association > and its children QOS > > > It is my understanding an association is cluster + account + user. Would > this not just limit it

Re: [slurm-users] Limit partition to 1 job at a time

2022-03-22 Thread Gerhard Strangar
Russell Jones wrote: > I am struggling to figure out how to do this. Any tips? Create a QoS with GrpJobs=1 and assign it to the partition?
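A sketch of that suggestion (the QoS and partition names are placeholders):

    sacctmgr add qos onejob GrpJobs=1   # at most one job running at a time
    # slurm.conf
    PartitionName=serial Nodes=node[01-10] QOS=onejob State=UP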

Re: [slurm-users] How to checkout a slurm node?

2021-11-12 Thread Gerhard Strangar
Joe Teumer wrote: > However, if the user needs to reboot the node, set BIOS settings, etc then > `salloc` automatically terminates the allocation when the new shell is What kind of BIOS settings would a user need to change?

Re: [slurm-users] how to check what slurm is doing when job pending with reason=none?

2021-06-16 Thread Gerhard Strangar
taleinterve...@sjtu.edu.cn wrote: > But after submission, this job still stays at the PENDING state for about 30-60s, and > during the pending time sacct shows the REASON is "None". It's the default sched_interval=60 in your slurm.conf. Gerhard
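That interval is a SchedulerParameters option; a sketch of lowering it, at the cost of more frequent scheduling passes:

    # slurm.conf
    SchedulerParameters=sched_interval=10   # main scheduling loop every 10s instead of the default 60s
    # apply with: scontrol reconfigure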

[slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Gerhard Strangar
Hello, how do you implement something like "drain host after 10 consecutive failed jobs"? Unlike a host check script that checks for known errors, I'd like to stop killing jobs just because one node is faulty. Gerhard
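One hedged way to build this is a node Epilog that counts consecutive non-zero exit codes and drains the node at a threshold; a sketch with placeholder paths, and with the caveat that the exit code may not yet be in the accounting database when the epilog runs:

    #!/bin/bash
    # Node Epilog sketch: drain after 10 consecutive failed jobs on this node.
    COUNTER=/var/spool/slurmd/failcount   # hypothetical per-node state file
    EC=$(sacct -n -X -j "$SLURM_JOB_ID" -o ExitCode | awk -F: 'NR==1 {gsub(/ /,""); print $1}')
    if [ "${EC:-0}" -eq 0 ]; then
        echo 0 > "$COUNTER"
    else
        n=$(( $(cat "$COUNTER" 2>/dev/null || echo 0) + 1 ))
        echo "$n" > "$COUNTER"
        [ "$n" -ge 10 ] && scontrol update nodename="$(hostname -s)" \
            state=drain reason="10 consecutive failed jobs"
    fi
    exit 0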

[slurm-users] scancel the solver, not MPI

2020-10-02 Thread Gerhard Strangar
Hi, I'm wondering if it's possible to gracefully terminate a solver that is using MPI. If srun starts the MPI for me, can it tell the solver to terminate and then wait n seconds before it tells MPI to terminate? Or is the only way of handling this using scancel -b and trapping the signal?
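A sketch of the trap variant mentioned at the end; the solver name, the USR1 signal and the 60 s grace period are placeholders for whatever the solver actually understands:

    #!/bin/bash
    #SBATCH -n 128
    # On SIGTERM (e.g. from "scancel --batch --signal=TERM <jobid>"), ask the
    # solver to wind down, wait, then terminate the MPI step.
    stop_gracefully() {
        kill -USR1 "$STEP_PID"   # srun forwards the signal to the solver tasks
        sleep 60                 # grace period for checkpointing/cleanup
        kill -TERM "$STEP_PID"
    }
    trap stop_gracefully TERM
    srun solver inputfile &
    STEP_PID=$!
    wait "$STEP_PID"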

Re: [slurm-users] Limit nodes of a partition without managing users

2020-08-17 Thread Gerhard Strangar
Brian Andrus wrote: > Most likely, but the specific approach depends on how you define what > you want. My idea was "a high-prio job is next unless there are too many of them". > For example, what if there are no jobs in the high-pri queue but many in > low? Should all the low ones run? Yes. > What s...

Re: [slurm-users] [External] Limit nodes of a partition without managing users

2020-08-17 Thread Gerhard Strangar
Prentice Bisbal wrote: >> I'm wondering if it's possible to have slurm 19 run two partitions (low >> and high prio) that share all the nodes and limit the high-prio >> partition in the number of nodes used simultaneously without having to >> manage the users in the database. > Yes, you can do this using ...
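One way that avoids per-user bookkeeping is a partition QoS with a group limit; a sketch, assuming the high-priority partition may hold at most 10 nodes at a time (this still needs slurmdbd for the QoS itself, but no per-user entries):

    sacctmgr add qos highcap GrpTRES=node=10
    # slurm.conf
    PartitionName=high Nodes=node[01-99] PriorityTier=200 QOS=highcap State=UP
    PartitionName=low  Nodes=node[01-99] PriorityTier=100 State=UP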

[slurm-users] Limit nodes of a partition without managing users

2020-08-17 Thread Gerhard Strangar
Hello, I'm wondering if it's possible to have slurm 19 run two partitions (low and high prio) that share all the nodes and limit the high-prio partition in the number of nodes used simultaneously without having to manage the users in the database. Any ideas? Regards, Gerhard

Re: [slurm-users] Debugging communication problems

2020-08-06 Thread Gerhard Strangar
Gerhard Strangar wrote: > I'm experiencing a connectivity problem and I'm out of ideas why this > is happening. I'm running a slurmctld on a multihomed host. > > (10.9.8.0/8) - master - (10.11.12.0/8) > There is no routing between these two subnets. My topolog...

[slurm-users] Debugging communication problems

2020-08-04 Thread Gerhard Strangar
Hi, I'm experiencing a connectivity problem and I'm out of ideas why this is happening. I'm running a slurmctld on a multihomed host. (10.9.8.0/8) - master - (10.11.12.0/8) There is no routing between these two subnets. So far, all slurmds resided in the first subnet and worked fine. I added so...
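A few hedged first checks from a slurmd in the new subnet (master is the placeholder controller name):

    # which controller name/port is the node configured to use?
    scontrol show config | grep -iE 'SlurmctldHost|SlurmctldPort'
    # does that name resolve to an address reachable from this subnet?
    getent hosts master
    # is the controller answering at all?
    scontrol ping
    # run slurmd in the foreground with verbose logging to see which address it tries
    slurmd -D -vvvv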