Hello,
is it possible to change a pending job from --exclusive to
--exclusive=user? I tried scontrol update jobid=... oversubscribe=user,
but it seems to only accept yes or no.
Gerhard
Gestió Servidors via slurm-users wrote:
> What I want is that users could use all of them, but simultaneously a user
> could only use one of the RTX3080s.
How about two partitions: one contains only the RTX3080, using the QoS
MaxTRESPerUser=gres/gpu=1, and another one with all the other GPUs not
having that restriction.
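A minimal sketch of that layout; the partition, QoS and node names below are
made up:

  sacctmgr add qos rtx3080
  sacctmgr modify qos rtx3080 set MaxTRESPerUser=gres/gpu=1
  # slurm.conf
  PartitionName=rtx3080 Nodes=gpunode01 QOS=rtx3080 State=UP
  PartitionName=gpu Nodes=gpunode[02-04] State=UP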
Hi,
I'm trying to figure out how to deal with a mix of few- and many-cpu
jobs. By that I mean most jobs use 128 cpus, but sometimes there are
jobs with only 16. As soon as that job with only 16 is running, the
scheduler splits the next 128-cpu jobs into 96+16 each, instead of
assigning a full 128
thomas.hartmann--- via slurm-users wrote:
> My idea was to basically have three partitions:
>
> 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99]
> PriorityTier=100
> 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50]
> PriorityTier=100
> 3. PartitionNam
Max Grönke via slurm-users wrote:
> (b) introduce a "small" partition for the <4h jobs with higher priority but
> we're unsure if this will block all the larger jobs from running...
Just limit the number of cpus in that partition.
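One hedged way to do that is a partition QoS with a group CPU cap; the QoS
name and the 512-cpu limit are just examples:

  sacctmgr add qos smallcap
  sacctmgr modify qos smallcap set GrpTRES=cpu=512
  # slurm.conf
  PartitionName=small Nodes=node[01-99] MaxTime=04:00:00 QOS=smallcap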
Gerhard
Hello,
I'm wondering if there's a way to tell how much memory my job is using
per node. I'm doing
#SBATCH -n 256
srun solver inputfile
When I run sacct -o maxvmsize, the result apparently is the maximum VSZ
of the largest solver process, not the maximum of the sum of them all
(unlike when calli
Jeffrey Tunison wrote:
> Is there a straightforward way to create a batch job that runs once on every
> node in the cluster?
A wrapper around reboot configured as RebootProgram in slurm.conf?
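Roughly like this; the wrapper path is made up, and the nodes are then told
to run it via scontrol:

  # slurm.conf
  RebootProgram=/usr/local/sbin/run-job-then-reboot
  # trigger it on all nodes:
  scontrol reboot ASAP nextstate=RESUME ALL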
Laurence Marks wrote:
> After some (irreproducible) time, often one of the three slow tasks hangs.
> A symptom is that if I try and ssh into the main node of the subtask (which
> is running 128 mpi on the 4 nodes) I get "Authentication failed".
How about asking an admin to check why it hangs?
Tim Wickberg wrote:
> A number of race conditions have been identified within the
> slurmd/slurmstepd processes that can lead to the user taking ownership
> of an arbitrary file on the system.
Is it any different than the CVE-2023-41915 in PMIx or does it just have
an additional number but it's
Alexander Grund wrote:
> Although it may be better to not drain it, I'm a bit nervous with "exit
> 0" as it is very important that the job does not start/continue, i.e.
> the user code (sbatch script/srun) is never executed in that case.
> So I want to be sure that an `scancel` on the job in its
Alexander Grund wrote:
> Our first approach with `scancel $SLURM_JOB_ID; exit 1` doesn't seem to
> work as the (sbatch) job still gets re-queued.
Try to exit with 0, because it's not your prolog that failed.
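A hedged sketch of that suggestion inside the prolog; health_check stands for
whatever test decides the job must not start:

  if ! health_check; then
      scancel "$SLURM_JOB_ID"   # stop the job instead of letting it run
      exit 0                    # prolog "succeeds", so no requeue and no drain
  fi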
Amir Ben Avi wrote:
> I have looked at the Slurm documentation, but didn't find any way to create
> resources dynamically (in a script) at the node level
Well, basically you could do something like
scontrol update nodename=$HOSTNAME Gres=myres:367. What you don't have
is decaying resource reservations.
Phil Chiu wrote:
>- Individual slurm jobs which reboot nodes - With a for loop, I could
>submit a reboot job for each node. But I'm not sure how to limit this so at
>most N jobs are running simultaneously.
With a fake license called reboot?
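A minimal sketch of the license trick; the limit of 4 concurrent reboots and
the script name are arbitrary:

  # slurm.conf
  Licenses=reboot:4
  # each per-node job requests one license
  sbatch -w node01 -L reboot:1 reboot_job.sh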
Roger Mason wrote:
> I would appreciate any suggestions on what might be causing this problem
> or what I can do to diagnose it.
Run getent hosts node012 on all hosts to see which one can't resolve it.
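For example something like this, assuming you can ssh to all nodes:

  for h in $(sinfo -N -h -o '%N' | sort -u); do
      echo "== $h"
      ssh "$h" getent hosts node012 || echo "cannot resolve node012"
  done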
Eric F. Alemany wrote:
> Another solution would be NVIDIA vGPU
> (Virtual GPU Manager software).
> You can share GPUs among VMs
You can really *share* one, not just delegate one GPU to one VM?
Russell Jones wrote:
> I suppose I am confused about how GrpJobs works. The manual shows:
>
> The total number of jobs able to run at any given time from an association
> and its children QOS
>
>
> It is my understanding an association is cluster + account + user. Would
> this not just limit it
Russell Jones wrote:
> I am struggling to figure out how to do this. Any tips?
Create a QoS with GrpJobs=1 and assign it to the partition?
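For instance (QoS name made up):

  sacctmgr add qos onejob
  sacctmgr modify qos onejob set GrpJobs=1
  # slurm.conf: add QOS=onejob to the PartitionName= line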
Joe Teumer wrote:
> However, if the user needs to reboot the node, set BIOS settings, etc then
> `salloc` automatically terminates the allocation when the new shell is
What kind of BIOS settings would a user need to change?
taleinterve...@sjtu.edu.cn wrote:
> But after submitting, this job still stays in the PENDING state for about
> 30-60s and during the pending time sacct shows the REASON as "None".
It's the default sched_interval=60 in your slurm.conf.
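If that delay matters, it can be lowered, e.g. (value just an example):

  # slurm.conf
  SchedulerParameters=sched_interval=30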
Gerhard
Hello,
how do you implement something like "drain host after 10 consecutive
failed jobs"? Unlike a host check script that checks for known errors,
I'd like to stop a faulty node from killing job after job.
Gerhard
Hi,
I'm wondering if it's possible to gracefully terminate a solver that is
using MPI. If srun starts the MPI for me, can it tell the solver to
terminate and then wait n seconds before it tells MPI to terminate?
Or is the only way of handling this using scancel -b and trapping the
signal?
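A rough sketch of the scancel -b route; the solver name, the 60 s grace period
and the assumption that the solver runs as step 0 are all placeholders:

  #!/bin/bash
  graceful_stop() {
      scancel --signal=TERM "${SLURM_JOB_ID}.0"   # ask the solver step to shut down
      sleep 60                                    # grace period
      scancel --signal=KILL "${SLURM_JOB_ID}.0"   # then force the MPI step to end
  }
  trap graceful_stop USR1    # triggered by: scancel -b --signal=USR1 <jobid>
  srun solver inputfile &    # run in the background so the shell can handle the trap
  wait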
Brian Andrus wrote:
> Most likely, but the specific approach depends on how you define what
> you want.
My idea was "the high prio job is next unless there are too many of them".
> For example, what if there are no jobs in high pri queue but many in
> low? Should all the low ones run?
Yes.
> What s
Prentice Bisbal wrote:
>> I'm wondering if it's possible to have Slurm 19 run two partitions (low
>> and high prio) that share all the nodes and limit the number of nodes the
>> high prio partition uses simultaneously, without having to manage the
>> users in the database.
> Yes, you can do this u
Hello,
I'm wondering if it's possible to have Slurm 19 run two partitions (low
and high prio) that share all the nodes and limit the number of nodes the
high prio partition uses simultaneously, without having to manage the
users in the database.
Any ideas?
Regards,
Gerhard
Gerhard Strangar wrote:
> I'm experiencing a connectivity problem and I'm out of ideas as to why this
> is happening. I'm running a slurmctld on a multihomed host.
>
> (10.9.8.0/8) - master - (10.11.12.0/8)
> There is no routing between these two subnets.
My topolog
Hi,
I'm experiencing a connectivity problem and I'm out of ideas as to why this
is happening. I'm running a slurmctld on a multihomed host.
(10.9.8.0/8) - master - (10.11.12.0/8)
There is no routing between these two subnets.
So far, all slurmds resided in the first subnet and worked fine. I added
so