Hi,
we're testing possible slurm configurations on a test system right now.
Eventually, it is going to serve ~1000 users.
We're going to have some users who are going to run lots of short jobs (a
couple of minutes to ~4h) and some users whose jobs are going to run for
days or weeks. I w
Thank you for the response; it certainly clears up a few things, and the
list of required packages is super helpful (where are these listed in the
docs?).
Here are a few follow-up questions:
I had installed Slurm (version 22.05) using apt by running 'apt install
slurm-wlm'. Is it necessary to exe
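For reference, one way to check what the Ubuntu package actually installed and
whether the daemons came up (a sketch; the package and unit names below are the
stock Debian/Ubuntu ones):
  dpkg -l | grep slurm               # slurm-wlm typically pulls in slurmctld, slurmd, slurm-client
  sinfo --version                    # version reported by the client tools
  systemctl status slurmctld slurmd  # only for the daemons this node actually runs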
Hi Thomas,
"thomas.hartmann--- via slurm-users"
writes:
> Hi,
> we're testing possible slurm configurations on a test system right now.
> Eventually, it is going to serve ~1000 users.
>
> We're going to have some users who are going to run lots of short jobs
> (a couple of minutes to ~4h) and s
Hi,
On 04.04.24 04:46, Toshiki Sonoda (Fujitsu) via slurm-users wrote:
> We set up scrun (slurm 23.11.5) integrated with rootless podman,
I'd recommend looking into nvidia enroot instead.
https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf
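As a rough sketch of what that route looks like from the user side, assuming
enroot plus the pyxis SPANK plugin described in those slides are installed (the
image names are arbitrary examples):
  srun --container-image=ubuntu:22.04 grep PRETTY_NAME /etc/os-release
  srun --container-image=nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
enroot runs the container unprivileged, so no root daemon is needed on the
compute nodes.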
Kind regards
--
Markus Kötter, +49 681 870832
Thanks! I realized I made a mistake and had it still talking to an older
slurmdbd system.
On Wed, Apr 3, 2024 at 1:54 PM Timo Rothenpieler via slurm-users
<slurm-users@lists.schedmd.com> wrote:
> On 02.04.2024 22:15, Russell Jones via slurm-users wrote:
> > Hi all,
> >
> > I am working on upgrad
I am writing to seek assistance with a critical issue on our single-node
system managed by Slurm. Our jobs are queued and marked as awaiting
resources, but they are not starting even though resources appear to be
available. I'm new to Slurm and my only experience was a class on
installing it, so I have no experie
What do “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID”
show?
On one job we currently have that’s pending due to Resources, that job has
requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the
node it wants to run on only has 37 CPUs available (seen by
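For anyone following along, a few commands that expose the relevant numbers
(the node name is taken from the thread; the job id is a placeholder):
  scontrol show node cusco | grep -E 'CfgTRES|AllocTRES'  # configured vs. currently allocated TRES
  scontrol show job <jobid>                               # JobState, Reason, requested CPUs/memory
  sinfo -N -o '%N %C %e %m'                               # per node: CPUs alloc/idle/other/total, free mem, total mem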
Yep, from your scontrol show node output:
CfgTRES=cpu=64,mem=2052077M,billing=64
AllocTRES=cpu=1,mem=2052077M
The running job (77) has allocated 1 CPU and all the memory on the node. That’s
probably due to the partition using the default DefMemPerCPU value [1], which
is unlimited.
Since all ou
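A minimal slurm.conf sketch of the usual fix; the partition and node names and
the per-CPU value are illustrative (roughly total memory divided by core count
for this 64-core, ~2 TB node):
  # memory granted per allocated CPU when a job does not request memory itself
  DefMemPerCPU=32000
  # or scoped to a single partition:
  PartitionName=batch Nodes=cusco Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=32000
With that in place, a 1-CPU job that does not pass --mem gets 32000 MB instead
of the whole node.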
On 04/04/2024 at 03:33, Loris Bennett via slurm-users wrote:
> I have never really understood the approach of having different
> partitions for different lengths of job, but it seems to be quite
> widespread, so I assume there are valid use cases.
> However, for our around 450 users, of which about 20
thomas.hartmann--- via slurm-users wrote:
> My idea was to basically have three partitions:
>
> 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99]
> PriorityTier=100
> 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50]
> PriorityTier=100
> 3. PartitionNam
Thank you! That was the issue. I'm so happy :-) Sending you many thanks.
On Thu, Apr 4, 2024 at 10:11 AM Renfro, Michael wrote:
> Yep, from your scontrol show node output:
>
> CfgTRES=cpu=64,mem=2052077M,billing=64
> AllocTRES=cpu=1,mem=2052077M
>
> The running job (77) has allocated 1 CP
Hi,
I'm currently testing an approach similar to the example by Loris.
Why consider preemption? Because, in the original example, if the cluster is
saturated by long-running jobs (e.g. two weeks), there should still be the
possibility to run short jobs right away.
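A sketch of the kind of configuration this implies, using partition-priority
preemption; the PriorityTier values and the choice of REQUEUE are illustrative:
  # slurm.conf (sketch)
  PreemptType=preempt/partition_prio
  PreemptMode=REQUEUE
  # jobs in a higher PriorityTier partition may preempt jobs in a lower one
  PartitionName=short MaxTime=04:00:00    State=UP Nodes=node[01-99] PriorityTier=200
  PartitionName=long  MaxTime=14-00:00:00 State=UP Nodes=node[01-50] PriorityTier=100
REQUEUE only helps if the long jobs are safe to requeue; otherwise SUSPEND,GANG
or CANCEL would be the alternatives to weigh.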
Best,
Thomas