Hi Diego,
Yes, this can be tricky; we also use this feature. The billing is at the
partition level, so you can set different schemas.
We have nodes with 16 cores and 96GB of RAM; these are the cheapest nodes and
cost 1 SBU (System Billing Unit) in our model. For this node we have the following
On 06/08/20 10:00, Bas van der Vlies wrote:
Tks for the answer.
> We have nodes with 16 cores and 96GB of RAM; these are the cheapest nodes
> and cost 1 SBU in our model.
That is a theoretical 6 GB/core, 5.625 GB/core net.
> We multiply everything by 1000 to avoid Slurm's behaviour of truncating the
> result
On 06/08/20 10:00, Bas van der Vlies wrote:
Tks for the answer.
>> We also have nodes with GPUs (different types) and some cost more than the others.
> The partitions always have the same type of nodes, not mixed, e.g.:
> * TRESBillingWeights=CPU=3801.0,Mem=502246.0T,GRES/gpu=22807.0,GRES/gpu:t
On 06/08/20 12:46, Bas van der Vlies wrote:
> we have MAX(core, mem, gres). all resources can have the score: 91228
Ah, OK. So you have PriorityFlags=MAX_TRES too.
> So we take one of these maximum values, divide it again by 1000, and round
> it. Hopefully this explains it.
Yup, tks.
Now I
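To make that concrete, partition-level weights along these lines might look as
follows in slurm.conf; the partition names, node lists and numbers below are
illustrative only, not the actual weights from this thread:

  # illustrative slurm.conf fragment (hypothetical names and weights)
  PriorityFlags=MAX_TRES
  # cheapest CPU-only partition; weights multiplied by 1000 to avoid truncation
  PartitionName=cpu Nodes=node[001-050] TRESBillingWeights="CPU=1000,Mem=178G"
  # GPU partition; the GRES/gpu weight dominates the bill
  PartitionName=gpu Nodes=gpu[01-10] TRESBillingWeights="CPU=3801.0,Mem=502246.0T,GRES/gpu=22807.0"

With MAX_TRES the billable TRES is the most expensive single resource on the
node, so dividing that value by 1000 and rounding gives the SBU figure.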
Bas
Does that mean you are setting PriorityFlags=MAX_TRES?
Also does anyone understand this from the slurm.conf docs:
The weighted amount of a resource can be adjusted by adding a suffix of
K,M,G,T or P after the billing weight. For example, a memory weight of
"mem=.25" on a job allocat
On Thu, 2020-08-06 at 09:30 -0400, Paul Raines wrote:
> Bas
>
> Does that mean you are setting PriorityFlags=MAX_TRES?
>
YES
> Also does anyone understand this from the slurm.conf docs:
>
> The weighted amount of a resource can be adjusted by adding a suffix of
> K,M,G,T or P after the b
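For what it's worth, that paragraph seems to say the suffix only rescales the
unit the memory weight is applied to. A worked example with a hypothetical job
allocated 8GB of memory:

  mem=.25   ->  8192 MB * 0.25          = 2048 billing units
  mem=.25G  ->  8192 MB * (0.25 / 1024) =    2 billing units

So with a G suffix the same number effectively becomes a per-GB weight instead
of a per-MB one.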
Gerhard Strangar wrote:
> I'm experiencing a connectivity problem and I'm out of ideas why this
> is happening. I'm running slurmctld on a multihomed host.
>
> (10.9.8.0/8) - master - (10.11.12.0/8)
> There is no routing between these two subnets.
My topology.conf contained a loop, which resu
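For context, topology.conf is expected to describe a strict tree of switches
with no cycles; a minimal loop-free sketch with made-up switch and node names:

  # topology.conf -- hypothetical names; leaf switches hold nodes, one spine on top
  SwitchName=leaf1 Nodes=node[01-16]
  SwitchName=leaf2 Nodes=node[17-32]
  SwitchName=spine Switches=leaf1,leaf2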
Hello all,
Later this month, I will have to bring down, patch, and reboot all nodes in
our cluster for maintenance. The two options available to set nodes into a
maintenance mode seem to be either: 1) creating a system-wide reservation,
or 2) setting all nodes into a DRAIN state.
I'm not sure it
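For illustration, the two options roughly correspond to scontrol calls like
these (reservation name, start time, duration and node list are made up):

  # 1) system-wide maintenance reservation
  scontrol create reservation ReservationName=maint_aug StartTime=2020-08-24T08:00:00 Duration=08:00:00 Users=root Flags=maint,ignore_jobs Nodes=ALL
  # 2) drain all nodes: running jobs finish, nothing new starts
  scontrol update NodeName=node[01-64] State=DRAIN Reason="scheduled maintenance"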
Because we want to maximize usage, we have actually opted to just cancel
all running jobs the day of. We send out a notification to all the users
that this will happen. We haven't really seen any complaints, and we've
been doing this for years. At the start of the outage we set all
partitions to
On 06-08-2020 19:13, Jason Simms wrote:
Later this month, I will have to bring down, patch, and reboot all nodes
in our cluster for maintenance. The two options available to set nodes
into a maintenance mode seem to be either: 1) creating a system-wide
reservation, or 2) setting all nodes into
When I need to do something like this, I let the automatic Slurm management
do the job. I only shut down the nodes over SSH, replace something, then power
them on, and everything starts OK. The other option is to call resume in case
of any failure, and restart the Slurm services on the nodes... Regards
Ing. Gonzalo
We usually set up a reservation for maintenance. This prevents jobs
from starting if they are not expected to end before the reservation
(maintenance) starts.
As Paul indicated, this causes nodes to become idle (and the pending job queue
to grow) as maintenance time approaches, but avoids requiring
Regarding the question of methods for Slurm compute node OS and firmware
updates, we have for a long time used rolling updates while the cluster
is in full production, so that we do not waste any resources. When
entire partitions are upgraded in this way, there is no risk of starting
new jobs
On 8/6/20 10:13 am, Jason Simms wrote:
Later this month, I will have to bring down, patch, and reboot all nodes
in our cluster for maintenance. The two options available to set nodes
into a maintenance mode seem to be either: 1) creating a system-wide
reservation, or 2) setting all nodes into
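A rough sketch of one round of such a rolling update (hypothetical node range):

  # take one slice of nodes out of scheduling
  scontrol update NodeName=node[01-08] State=DRAIN Reason="rolling OS/firmware update"
  # wait for the running jobs on those nodes to finish, patch and reboot them,
  # then return them to service
  scontrol update NodeName=node[01-08] State=RESUME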
I can't find any advice online about how to tune things like MaxJobs on a
per-cluster or per-user basis.
As far as I can tell, the cluster MaxJobs on a default install seems to be
10,000, with MaxSubmit set the same. Those seem pretty low to me: are there
resources that get consumed if
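In case it helps, those limits live on the associations in the accounting
database and can be adjusted per user or per account with sacctmgr; the names
and numbers below are hypothetical:

  # raise the limits for one user
  sacctmgr modify user where name=alice set MaxJobs=20000 MaxSubmitJobs=40000
  # or for a whole account
  sacctmgr modify account where name=physics set MaxJobs=50000 MaxSubmitJobs=100000
  # inspect what is currently in effect
  sacctmgr show assoc where user=alice format=cluster,account,user,maxjobs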
We ran into a similar error --
A response from schedmd:
https://bugs.schedmd.com/show_bug.cgi?id=3890
Remediating steps until updates got us past this particular issue:
Check for "xcgroup_instantiate" errors and close nodes that show this in the
messages log. From the nodes listed here we close com
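A rough sketch of that check-and-close step (log path and node name are
hypothetical):

  # look for the cgroup error on a suspect node
  grep xcgroup_instantiate /var/log/messages
  # take an affected node out of service until it can be remediated
  scontrol update NodeName=compute-03 State=DRAIN Reason="xcgroup_instantiate errors"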
Thank you for the answer.
I wasn't aware of that file.
I'll look into it!
Best,
Jaekyeom
On Wed, Aug 5, 2020 at 3:27 AM Renfro, Michael wrote:
> Untested, but you should be able to use a job_submit.lua file to detect if
> the job was started with srun or sbatch:
>
> - Check with (job_desc.s
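The field being referenced is presumably job_desc.script, which is empty for
srun/salloc jobs and populated for sbatch jobs; a minimal, untested
job_submit.lua sketch along those lines:

  -- job_submit.lua sketch (untested): tell interactive jobs apart from sbatch
  function slurm_job_submit(job_desc, part_list, submit_uid)
      if job_desc.script == nil or job_desc.script == '' then
          slurm.log_info("interactive job (srun/salloc) from uid %u", submit_uid)
      else
          slurm.log_info("batch job (sbatch) from uid %u", submit_uid)
      end
      return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
  end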