Re: [slurm-users] Billing issue

2020-08-06 Thread Bas van der Vlies
Hi Diego, Yes, this can be tricky; we also use this feature. The billing is set at the partition level, so you can set different schemas. We have nodes with 16 cores and 96GB of RAM, and these are the cheapest nodes: they cost 1 SBU (System Billing Unit) in our model. For this node we have the following

Re: [slurm-users] Billing issue

2020-08-06 Thread Diego Zuccato
On 06/08/20 10:00, Bas van der Vlies wrote: Thanks for the answer. > We have nodes with 16 cores and 96GB of RAM and these are the cheapest nodes they cost in our model. Theoretical 6GB/core. 5.625 net. > We multiply everything by 1000 to avoid Slurm's behaviour of truncating the result

Re: [slurm-users] Billing issue

2020-08-06 Thread Bas van der Vlies
On 06/08/20 10:00, Bas van der Vlies wrote: Thanks for the answer. >> We also have nodes with GPUs (different types), and some cost more than others. > The partitions always have the same type of nodes, not mixed, e.g.: > * > TRESBillingWeights=CPU=3801.0,Mem=502246.0T,GRES/gpu=22807.0,GRES/gpu:t

Re: [slurm-users] Billing issue

2020-08-06 Thread Diego Zuccato
On 06/08/20 12:46, Bas van der Vlies wrote: > We have MAX(core, mem, gres). All resources can have the score: 91228 Ah, OK. So you have PriorityFlags=MAX_TRES too. > So we take one of these maximum values, divide it again by 1000 and round it. Hopefully this explains it. Yup, thanks. Now I
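
The arithmetic described in this exchange can be sketched roughly as follows. This is only an illustration, not Slurm's own code: the two weights are taken from the TRESBillingWeights line quoted in the thread, while the function name `billing_units`, the `scale` parameter, and the job allocations are hypothetical. It assumes PriorityFlags=MAX_TRES, so the largest weighted resource decides the bill, and it assumes the final divide-by-1000-and-round step is done by the site as described.

```python
# Weights copied from the thread (already multiplied by 1000 by the site).
WEIGHTS = {"CPU": 3801.0, "GRES/gpu": 22807.0}


def billing_units(weights, allocation, scale=1000):
    """Bill the largest weighted resource, then undo the x1000 scaling.

    `allocation` maps a resource name to the amount a job was given.
    Taking the maximum mirrors the MAX_TRES behaviour discussed above.
    """
    weighted = [weights[name] * amount for name, amount in allocation.items()]
    return round(max(weighted) / scale)


# Hypothetical job taking 24 cores and 4 GPUs: the GPU term (91228) wins.
print(billing_units(WEIGHTS, {"CPU": 24, "GRES/gpu": 4}))  # prints 91

# Hypothetical job taking 6 cores and 1 GPU.
print(billing_units(WEIGHTS, {"CPU": 6, "GRES/gpu": 1}))  # prints 23
```

Scaling the weights by 1000 keeps fractional differences between resources from being lost when the weighted values are truncated to integers.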

Re: [slurm-users] Billing issue

2020-08-06 Thread Paul Raines
Bas, does that mean you are setting PriorityFlags=MAX_TRES? Also, does anyone understand this from the slurm.conf docs: The weighted amount of a resource can be adjusted by adding a suffix of K,M,G,T or P after the billing weight. For example, a memory weight of "mem=.25" on a job allocat

Re: [slurm-users] Billing issue

2020-08-06 Thread Bas van der Vlies
On Thu, 2020-08-06 at 09:30 -0400, Paul Raines wrote: > Bas > > Does that mean you are setting PriorityFlags=MAX_TRES? > YES > Also does anyone understand this from the slurm.conf docs: > > The weighted amount of a resource can be adjusted by adding a suffix of K,M,G,T or P after the b
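
One reading of the suffix rule quoted from the slurm.conf docs, as a small sketch. It assumes memory is counted in megabytes, that a bare weight applies per MB, and that a G suffix divides the weight by 1024; treating T as a further factor of 1024 is an extrapolation from that. The helper names `parse_mem_weight` and `billed_memory` are hypothetical, and the other suffixes are left out because their effect on memory is not covered here.

```python
# Divisors relative to megabytes; only the cases discussed are included.
MEM_SUFFIX_DIVISOR = {"": 1, "G": 1024, "T": 1024 ** 2}


def parse_mem_weight(text):
    """Turn a weight such as '.25' or '.25G' into a per-megabyte weight."""
    suffix = text[-1].upper() if text[-1].isalpha() else ""
    number = float(text[:-1]) if suffix else float(text)
    return number / MEM_SUFFIX_DIVISOR[suffix]


def billed_memory(weight_text, allocated_mb):
    """Weighted memory for a job that was allocated `allocated_mb` MB."""
    return allocated_mb * parse_mem_weight(weight_text)


# A job allocated 8GB (8192MB), first with a bare weight, then with G.
print(billed_memory(".25", 8192))   # prints 2048.0
print(billed_memory(".25G", 8192))  # prints 2.0
```

Under this reading, a weight written with a T suffix, such as the Mem=502246.0T value earlier in the thread, is a price per terabyte rather than per megabyte.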

Re: [slurm-users] Debugging communication problems

2020-08-06 Thread Gerhard Strangar
Gerhard Strangar wrote: > I'm experiencing a connectivity problem and I'm out of ideas why this is happening. I'm running a slurmctld on a multihomed host. > > (10.9.8.0/8) - master - (10.11.12.0/8) > There is no routing between these two subnets. My topology.conf contained a loop, which resu

[slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Jason Simms
Hello all, Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into a DRAIN state. I'm not sure it

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Paul Edmon
Because we want to maximize usage we actually have opted to just cancel all running jobs the day of.  We send out notification to all the users that this will happen.  We haven't really seen any complaints and we've been doing this for years.  At the start of the outage we set all partitions to

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Ole Holm Nielsen
On 06-08-2020 19:13, Jason Simms wrote: Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Ing. Gonzalo E. Arroyo
When I need to do something like this, I let Slurm's automatic management do the job. I just shut the node down over SSH, replace something, then power it on, and everything starts OK. Another option is to call resume in case of any failure and restart the Slurm services on the nodes... Regards *Ing. Gonzalo

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Thomas M. Payerle
We usually set up a reservation for maintenance. This prevents jobs from starting if they are not expected to end before the reservation (maintenance) starts. As Paul indicated, this causes nodes to become idle (and pending job queue to grow) as maintenance time approaches, but avoids requiring
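
The effect of a maintenance reservation described here comes down to a simple time comparison, sketched below. This is a simplified model and not the scheduler's actual logic; the function name `can_start_before_maintenance` and all dates and time limits are hypothetical.

```python
from datetime import datetime, timedelta


def can_start_before_maintenance(now, time_limit, maintenance_start):
    """True if a job started now would hit its time limit before maintenance."""
    return now + time_limit <= maintenance_start


# Hypothetical maintenance window starting 24 hours from "now".
maintenance = datetime(2020, 8, 20, 8, 0)
now = datetime(2020, 8, 19, 8, 0)

# A 12-hour job fits before the window; a 48-hour job has to wait.
print(can_start_before_maintenance(now, timedelta(hours=12), maintenance))  # prints True
print(can_start_before_maintenance(now, timedelta(hours=48), maintenance))  # prints False
```

This is why nodes drain gradually as the window approaches: jobs with long time limits stop being eligible first, and short jobs can still fill the remaining gap.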

[slurm-users] Compute node OS and firmware updates

2020-08-06 Thread Ole Holm Nielsen
Regarding the question of methods for Slurm compute node OS and firmware updates, we have for a long time used rolling updates while the cluster is in full production, so that we do not waste any resources. When entire partitions are upgraded in this way, there is no risk of starting new jobs

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Christopher Samuel
On 8/6/20 10:13 am, Jason Simms wrote: Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into

[slurm-users] Tuning MaxJobs and MaxJobsSubmit per user and for the whole cluster?

2020-08-06 Thread Hoyle, Alan P
I can't find any advice online about how to tune things like MaxJobs on a per-cluster or per-user basis. As far as I can tell, the default cluster MaxJobs on a fresh install is 10,000, and MaxSubmit is the same. Those seem pretty low to me: are there resources that get consumed if

Re: [slurm-users] Slurmstepd errors

2020-08-06 Thread Williams, Jenny Avis
We ran into a similar error. A response from SchedMD: https://bugs.schedmd.com/show_bug.cgi?id=3890 Remediating steps until updates got us past this particular issue: Check for "xcgroup_instantiate errors" and close nodes that show this in messages log. From the nodes listed here we close com

Re: [slurm-users] Correct way to give srun and sbatch different MaxTime values?

2020-08-06 Thread Jaekyeom Kim
Thank you for the answer. I wasn't aware of that file. I'll look into it! Best, Jaekyeom On Wed, Aug 5, 2020 at 3:27 AM Renfro, Michael wrote: > Untested, but you should be able to use a job_submit.lua file to detect if > the job was started with srun or sbatch: > >- Check with (job_desc.s