[slurm-dev] Re: Finding job command after fails

2017-10-18 Thread Marcin Stolarek
Some time ago we were using a slurmctld prolog for this. 2017-10-16 16:36 GMT+02:00 Ryan Richholt: > Thanks, that sounds like a good idea. A prolog script could also handle > this, right? That way, if the node crashes while the job is running, it would > still be saved. > > On Mon, Oct 16, 2017

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-18 Thread Benjamin LIPERE
Sorry, bad phone typo. On 18 Oct 2017 08:07, "Benjamin LIPERE" wrote: Wellington, for security, first wrong starting. HPC not secure. Except if you have a 10pers team. I hope that at least you put the cluster behind a router firewall in a militarisation zone. If you d'idées not second score in

[slurm-dev] how is slurm calculating memory TRES

2017-10-18 Thread Ilja Livenson
Hello, I have a noob question regarding the accounting in SLURM. In particular, I'm trying to figure out how memory TRES accounting is done in SLURM. Concrete case: A user has submitted 2 short jobs under a certain account. Now I want to get what has happened in the account with sreport. While c

[slurm-dev] node selection

2017-10-18 Thread Michael Di Domenico
is there any way after a job starts to determine why the scheduler chose the series of nodes it did? for some reason on an empty cluster when i spin up a large job it's staggering the allocation across a seemingly random allocation of nodes. we're using backfill/cons_res + gres, and all the nodes

[slurm-dev] Re: mysql job_table and step_table growth

2017-10-18 Thread Douglas Meyer
Thank you in advance. [2017-07-02T00:00:08.700] Warning: Note very large processing time from daily_rollup for slurmhpc: usec=7346971 began=00:00:01.353 [2017-07-03T00:00:08.130] Warning: Note very large processing time from daily_rollup for slurmhpc: usec=7368223 began=00:00:00.762 [2017-07-04T

[slurm-dev] "unrecognized key: OverSubscribe" for partition

2017-10-18 Thread Christian Leitold
Hello, I am running a small cluster, and recently we wanted to enable the OverSubscribe option for the default partition in order to allow jobs to share a node, as described here: https://slurm.schedmd.com/cons_res_share.html However, when I try to enable the option, I get an error message: sco

[slurm-dev] Re: "unrecognized key: OverSubscribe" for partition

2017-10-18 Thread Benjamin Redling
Hello Christian, On 18.10.2017 at 21:26, Christian Leitold wrote: > I am running a small cluster, and recently we wanted to enable > the OverSubscribe option for the default partition in order to allow > jobs to share a node, as described here: oversubscribe fka. shared > SelectType=select/lin

[slurm-dev] Re: mysql job_table and step_table growth

2017-10-18 Thread Christopher Samuel
On 19/10/17 05:24, Douglas Meyer wrote: > We have job_table purge set for 61 days and step_table for 11. Seems > to have no impact. So you have this in slurmdbd.conf? PurgeJobAfter=61days PurgeStepAfter=11days Anything in the logs when you start up slurmdbd? What does this say? sacctmgr lis

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-18 Thread Nadav Toledo
Hey Benjamin, I am sorry, English is not my mother tongue, so I barely understand what you wrote. Can you explain when you have more time? Thanks, Nadav On 18/10/2017 17:59, Benjamin LIPERE wrote: Sorry, bad Phone typo

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-18 Thread Christopher Samuel
On 18/10/17 16:27, Nadav Toledo wrote: > about B: The reason is I don't want to manually add each user to > the slurm database (sacctmgr create user...) I'm afraid you don't really have an option there, if you want to use the slurmdbd limits then you're going to need to add the users to the

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-18 Thread Nadav Toledo
Hey Chris, the problem is that even after adding a user: sacctmgr create user domain_name\\user_name account=research, restarting slurmctld, and trying to run a job as that user: srun bash, the result is: slurmctld: error: User 243309139 not found slurmctld: _job_create: invalid acc