Re: [slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

2021-07-01 Thread Ole Holm Nielsen
On 7/2/21 7:34 AM, Jack Chen wrote: Slurm is great to use, I've developed several plugins on it. Now I'm working on an issue in slurm. I'm using Slurm 15.08-11, after I enabled cgroup, some training job's task is killed after a few hours. This can be reproduced several times. After turning of

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Christopher Samuel
On 7/1/21 7:08 am, Brian Andrus wrote: I have a partition where one of the nodes has a node-locked license. That license is not used by everyone that uses the partition. This might be a case for using a reservation on that node with the MaxStartDelay flag to set the maximum amount of time (in

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Loris Bennett
Hi Tina, Tina Friedrich writes: > Hi Brian, > > sometimes it would be nice if SLURM had what Grid Engine calls a 'forced > complex' (i.e. a feature that you *have* to request to land on a node that has > it), wouldn't it? > > I do something like that for all of my 'special' nodes (GPU, KNL, node

[slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

2021-07-01 Thread Jack Chen
Slurm is great to use, I've developed several plugins on it. Now I'm working on an issue in slurm. I'm using Slurm 15.08-11, after I enabled cgroup, some training job's task is killed after a few hours. This can be reproduced several times. After turning off cgroup, it disappears. Linux kernel: 3

Re: [slurm-users] Is anyone running the slurmctld and slurmdbd services from within a container?

2021-07-01 Thread slurm-maillist
Hi, we tried it out on Google Cloud with GPU nodes running on another provider through a site-to-site VPN. The database was on a managed GCloud instance. There are indeed points that you need to consider: - Microservice: the maximalist dream "1 process = 1 container" is not possible for slurmc

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-07-01 Thread Christopher Samuel
On 7/1/21 3:26 pm, Sid Young wrote: I have exactly the same issue with a user who needs the reported cores to reflect the requested cores. If you find a solution that works please share. :) The number of CPUs in the system vs the number of CPUs you can access are very different things. You c
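The distinction drawn above between CPUs present in the system and CPUs a process can access can be seen from inside a job with a few lines of Python. This is a minimal sketch, assuming a Linux node (os.sched_getaffinity is not available on every platform); the function name cpu_counts is made up for illustration.

```python
import os


def cpu_counts():
    """Return (cpus_in_system, cpus_usable_by_this_process)."""
    # os.cpu_count() reports the CPUs the OS knows about, regardless of
    # any affinity mask or cpuset applied to this process.
    total = os.cpu_count() or 0
    # os.sched_getaffinity(0) returns the set of CPUs the calling
    # process is actually allowed to run on (Linux only).
    usable = len(os.sched_getaffinity(0))
    return total, usable


if __name__ == "__main__":
    total, usable = cpu_counts()
    print(f"CPUs in system: {total}, CPUs usable by this process: {usable}")
```

Software that sizes its thread pool from the first number will oversubscribe a job that was only granted the second.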

[slurm-users] Slurm version 20.11.8 is now available

2021-07-01 Thread Tim Wickberg
We are pleased to announce the availability of Slurm version 20.11.8. This includes a number of minor-to-moderate severity bug fixes. Slurm can be downloaded from https://www.schedmd.com/downloads.php . - Tim -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development an

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-07-01 Thread Sid Young
Hi Luis, I have exactly the same issue with a user who needs the reported cores to reflect the requested cores. If you find a solution that works please share. :) Thanks Sid Young Translational Research Institute Sid Young W: https://off-grid-engineering.com W: (personal) https://sidyoung

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-07-01 Thread Luis R. Torres
Hi Folks, Thank you for your responses, I wrote the following configuration in cgroup.conf along with the appropriate slurm.conf changes, and I wrote a program to verify affinity when queued or running in the cluster. Results are below. Thanks so much. ### # # Slurm cgroup support configuration file
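The verification program mentioned in this message is not shown in the archive. As a stand-in, here is a minimal sketch of one way to check which CPUs a process is confined to, by reading the Cpus_allowed_list field from /proc/self/status on Linux. The helper names parse_cpu_list and allowed_cpus are hypothetical and are not the author's code.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            # A range like '0-3' is inclusive on both ends.
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(status_path="/proc/self/status"):
    """Return the CPUs this process may run on, per the kernel."""
    with open(status_path) as fh:
        for line in fh:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return set()


if __name__ == "__main__":
    print(sorted(allowed_cpus()))
```

Run under srun or inside a batch script, the size of the printed list can be compared with the number of CPUs the job requested.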

[slurm-users] When using RequeueExit in Slurm.conf, can you limit the # of requeues?

2021-07-01 Thread David Henkemeyer
Hello, I am investigating Slurm's ability to do requeuing of jobs. I like the fact that I can set RequeueExit= in the slurm.conf file, since this will automatically requeue jobs that exit with the specified exit codes. But, is there a way to limit the # of requeues? Thanks David
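One possible workaround, sketched here under assumptions rather than taken from the thread, is to have the job itself decide whether to exit with a requeue code. Slurm documents a SLURM_RESTART_COUNT environment variable for restarted or requeued jobs; whether it is incremented for requeues triggered by RequeueExit should be verified on the site's Slurm version. The exit codes 99 and 1, the limit of 3, and the function name are all hypothetical.

```python
import os
import sys


def exit_code_for_failure(requeue_code, final_code, max_requeues,
                          environ=os.environ):
    """Pick the exit code a failing job should return.

    Returns requeue_code (assumed to be listed in RequeueExit) while the
    job has been restarted fewer than max_requeues times, else final_code.
    """
    # SLURM_RESTART_COUNT is absent on the first run, so default to 0.
    restarts = int(environ.get("SLURM_RESTART_COUNT", "0"))
    return requeue_code if restarts < max_requeues else final_code


if __name__ == "__main__":
    # Hypothetical values: RequeueExit=99 in slurm.conf, at most 3 requeues.
    sys.exit(exit_code_for_failure(99, 1, 3))
```

A job script would call this only on its failure path, so that successful runs still exit with 0.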

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Brian Andrus
Lyn, Yeah, I think this is it. Looks similar to what Tina has in place too. So, we set all the nodes as either "FEATURE" or "NOFEATURE" and in job_submit.lua set it to 'NOFEATURE' if it is not set. Sound like what you are doing? I may need some hints on what to specifically set in the lua sc
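The defaulting rule described here (use "NOFEATURE" when the job sets no feature) is small enough to sketch. A real job_submit plugin would be written in Lua against Slurm's job descriptor; the Python below only mirrors the decision logic, and the function name and argument are illustrative, not Slurm API.

```python
def default_features(requested, default="NOFEATURE"):
    """Return the feature string a job should be submitted with.

    `requested` stands in for whatever the user passed as a feature
    constraint; None or an empty string means nothing was requested.
    """
    if requested is None or not requested.strip():
        # No constraint given: steer the job away from the special node.
        return default
    # The user asked for something explicitly: leave it untouched.
    return requested
```

With this rule, only jobs that explicitly request "FEATURE" can land on the licensed node.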

Re: [slurm-users] [External] What is an easy way to prevent users run programs on the master/login node.

2021-07-01 Thread Prentice Bisbal
I'm not sure. I just installed Arbiter myself only a few weeks ago, and I'm still learning it. The systems it's installed on haven't gone live yet, so I haven't had many "learning opportunities" yet. Arbiter is using cgroups, so I would imagine that depends on whether cgroups distinguishes betw

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Ryan Cox
Brian, Would a reservation on that node work?  I think you could even do a combination of MAGNETIC and features in the reservation itself if you wanted to minimize hassle, though that probably doesn't add much beyond just requiring that the reservation name be specified by people who want to

Re: [slurm-users] SLUG '21

2021-07-01 Thread Tim Wickberg
Unfortunately we will not be holding SLUG'21 in person. We expect to have a virtual event again this year on Tuesday, September 21st. I'll have more details as we get closer to that date. - Tim On 7/1/21 8:07 AM, Paul Brunk wrote: Hi: It's that time again...we're doing travel budget plannin

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Tina Friedrich
Hi Brian, sometimes it would be nice if SLURM had what Grid Engine calls a 'forced complex' (i.e. a feature that you *have* to request to land on a node that has it), wouldn't it? I do something like that for all of my 'special' nodes (GPU, KNL, nodes...) - I want to avoid jobs not requestin

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Lyn Gerner
Hey, Brian, Neither I nor you are going to like what I'm about to say (but I think it's where you're headed). :) We have an equivalent use case, where we're trying to keep long work off of a certain number of nodes. Since we already have used "long" as a QoS name, to keep from overloading "long,"

[slurm-users] Specific limits over GRES - still relevant?

2021-07-01 Thread Matthias Leopold
Hi, I'm trying to prepare for using Slurm with DGX A100 systems with MIG configuration. I will have several gres:gpu types there so I tried to reproduce the situation described in "Specific limits over GRES" from https://slurm.schedmd.com/resource_limits.html, but I can't. In my test environ

[slurm-users] How to avoid a feature?

2021-07-01 Thread Brian Andrus
All, I have a partition where one of the nodes has a node-locked license. That license is not used by everyone that uses the partition. They are cloud nodes, so weights do not work (there is an open bug about that). I need to have jobs 'avoid' that node by default. I am thinking I can use a f

[slurm-users] SLUG '21

2021-07-01 Thread Paul Brunk
Hi: It's that time again...we're doing travel budget planning. Do we have a sense of whether or how there will be a user group meeting this year? I saw the April poll. Thanks! -- Grinning like an idiot, Paul Brunk, system administrator Georgia Advanced Computing Resource Center (GACRC) Enterpr

Re: [slurm-users] 答复: Is there bug in PrivateData=jobs option of slurmdbd?

2021-07-01 Thread Brian Andrus
Ok. You may want to check your slurmdbd host(s) and ensure the users are known there. If it does not know who a user is, it will not allow access to the data. If you are running sssd, clear the cache and such too. Brian Andrus On 7/1/2021 1:12 AM, taleinterve...@sjtu.edu.cn wrote: I can

[slurm-users] 答复: Is there bug in PrivateData=jobs option of slurmdbd?

2021-07-01 Thread taleintervenor
I can make sure the test job is running (of course in the default time window) when doing sacct query, and here is the new test record which describes it more clearly: [2021-07-01T16:02:42+0800][hpczty@cas013] ~/downloads> sbatch testjob.sh Submitted batch job 6955371 [2021-07-01T16:02:48+0