date:20210701

Re: [slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

2021-07-01 Thread Ole Holm Nielsen

On 7/2/21 7:34 AM, Jack Chen wrote: Slurm is great to use, I've developed several plugins on it. Now I'm working on an issue in Slurm. I'm using Slurm 15.08-11; after I enabled cgroup, some training job's task is killed after a few hours. This can be reproduced several times. After turning of

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Christopher Samuel

On 7/1/21 7:08 am, Brian Andrus wrote: I have a partition where one of the nodes has a node-locked license. That license is not used by everyone that uses the partition. This might be a case for using a reservation on that node with the MaxStartDelay flag to set the maximum amount of time (in

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Loris Bennett

Hi Tina, Tina Friedrich writes: > Hi Brian, > > sometimes it would be nice if SLURM had what Grid Engine calls a 'forced > complex' (i.e. a feature that you *have* to request to land on a node that has > it), wouldn't it? > > I do something like that for all of my 'special' nodes (GPU, KNL, node

[slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

2021-07-01 Thread Jack Chen

Slurm is great to use, I've developed several plugins on it. Now I'm working on an issue in Slurm. I'm using Slurm 15.08-11; after I enabled cgroup, some training job's task is killed after a few hours. This can be reproduced several times. After turning off cgroup, it disappears. Linux kernel: 3

Re: [slurm-users] Is anyone running the slurmctld and slurmdbd services from within a container?

2021-07-01 Thread slurm-maillist

Hi, we tried it out on Google Cloud with GPU nodes running on another provider through site-to-site VPN. The database was on a managed GCloud instance. There are indeed points that you need to consider: - Microservice: the maximalist dream "1 process = 1 container" is not possible for slurmc

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-07-01 Thread Christopher Samuel

On 7/1/21 3:26 pm, Sid Young wrote: I have exactly the same issue with a user who needs the reported cores to reflect the requested cores. If you find a solution that works please share. :) The number of CPUs in the system vs the number of CPUs you can access are very different things. You c
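The distinction drawn in this reply, between the CPUs present in the system and the CPUs a process may actually use, can be seen from inside a job with a few lines of Python. This is only a minimal sketch, assuming a Linux node (os.sched_getaffinity is not available on every platform); the helper name cpu_counts is made up for illustration and does not come from the thread.

```python
import os


def cpu_counts():
    """Return (cpus_in_system, cpus_accessible) for the current process.

    cpu_counts is a hypothetical helper name used only for this sketch.
    """
    # Logical CPUs reported for the whole machine (may be None if unknown).
    total = os.cpu_count()
    # Linux-only: the set of CPUs this process is allowed to run on.
    # Under CPU binding or a cpuset restriction this can be smaller than total.
    accessible = len(os.sched_getaffinity(0))
    return total, accessible


if __name__ == "__main__":
    total, accessible = cpu_counts()
    print(f"CPUs in system: {total}")
    print(f"CPUs accessible to this process: {accessible}")
```

Run inside an allocation, the second number is the one that reflects what the job can actually use; software that sizes its thread pool from the first number is what typically produces the mismatch described in this thread.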

[slurm-users] Slurm version 20.11.8 is now available

2021-07-01 Thread Tim Wickberg

We are pleased to announce the availability of Slurm version 20.11.8. This includes a number of minor-to-moderate severity bug fixes. Slurm can be downloaded from https://www.schedmd.com/downloads.php . - Tim -- Tim Wickberg Chief Technology Officer, SchedMD LLC Commercial Slurm Development an

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-07-01 Thread Sid Young

Hi Luis, I have exactly the same issue with a user who needs the reported cores to reflect the requested cores. If you find a solution that works please share. :) Thanks Sid Young Translational Research Institute Sid Young W: https://off-grid-engineering.com W: (personal) https://sidyoung

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-07-01 Thread Luis R. Torres

Hi Folks, Thank you for your responses, I wrote the following configuration in cgroup.conf along with the appropriate slurm.conf changes and I wrote a program to verify affinity when queued or running in the cluster. Results are below. Thanks so much. ### # # Slurm cgroup support configuration file

[slurm-users] When using RequeueExit in Slurm.conf, can you limit the # of requeues?

2021-07-01 Thread David Henkemeyer

Hello, I am investigating Slurm's ability to do requeuing of jobs. I like the fact that I can set RequeueExit= in the slurm.conf file, since this will automatically requeue jobs that exit with the specified exit codes. But, is there a way to limit the # of requeues? Thanks David

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Brian Andrus

Lyn, Yeah, I think this is it. Looks similar to what Tina has in place too. So, we set all the nodes as either "FEATURE" or "NOFEATURE" and in job_submit.lua set it to 'NOFEATURE' if it is not set. Sound like what you are doing? I may need some hints on what to specifically set in the lua sc
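The defaulting rule described in this message (use 'NOFEATURE' when the job did not set a feature) is simple enough to sketch. In Slurm the rule would live in job_submit.lua; the Python below only illustrates the decision logic and is not Slurm's plugin API. The function name apply_default_feature and its parameters are hypothetical.

```python
from typing import Optional


def apply_default_feature(requested: Optional[str],
                          default: str = "NOFEATURE") -> str:
    """Return the feature constraint a job should end up with.

    Hypothetical helper: 'requested' stands in for whatever feature
    string the user supplied at submit time (None or empty if none).
    """
    # No feature requested: steer the job to the default node class,
    # which keeps it off the node tagged "FEATURE".
    if requested is None or not requested.strip():
        return default
    # The user asked for something explicitly: leave it untouched.
    return requested


if __name__ == "__main__":
    for value in (None, "", "FEATURE"):
        print(repr(value), "->", apply_default_feature(value))
```

This relies on every node carrying exactly one of the two tags, as the message describes; a job that explicitly requests "FEATURE" is the only kind that can land on the licensed node.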

Re: [slurm-users] [External] What is an easy way to prevent users run programs on the master/login node.

2021-07-01 Thread Prentice Bisbal

I'm not sure. I just installed Arbiter myself only a few weeks ago, and I'm still learning it. The systems it's installed on haven't gone live yet, so I haven't had many "learning opportunities" yet. Arbiter is using cgroups, so I would imagine that depends on whether cgroups distinguishes betw

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Ryan Cox

Brian, Would a reservation on that node work? I think you could even do a combination of MAGNETIC and features in the reservation itself if you wanted to minimize hassle, though that probably doesn't add much beyond just requiring that the reservation name be specified by people who want to

Re: [slurm-users] SLUG '21

2021-07-01 Thread Tim Wickberg

Unfortunately we will not be holding SLUG'21 in person. We expect to have a virtual event again this year on Tuesday, September 21st. I'll have more details as we get closer to that date. - Tim On 7/1/21 8:07 AM, Paul Brunk wrote: Hi: It's that time again...we're doing travel budget plannin

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Tina Friedrich

Hi Brian, sometimes it would be nice if SLURM had what Grid Engine calls a 'forced complex' (i.e. a feature that you *have* to request to land on a node that has it), wouldn't it? I do something like that for all of my 'special' nodes (GPU, KNL, nodes...) - I want to avoid jobs not requestin

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Lyn Gerner

Hey, Brian, Neither I nor you are going to like what I'm about to say (but I think it's where you're headed). :) We have an equivalent use case, where we're trying to keep long work off of a certain number of nodes. Since we already have used "long" as a QoS name, to keep from overloading "long,"

[slurm-users] Specific limits over GRES - still relevant?

2021-07-01 Thread Matthias Leopold

Hi, I'm trying to prepare for using Slurm with DGX A100 systems with MIG configuration. I will have several gres:gpu types there so I tried to reproduce the situation described in "Specific limits over GRES" from https://slurm.schedmd.com/resource_limits.html, but I can't. In my test environ

[slurm-users] How to avoid a feature?

2021-07-01 Thread Brian Andrus

All, I have a partition where one of the nodes has a node-locked license. That license is not used by everyone that uses the partition. They are cloud nodes, so weights do not work (there is an open bug about that). I need to have jobs 'avoid' that node by default. I am thinking I can use a f

[slurm-users] SLUG '21

2021-07-01 Thread Paul Brunk

Hi: It's that time again...we're doing travel budget planning. Do we have a sense of whether or how there will be a user group meeting this year? I saw the April poll. Thanks! -- Grinning like an idiot, Paul Brunk, system administrator Georgia Advanced Computing Resource Center (GACRC) Enterpr

Re: [slurm-users] Reply: Is there bug in PrivateData=jobs option of slurmdbd?

2021-07-01 Thread Brian Andrus

Ok. You may want to check your slurmdbd host(s) and ensure the users are known there. If it does not know who a user is, it will not allow access to the data. If you are running sssd, clear the cache and such too. Brian Andrus On 7/1/2021 1:12 AM, taleinterve...@sjtu.edu.cn wrote: I can

[slurm-users] Reply: Is there bug in PrivateData=jobs option of slurmdbd?

2021-07-01 Thread taleintervenor

I can make sure the test job is running (of course in the default time window) when doing sacct query, and here is the new test record which describe it more clearly: [2021-07-01T16:02:42+0800][hpczty@cas013] ~/downloads> sbatch testjob.sh Submitted batch job 6955371 [2021-07-01T16:02:48+0

21 matches
