On 7/2/21 7:34 AM, Jack Chen wrote:
Slurm is great to use; I've developed several plugins for it. Now I'm
working on an issue in Slurm.
I'm using Slurm 15.08-11. After I enabled cgroup, some training jobs' tasks
are killed after a few hours. This can be reproduced several times. After
turning off cgroup, it disappears.
On 7/1/21 7:08 am, Brian Andrus wrote:
I have a partition where one of the nodes has a node-locked license.
That license is not used by everyone that uses the partition.
This might be a case for using a reservation on that node with the
MaxStartDelay flag to set the maximum amount of time (in
Hi Tina,
Tina Friedrich writes:
> Hi Brian,
>
> sometimes it would be nice if SLURM had what Grid Engine calls a 'forced
> complex' (i.e. a feature that you *have* to request to land on a node that has
> it), wouldn't it?
>
> I do something like that for all of my 'special' nodes (GPU, KNL, node
Slurm is great to use; I've developed several plugins for it. Now I'm
working on an issue in Slurm.
I'm using Slurm 15.08-11. After I enabled cgroup, some training jobs' tasks
are killed after a few hours. This can be reproduced several times. After
turning off cgroup, it disappears.
Linux kernel: 3
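(If the kills line up with the job nearing its memory limit, the memory
cgroup's OOM killer is the usual suspect on 3.x kernels. A minimal check,
assuming Slurm's default cgroup v1 layout; the uid and job ID are made up:
  # look for OOM-killer activity in the kernel log
  dmesg -T | grep -iE 'oom|killed process'
  # inspect the memory limit Slurm set on the job's cgroup
  cat /sys/fs/cgroup/memory/slurm/uid_1000/job_12345/memory.limit_in_bytes
)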
Hi,
we tried it out on Google Cloud with GPU nodes running on another provider
through site-to-site VPN. The database was on a managed GCloud instance.
There are indeed points that you need to consider:
- Microservice: the maximalist dream "1 process = 1 container" is not possible
for slurmc
On 7/1/21 3:26 pm, Sid Young wrote:
I have exactly the same issue with a user who needs the reported cores
to reflect the requested cores. If you find a solution that works please
share. :)
The number of CPUs in the system vs the number of CPUs you can access
are very different things. You c
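(The difference is easy to demonstrate from inside an allocation; a small
sketch, where the -c value is arbitrary:
  # total CPUs installed on the node vs. CPUs this task may actually use
  srun -c 2 sh -c 'nproc --all; grep Cpus_allowed_list /proc/self/status'
nproc without --all, by contrast, respects the task's affinity mask.)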
We are pleased to announce the availability of Slurm version 20.11.8.
This includes a number of minor-to-moderate severity bug fixes.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
Hi Luis,
I have exactly the same issue with a user who needs the reported cores to
reflect the requested cores. If you find a solution that works please
share. :)
Thanks
Sid Young
Translational Research Institute
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung
Hi Folks,
Thank you for your responses. I wrote the following configuration in
cgroup.conf along with the appropriate slurm.conf
changes, and I wrote a program to verify affinity when queued or running in
the cluster. Results are below. Thanks so much.
###
#
# Slurm cgroup support configuration file
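(For context, a minimal sketch of the kind of settings such a file
typically contains; the values below are illustrative assumptions, not
necessarily the poster's:
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=yes
)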
Hello,
I am investigating Slurm's ability to do requeuing of jobs. I like the
fact that I can set RequeueExit= in the slurm.conf file,
since this will automatically requeue jobs that exit with the specified
exit codes. But, is there a way to limit the number of requeues?
Thanks
David
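(One way to cap requeues, as a sketch: pair RequeueExit with a guard on
SLURM_RESTART_COUNT, which Slurm sets on requeued batch jobs. The exit
codes and threshold here are illustrative:
  # slurm.conf: requeue on these exit codes
  RequeueExit=100,101

  # in the batch script: stop requeueing after 3 restarts
  if [ "${SLURM_RESTART_COUNT:-0}" -ge 3 ]; then
      echo "giving up after ${SLURM_RESTART_COUNT} requeues"
      exit 0   # an exit code NOT listed in RequeueExit
  fi
)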
Lyn,
Yeah, I think this is it. Looks similar to what Tina has in place too.
So, we set all the nodes as either "FEATURE" or "NOFEATURE" and in
job_submit.lua set it to 'NOFEATURE' if it is not set.
Sounds like what you are doing?
I may need some hints on what to specifically set in the lua script.
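(For reference, a minimal job_submit.lua sketch of that idea, untested
and using the NOFEATURE placeholder from above:
  -- default jobs onto the NOFEATURE nodes unless a feature was requested
  function slurm_job_submit(job_desc, part_list, submit_uid)
     if job_desc.features == nil or job_desc.features == '' then
        job_desc.features = 'NOFEATURE'
     end
     return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
     return slurm.SUCCESS
  end
)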
I'm not sure. I just installed Arbiter myself only a few weeks ago, and
I'm still learning it. The systems it's installed on haven't gone live
yet, so I haven't had many "learning opportunities" yet. Arbiter is
using cgroups, so I would imagine that depends on whether cgroups
distinguishes betw
Brian,
Would a reservation on that node work? I think you could even do a
combination of MAGNETIC and features in the reservation itself if you
wanted to minimize hassle, though that probably doesn't add much beyond
just requiring that the reservation name be specified by people who want
to
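(Something like this, as a sketch; the node, user, and reservation names
are invented:
  scontrol create reservation ReservationName=licnode \
      Nodes=node42 Users=alice,bob Flags=MAGNETIC \
      StartTime=now Duration=infinite
)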
Unfortunately we will not be holding SLUG'21 in person.
We expect to have a virtual event again this year on Tuesday, September
21st. I'll have more details as we get closer to that date.
- Tim
On 7/1/21 8:07 AM, Paul Brunk wrote:
Hi:
It's that time again...we're doing travel budget planning
Hi Brian,
sometimes it would be nice if SLURM had what Grid Engine calls a 'forced
complex' (i.e. a feature that you *have* to request to land on a node
that has it), wouldn't it?
I do something like that for all of my 'special' nodes (GPU, KNL,
nodes...) - I want to avoid jobs not requestin
Hey, Brian,
Neither of us is going to like what I'm about to say (but I think it's
where you're headed). :)
We have an equivalent use case, where we're trying to keep long work off of
a certain number of nodes. Since we already have used "long" as a QoS name,
to keep from overloading "long,"
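(The shape of that setup, as a sketch; the QOS name, limit, and partition
details are invented:
  # a QOS for long-running work
  sacctmgr add qos verylong
  sacctmgr modify qos verylong set MaxWall=14-00:00:00
  # slurm.conf: keep that QOS off the protected nodes' partition
  PartitionName=general Nodes=node[01-10] DenyQos=verylong
)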
Hi,
I'm trying to prepare for using Slurm with DGX A100 systems with MIG
configuration. I will have several gres:gpu types there so I tried to
reproduce the situation described in "Specific limits over GRES" from
https://slurm.schedmd.com/resource_limits.html, but I can't.
In my test environ
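(For comparison, the pattern the resource_limits page describes, as a
sketch; the MIG type name and limit are illustrative, and the typed GRES
must appear in AccountingStorageTRES before limits on it take effect:
  # slurm.conf
  AccountingStorageTRES=gres/gpu,gres/gpu:1g.5gb
  # enforce a per-user cap on the specific type via a QOS
  sacctmgr modify qos normal set MaxTRESPerUser=gres/gpu:1g.5gb=1
)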
All,
I have a partition where one of the nodes has a node-locked license.
That license is not used by everyone that uses the partition.
They are cloud nodes, so weights do not work (there is an open bug about
that).
I need to have jobs 'avoid' that node by default. I am thinking I can
use a feature
Hi:
It's that time again...we're doing travel budget planning. Do we have
a sense of whether or how there will be a user group meeting this
year? I saw the April poll.
Thanks!
--
Grinning like an idiot,
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center (GACRC)
Enterpr
Ok.
You may want to check your slurmdbd host(s) and ensure the users are
known there. If it does not know who a user is, it will not allow access
to the data.
If you are running sssd, clear its cache as well.
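(Concretely, something along these lines on the slurmdbd host; the
username is a placeholder:
  id someuser                             # does the host resolve the user?
  sacctmgr show user someuser withassoc   # does slurmdbd know them?
  sss_cache -E                            # flush the sssd cache, as root
)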
Brian Andrus
On 7/1/2021 1:12 AM, taleinterve...@sjtu.edu.cn wrote:
I can
I can make sure the test job is running (of course within the default time
window) when doing the sacct query, and here is the new test record, which
describes it more clearly:
[2021-07-01T16:02:42+0800][hpczty@cas013] ~/downloads> sbatch testjob.sh
Submitted batch job 6955371
[2021-07-01T16:02:48+0
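(A follow-up query that sidesteps the default time window, as a sketch,
using the job ID from the transcript above:
  sacct -j 6955371 --format=JobID,State,Start,Elapsed
)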