...ok... sure. I had no idea where the "parent" label came from. This
makes perfect sense. It will default to "1", I think.
On Sat, Aug 10, 2024 at 12:24 PM Ryan Cox wrote:
> fairshare=parent sets the user association to effectively compete at the
> account level, so this is behaving as intended.
...and there's not actually one account in your setup, is there? There
should at least be a "root" and a "mic" account, I think.
I don't recall whether you'd sent the output of "sshare | head -15"...
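For reference, a sketch of flipping an association between the two behaviors (account and user names here are hypothetical):

# give the association its own share, or let it compete at the account level
sacctmgr modify user where name=alice account=mic set fairshare=1
sacctmgr modify user where name=alice account=mic set fairshare=parent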
On Sat, Aug 10, 2024 at 2:30 PM Fulcomer, Samuel wrote:
We use the following relevant settings...
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=00:02:00
PriorityMaxAge=3-0
PriorityWeightAge=0
PriorityWeightFairshare=200
PriorityWeightJobSize=1
PriorityWeightPartition=200
PriorityWeightQOS=100
PriorityWeightTRES=
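To see how such weights combine per pending job, sprio is handy; a minimal check (not from the original thread):

# per-factor priority contributions for pending jobs
sprio -l | head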
..."sshare" (no arguments) would be
useful for me.
On Fri, Aug 9, 2024 at 9:52 PM Drucker, Daniel
wrote:
> On Aug 9, 2024, at 9:21 PM, Fulcomer, Samuel
> wrote:
> > And note that the high PriorityWeightAge may be complicating things. We
> set it to 0. With it set so high, it...
...ing to run.
On Fri, Aug 9, 2024 at 9:15 PM Fulcomer, Samuel
wrote:
> ...and what are the top 10-15 lines in your share output?...
>
> On Fri, Aug 9, 2024 at 9:07 PM Drucker, Daniel <
> ddruc...@mclean.harvard.edu> wrote:
>
>> PriorityType=priority/multifactor
>> P...
...PriorityWeightQOS=0
>
> In 21.08.8.
>
>
>
> On Aug 9, 2024, at 8:36 PM, Fulcomer, Samuel
> wrote:
>
> External Email - Use Caution
>
> Yes, well, in that case, it should work as you desire, modulo your
> slurm.conf settings. What are the relevant lines in yours?
>
> ...let's say user B has completed a million jobs in the last few days
> as well, and user A has never submitted any before.
>
> On Aug 9, 2024, at 6:03 PM, Fulcomer, Samuel
> wrote:
>
> External Email - Use Caution
>
> I don't think fairshare use is updated until jobs finish...
I don't think fairshare use is updated until jobs finish...
On Fri, Aug 9, 2024 at 5:59 PM Drucker, Daniel via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hi Paul from over at mclean.harvard.edu!
>
> I have never added *any* users using sacctmgr - I've always just had
> everyone, I guess...
We'd bumped ours up for a while 20+ years ago when we had a flaky
network connection between two buildings holding our compute nodes. If you
need more than 600s you have networking problems.
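(A sketch, assuming the timeout under discussion is SlurmdTimeout; the excerpt doesn't name the parameter:)

# slurm.conf -- 600 s is plenty on a healthy network
SlurmdTimeout=600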
On Mon, Feb 12, 2024 at 5:41 PM Timony, Mick via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> We...
...there's no reason to baroquify it.
On Wed, Jan 4, 2023 at 1:54 PM Fulcomer, Samuel
wrote:
> Just make the cluster names the same, with different Nodename and
> Partition lines. The rest of slurm.conf can be the same. Having two cluster
> names is only necessary if you're running production in a multi-cluster
> configuration...
Just make the cluster names the same, with different Nodename and Partition
lines. The rest of slurm.conf can be the same. Having two cluster names is
only necessary if you're running production in a multi-cluster
configuration.
Our model has been to have a production cluster and a test cluster wh...
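A minimal sketch of the shared-slurm.conf approach described above (all names and sizes hypothetical):

# one slurm.conf, one ClusterName, two node/partition sets
ClusterName=cluster
NodeName=prod[001-100] CPUs=32 RealMemory=192000 State=UNKNOWN
NodeName=test[001-004] CPUs=32 RealMemory=192000 State=UNKNOWN
PartitionName=batch Nodes=prod[001-100] Default=YES State=UP
PartitionName=test Nodes=test[001-004] State=UP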
The NVIDIA A10 would probably work. Check the Dell specs for card lengths
that it can accommodate. It's also passively cooled, so you'd need to
ensure that there's good airflow through the card. The proof would be
installing a card, and watching the temp when you run apps on it. It's
150W, so not t...
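A simple way to watch the card under load (standard nvidia-smi query flags):

# temperature and power draw every 5 seconds while apps run
nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv -l 5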
Hi Byron,
We ran into this with 20.02, and mitigated it with some kernel tuning. From
our sysctl.conf:
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
# prevent neighbour (aka ARP) table overflow...
net.ipv4.neigh.default.gc_thresh1 = 3...
net.ipv4.neigh.default.gc_thresh2 = 320...
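To apply and spot-check the settings (standard sysctl usage):

sysctl -p                      # load /etc/sysctl.conf
sysctl net.core.somaxconn      # verify a value took effect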
From our /etc/pam.d/sshd on our compute nodes:
account    required      pam_nologin.so
account    sufficient    pam_access.so
account    include       password-auth
-account   required      pam_slurm_adopt.so
and /etc/pam.d/password-auth:
#-session optional pam_systemd.so
Note that di...
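Separately, a quick sanity check that pam_slurm_adopt is enforcing access (node name hypothetical): ssh to a compute node where you have no running job should be refused.

ssh node001 true
# expected (roughly): Access denied by pam_slurm_adopt: you have no active jobs on this node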
...it is a bit arcane, but it's not like we're funding lavish
lifestyles with our support payments. I would prefer to see a slightly more
differentiated support system, but this suffices...
On Thu, Mar 24, 2022 at 6:06 PM Sean Crosby wrote:
> Hi Jeff,
>
> The support system is here - https://bug
...and you shouldn't be able to do this with a QoS (I think as you want it
to), as "grptresrunmins" applies to the aggregate of everything using the
QoS.
On Thu, Dec 16, 2021 at 6:12 PM Fulcomer, Samuel
wrote:
> I've not parsed your message very far, but...
>
> for i in...
I've not parsed your message very far, but...
for i in `cat limit_users` ; do
    sacctmgr modify user where name=$i partition=foo account=bar \
        set grptresrunmins=cpu=N    # N = cap on running CPU-minutes
done
On Thu, Dec 16, 2021 at 6:01 PM Ross Dickson
wrote:
> I would like to impose a time limit stricter than the partition limit on
> a certa...
There's no clear answer to this. It depends a bit on how you've segregated
your resources.
In our environment, GPU and bigmem nodes are in their own partitions.
There's nothing to prevent a user from specifying a list of potential
partitions in the job submission, so there would be no need for the...
...ets, e.g.:
# 8-gpu A6000 nodes - dual-root
NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[0-3] CPUs=0-23
NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[4-7] CPUs=24-47
On Fri, Aug 20, 2021 at 6:01 PM Fulcomer, Samuel
wrote:
> Well... you've got lots of...
CapWatts=n/a
>
>CurrentWatts=0 AveWatts=0
>
>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>
> *Node2-3*
>
> NodeName=node02 Arch=x86_64 CoresPerSocket=16
>
>CPUAlloc=0 CPUTot=64 CPULoad=0.48
>
>AvailableFeatures=RTX6000
>
>
What SLURM version are you running?
What are the #SLURM directives in the batch script? (or the sbatch
arguments)
When the single GPU jobs are pending, what's the output of 'scontrol show
job JOBID'?
What are the node definitions in slurm.conf, and the lines in gres.conf?
Are the nodes all the same...
XDMoD can do that for you, but bear in mind that wait/pending time by
itself may not be particularly useful.
Consider the extreme scenario in which a user is only allowed to use one
node at a time, but submits a thousand one-day jobs. Without any other
competition for resources, the average wait/pending time...
On Mon, Jul 26, 2021 at 1:32 PM Jason Simms wrote:
> Dear Samuel,
>
> Restarting slurmctld did the trick. Thanks! I should have thought to do
> that, but typically scontrol reconfigure picks up most changes.
>
> Warmest regards,
> Jason
>
> On Mon, Jul 26, 2021 at 12:55
...and... you need to restart slurmctld when you change a NodeName line.
"scontrol reconfigure" doesn't do the truck.
On Mon, Jul 26, 2021 at 12:49 PM Fulcomer, Samuel
wrote:
> If you have a dual-root PCIe system you may need to specify the CPU/core
> affinity in gres.conf.
If you have a dual-root PCIe system you may need to specify the CPU/core
affinity in gres.conf.
On Mon, Jul 26, 2021 at 12:07 PM Jason Simms wrote:
> Hello all,
>
> I have a GPU node with 3 identical GPUs (we started with two and recently
> added the third). Running nvidia-smi correctly shows th
Jason,
I've just been working through a similar scenario to handle access to our
3090 nodes that have been purchased by researchers.
I suggest putting the node into an additional partition, and then add a QOS
for the lab group that has grptres=gres/gpu=1,cpu=M,mem=N (where cpu and
mem are whatever...)
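A sketch of that setup (QOS, user names, and limits are hypothetical):

sacctmgr add qos labgpu
sacctmgr modify qos labgpu set grptres=gres/gpu=1,cpu=8,mem=64G
sacctmgr modify user where name=alice set qos+=labgpu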
You can specify a partition priority in the partition line in slurm.conf,
e.g. Priority=65000 (I forget what the max is...)
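For example (hypothetical partition; Priority maps to a 16-bit field, so the ceiling is a bit above 65000):

PartitionName=urgent Nodes=node[01-16] Priority=65000 State=UP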
On Thu, Jun 17, 2021 at 10:31 PM wrote:
> Thanks for the help. We tried to reduce the sched_interval and the pending
> time decreased as expected.
>
> But the influence of
...sorry... "sinfo | grep drain && sinfo | grep drain | mail -s 'drain
nodes' <address>"
On Sat, Jun 12, 2021 at 4:46 PM Fulcomer, Samuel
wrote:
> ...something like "sinfo | grep drain && mail -s 'drain nodes' address> "
>
> ...will
...something like "sinfo | grep drain && mail -s 'drain nodes' "
...will work...
Substitute "draining" or "drained" for "drain" to taste...
On Sat, Jun 12, 2021 at 4:32 PM Rodrigo Santibáñez <
rsantibanez.uch...@gmail.com> wrote:
> Hi SLURM users,
>
> Does anyone have a cronjob or similar to monitor...
inline below...
On Sat, Apr 3, 2021 at 4:50 PM Will Dennis wrote:
> Sorry, obvs wasn’t ready to send that last message yet…
>
>
>
> Our issue is the shared storage is via NFS, and the “fast storage in
> limited supply” is only local on each node. Hence the need to copy it over
> from NFS (and th...
...nd, or something like /tmp… That’s why my desired
> workflow is to “copy data locally / use data from copy / remove local copy”
> in separate steps.
>
>
>
>
>
> From: slurm-users on behalf of Fulcomer, Samuel
> Date: Saturday, April 3, 2021 at 4:00 PM
> To: ...
Unfortunately this is not a good workflow.
You would submit a staging job with a dependency for the compute job;
however, in the meantime, the scheduler might launch higher-priority jobs
that would want the scratch space, and cause it to be scrubbed.
In a rational process, the scratch space would...
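For reference, the dependency chain itself is straightforward (script names hypothetical); the caveat above is about what can happen between the two jobs:

jid=$(sbatch --parsable stage.sh)       # copy NFS data to local scratch
sbatch --dependency=afterok:$jid compute.sh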
Durai,
There is no inheritance in "AllowAccounts". You need to specify each
account explicitly.
There _is_ inheritance in fairshare calculation.
On Fri, Jan 15, 2021 at 2:17 PM Brian Andrus wrote:
> As I understand it, the parents are really meant for reporting, so you
> can run reports that a
...lly through SQL hacking; however, we just went with a virgin database
when we last upgraded in order to get it working (and sucked the accounting
data into XDMoD).
On Thu, Jan 14, 2021 at 6:36 PM Fulcomer, Samuel
wrote:
> AllowedDevicesFile should not be necessary. The relevant devices are...
AllowedDevicesFile should not be necessary. The relevant devices are
identified in gres.conf. "ConstrainDevices=yes" should be all that's needed.
nvidia-smi will only see the allocated GPUs. Note that a single allocated
GPU will always be shown by nvidia-smi to be GPU 0, regardless of its
actual h...
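A minimal cgroup.conf sketch for the above:

# cgroup.conf -- the one line that matters for GPU fencing
ConstrainDevices=yes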
Important notes...
If requesting more than one core and not using "-N 1", equal numbers of
GPUs will be allocated on each node where the cores are allocated. (i.e. if
requesting 1 GPU for a 2-core job, if one core is allocated on each of two
nodes, one GPU will be allocated on each node).
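For example (job script name hypothetical):

sbatch -N 1 -n 2 --gres=gpu:1 job.sh   # 2 cores + 1 GPU, all on one node
sbatch -n 2 --gres=gpu:1 job.sh        # cores may split across 2 nodes -> 1 GPU on each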
If you...
...are also described in the
> RELEASE_NOTES file.
>
> So I wouldn't go directly to 20.x, instead I would go from 17.x to 19.x
> and then to 20.x
>
> -Paul Edmon-
> On 11/2/2020 8:55 AM, Fulcomer, Samuel wrote:
>
> We're doing something similar. We're continuing to run production on 17.x...
We're doing something similar. We're continuing to run production on 17.x
and have set up a new server/cluster running 20.x for testing and MPI app
rebuilds.
Our plan had been to add recently purchased nodes to the new cluster, and
at some point turn off submission on the old cluster and switch e...
Compile slurm without ucx support. We wound up spending quality time with
the Mellanox... wait, no, NVIDIA Networking UCX folks to get this sorted
out.
I recommend using SLURM 20 rather than 19.
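A sketch of the build (standard autoconf disable flag; prefix hypothetical):

./configure --prefix=/opt/slurm --without-ucx
make && make install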
regards,
s
On Thu, Oct 22, 2020 at 10:23 AM Michael Di Domenico
wrote:
> was there ever a result...
cgroups should work correctly _if_ you're not running with an old corrupted
slurm database.
There was a bug in a much earlier version of slurm that corrupted the
database in a way that the cgroups/accounting code could no longer fence
GPUs. This was fixed in a later version, but the database corruption...
"-N 1" restricts a job to a single node.
We've continued to have issues with this. Historically we've had a single
partition with multiple generations of nodes segregated for
multinode scheduling via topology.conf. "Use -N 1" (unless you really know
what you're doing) only goes so far.
There are...
If you use cgroups, tmpfs /tmp and /dev/shm usage is counted against the
requested memory for the job.
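So a job that stages, say, 8 GB into tmpfs /tmp needs that headroom in its memory request (numbers hypothetical):

sbatch --mem=24G job.sh   # ~16G for the application + ~8G of tmpfs usage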
On Tue, Mar 31, 2020 at 4:51 PM Ellestad, Erik
wrote:
> How are folks managing allocation of local TmpDisk for jobs?
>
> We see how you define the location of TmpFs in slurm.conf.
>
> And then
Thanks! And I'll watch the video...
Privileged containers! Never!
On Thu, Sep 19, 2019 at 9:06 PM Michael Jennings wrote:
> On Thursday, 19 September 2019, at 19:27:38 (-0400),
> Fulcomer, Samuel wrote:
>
> > I obviously haven't been keeping up with any security concerns...
Hey Michael,
I obviously haven't been keeping up with any security concerns over the use
of Singularity. In a 2-3 sentence nutshell, what are they?
I've been annoyed by NVIDIA's docker distribution for DGX-1 & friends.
We've been setting up an ersatz-secure Singularity environment for use of
mid...
...and for the SchedMD folks, it would be a lot simpler to
drop/disambiguate the "year it was released" first element in the version
number, and just use it as an incrementing major version number.
On Tue, Jul 9, 2019 at 6:42 PM Fulcomer, Samuel
wrote:
> Hi Pariksheet,
>
...I've suggested
some documentation clarification, but it's still somewhat easily missed.
Regards,
Sam
On Tue, Jul 9, 2019 at 6:23 PM Pariksheet Nanda
wrote:
> Hi Samuel,
>
> On Mon, Jul 8, 2019 at 8:19 PM Fulcomer, Samuel
> wrote:
> >
> > The underlying issue...
Hi Pariksheet,
Note that an "upgrade", in the sense that retained information is converted
to new formats, is only relevant for the slurmctld/slurmdbd (and backup)
node.
If you're planning downtime in which you quiesce job execution (i.e.,
schedule a maintenance reservation), and have image conf...
Hi Palle,
You should probably get the latest stable SLURM version from
www.schedmd.com and use the build/install instructions found there. Note
that you should check for WARNING messages in the config.log produced by
SLURM's configure, as they're the best place to find missing
packages that...
...go to a 3
> month moving window to allow people to bank their fairshare, but we haven't
> done that yet as people have been having a hard enough time understanding
> our current system. It's not due to its complexity but more that most
> people just flat out aren't cognizant...
...rely purely on fairshare weighting for
> resource usage. It has worked pretty well for our purposes.
>
> -Paul Edmon-
> On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
>
>
> (...and yes, the name is inspired by a certain OEM's software licensing
> schemes...)
>
>
...ting of 130 CPUs
> because the CPUs are normalized to the old performance. Since it would
> probably look bad politically to reduce someone's number, but giving a new
> customer a larger number should be fine.
>
> Regards,
> Alex
>
> On Wed, Jun 19, 2019 at 12:32 PM Fulcomer, Samuel wrote:
(...and yes, the name is inspired by a certain OEM's software licensing
schemes...)
At Brown we run a ~400 node cluster containing nodes of multiple
architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased in
some cases by University funds and in others by investigator funding
(~50:...
On Mon, May 20, 2019 at 2:59 PM wrote:
>
>
>
> I did test setting GrpTRESRunMins=cpu=N for each user + account
> association, and that does appear to work. Does anyone know of any other
> solutions to this issue?
No. Your solution is what we currently do. A "...PU" would be a nice, tidy
addition...
...ing?
>
> Prentice
>
> On 4/16/19 1:12 PM, Fulcomer, Samuel wrote:
>
> We had an AC921 and an AC922 for a while as loaners.
>
> We had no problems with SLURM.
>
> Getting POWERAI running correctly (bugs since fixed in newer release) and
> apps properly built and linked to ESSL was the long march...
We had an AC921 and an AC922 for a while as loaners.
We had no problems with SLURM.
Getting POWERAI running correctly (bugs since fixed in newer release) and
apps properly built and linked to ESSL was the long march.
regards,
s
On Tue, Apr 16, 2019 at 12:59 PM Prentice Bisbal wrote:
> Sergi,
>
>
...submit all jobs to all partitions plugin and
> having users constrain to specific types of nodes using the
> --constraint=whatever flag.
>
>
> Nicholas McCollum
> Alabama Supercomputer Authority
> --
> *From:* "Fulcomer, Samuel"
> *S
We use topology.conf to segregate architectures (Sandy->Skylake), and also
to isolate individual nodes with 1Gb/s Ethernet rather than IB (older GPU
nodes with deprecated IB cards). In the latter case, topology.conf had a
switch entry for each node.
It used to be the case that SLURM was unhappy wi...
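A sketch of that layout (switch and node names hypothetical):

# topology.conf
SwitchName=ib-sandy Nodes=sandy[001-064]
SwitchName=ib-sky Nodes=sky[001-032]
SwitchName=eth-gpu01 Nodes=gpu01   # one switch entry per Ethernet-only node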
...delete the association with the
> following command after the user's jobs complete.
>
> # sacctmgr delete user where name=clschf partition=k80 account=acct-clschf
>
> Best,
>
> Jianwen
>
> On Dec 29, 2018, at 11:50, Fulcomer, Samuel
> wrote:
>
> ...right.
...right. An association isn't an "entity". You want to delete a "user"
where name=clschf partition=k80 account=acct-clschf .
This won't entirely delete the user entity, only the record/association
matching the name/partition/account spec.
The foundation of SLURM nomenclature has some unfortunate...
Yes, in a way. In thinking about this for Brown (we haven't implemented it,
yet), we've the idea of having a Linux cron job periodically query the
group membership of the AD group granted access to the HPC resource, and
adding any new users to the SLURM accounting database.
We're at the point of u...
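A sketch of such a cron job (group and account names hypothetical; sacctmgr simply reports nothing new for users that already exist):

# add any AD group member missing from the Slurm accounting DB
for u in $(getent group hpc-users | cut -d: -f4 | tr ',' ' '); do
    sacctmgr -i add user name="$u" account=default
done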
We've got 15.08.8/9.
-s
On Wed, Oct 24, 2018 at 5:51 PM, Bob Healey wrote:
> I'm in the process of upgrading a system that has been running 2.5.4 for
> the last 5 years with no issues. I'd like to bring that up to something
> current, but I need a bunch of older versions that do not appear to...
Is there a firewall turned on? What does "iptables -L -v" report on the
three hosts?
On Mon, May 21, 2018 at 11:05 AM, Turner, Heath wrote:
> If anyone has advice, I would really appreciate...
>
> I am running (just installed) slurm-17.11.6, with a master + 2 hosts. It
> works locally on the master...
This came up around 12/17, I think, and as I recall the fixes were added to
the src repo then; however, they weren't added to any of the 17.x releases.
On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand wrote:
> I dug into the logs on both the slurmctld side and the slurmd side.
> For the record, I h...
We use GrpTresRunMins for this, with the idea that it's OK for users to
occupy lots of resources with short-running jobs, but not so much with
long-running jobs.
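For example (numbers hypothetical): a cap of GrpTRESRunMins=cpu=1000000 lets one user hold about 694 cores' worth of one-day jobs (1,000,000 / 1,440 minutes) but only about 69 cores' worth of ten-day jobs, which is exactly the short-over-long bias described.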
On Wed, Feb 7, 2018 at 8:41 AM, Bill Barth wrote:
> Of course, Matteo. Happy to help. Our job completion script is:
>
> #!/bin/bash
>