of a running job based on its jobid?
>
>
> Regards,
> Mahmood
--
Tina Friedrich, Snr HPC Systems Administrator, Advanced Research Computing
Research Computing and Support Services, Academic IT
IT Services, University of Oxford
http://www.arc.ox.ac.uk
just ssh'ing to a node and running xterm/etc.
> >>
> >> With srun, however:
> >>
> >> srun -n1 --pty --x11 xterm
> >> srun: error: Unable to allocate resources: X11 forwarding not available
> >>
> >> So, what am I missing?
> >>
> >> Thanks.
> >>
> >> PS
> >>
before).
> Regular ssh forwarding works fine.
>
> On Tue, Oct 16, 2018 at 09:47:21AM +0100, Tina Friedrich wrote:
> > I had an issue getting x11 forwarding via SLURM (srun/sbatch) to work; ssh
> > worked fine. Tracked it down to the host name setting on the nodes; as per
>
'knl_generic' plugin enabled; there were some KNL nodes,
although they are no longer there. Still, I'm not even requesting 'knl' here?)
Google didn't really yield anything, so I thought asking might be quicker.
Thanks!
Tina
Hello,
Two things: you don't actually seem to have the '--x11' flag on your
srun command? I.e. does 'srun --x11 --nodelist=compute-0-5 -n 1 -c 6
--mem=8G -A y8 -p RUBY xclock' get you any further?
I had some trouble getting the inbuilt X forwarding to work, which had
to do with hostnames & xauth
I agree with you on that one - I'd forgotten about that detail. Having
to actually do an 'ssh -X' before you can do 'srun --x11' is quite
silly, and a bit aggravating.
You can do 'ssh -X localhost' and then try the srun; that should work,
as well.
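Something like this, as a rough sketch (assuming X11 support is
compiled into Slurm and enabled):

  ssh -X localhost           # refreshes the xauth cookie in a form slurm can read
  srun --x11 --pty xterm     # then request X11 forwarding from srun as usual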
Tina
On 21/11/2018 18:04, Mahmood Naderan wrote:
I really don't want to start a flaming discussion on this - but I don't
think it's an unusual situation. I have, likewise, in roughly 15 years
of doing this, not ever worked anywhere where people didn't have a GUI
to submit from. It's always been a case of 'Want to use the cluster?
We'll make y
tly integrated with our environment for our staff to
>> submit and monitor their jobs from (they don't have to touch a single
>> job script).
>>
>> On Thu, Nov 22, 2018 at 6:28 PM Tina Friedrich
> >> <tina.friedr...@it.ox.ac.uk> wrote:
>>
>>
I'm running 18.08.3, and I have a fair number of GPU GRES resources -
recently upgraded to 18.08.03 from a 17.x release. It's definitely not
as if they don't work in an 18.x release. (I do not distribute the same
gres.conf file everywhere though, never tried that.)
Just a really stupid question
4,06-09,11-14,16-19,21-22] Name=gpu Type=k20
> File=/dev/nvidia[0-1] Cores=0,1
>
> What am I missing?
>
> Thanks...
>
>
>
>
> On Wed, Dec 5, 2018 at 4:59 AM Tina Friedrich
> <tina.friedr...@it.ox.ac.uk> wrote:
>
> I'm running
that necessary or is
> that just a sanity check?
>
> Once again, I'd like to thank all contributors to this thread... It has
> helped me get my cluster going!
>
> Thanks.
> Lou
>
>
>
> On Wed, Dec 5, 2018 at 9:41 AM Tina Friedrich
> <tina.friedr...@i
Indeed - am I the only person who finds that quite a bit annoying? A
lot of interactive software works a lot better over things like NX, so
why this limitation?
Tina
(I realise I'm not adding much to the discussion, probably :) )
On 15/05/2019 08:36, Marcus Wagner wrote:
> Dear Mahmood,
>
> ple
Hadn't yet read that far - I plan to test 19.05 soon anyway. Will report.
(I thought the plumbing was - basically - libssh; and, well, ssh itself
is capable of dealing with local displays?)
Tina
On 15/05/2019 15:06, Chris Samuel wrote:
> On 15/5/19 3:01 am, Tina Friedrich wrote:
>
Hi Lawrence,
no, as far as I can tell, SLURM doesn't have any way to allow users to
submit/create advance reservations.
Could you get around it with sudo? It would be easy to allow a group of
users to run 'sudo scontrol create ' (or a suitable wrapper script,
to make the syntax easy). It'd
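A sketch of what I mean - group, script and reservation names here are
all made up:

  # /etc/sudoers.d/reservations
  %res-admins ALL=(root) NOPASSWD: /usr/local/sbin/mkres

  # mkres would then wrap something like:
  scontrol create reservation reservationname=res_$SUDO_USER \
      users=$SUDO_USER starttime=now duration=120 nodecnt=2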
Hi Jose,
I run my slurmctld (and the database) in a VM. Some of my
test/development nodes are VMs, as well. Actual worker nodes are
hardware, for performance reasons :)
Is it the SLURM controller that you're planning to run as a VM, or the
whole cluster?
Tina
On 12/09/2019 15:23, Jose A wrote:
I second that question - I'm using the same combination :)
I know there's some efforts - see
https://slurm.schedmd.com/SLUG16/monitoring_influxdb_slug.pdf - but I
don't know exactly what the state of that is at the moment.
(I resorted to telegraf's 'execute script' plugin to pump some
information
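In case it's useful, the relevant telegraf stanza looks roughly like
this (the script name is a placeholder for whatever emits the metrics):

  [[inputs.exec]]
    commands = ["/usr/local/bin/slurm-metrics.sh"]  # prints InfluxDB line protocol
    data_format = "influx"
    timeout = "30s"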
Hello,
is there a possibility to tie a reservation to a QoS (instead of an
account or user), or enforce a QoS for jobs submitted into a reservation?
The problem I'm trying to solve is - some of our resources are bought on
a co-investment basis. As part of that, the 'owning' group can get very
nking has made dedicated partitions and QOSes
> something we have not had to deal with as CPU time per 30 day sliding
> window has been accepted, can be quantitatively shown, and just is a
> much easier way to schedule when ALL resources can be used.
>
> Bill
>
> On 10/28/19
ave to solve it with some scripting, then.
Tina
On 28/10/2019 19:02, Kurt H Maier wrote:
> On Mon, Oct 28, 2019 at 06:40:48PM +, Tina Friedrich wrote:
>> That's fine and all sounds nice but doesn't precisely help me solve my
>> problem - which is how to ensure that peo
Hi Angelines,
I use a plugin for that - I believe this one
https://github.com/hpc2n/spank-private-tmp
which sort of does it all; your job sees an (empty) /tmp/.
(It doesn't do cleanup, I simply rely on OS cleaning up /tmp, at the
moment.)
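Wiring it up is a single line in plugstack.conf - roughly as below; the
exact option names are whatever the plugin's README says, I'm quoting
from memory:

  # /etc/slurm/plugstack.conf
  required /usr/lib64/slurm/private-tmpdir.so base=/local/slurmtmp mount=/tmp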
Tina
On 05/12/2019 15:57, Angelines wrote:
> Hello,
>
Hello,
shame this seems to be the last message in this thread!
I'm currently banging against the same problem on a test system.
Did anyone get that to run? If yes, how exactly did you build the packages?
Tina
On 01/11/2019 18:19, Michael Jennings wrote:
> On Friday, 01 November 2019, at 10:41:
_hardened_cflags “-Wl,-z,lazy”
> %global _hardened_ldflags “-Wl,-z,lazy”
>
>
>
> -James
>
> -Original Message-
> From: slurm-users On Behalf Of Tina
> Friedrich
> Sent: Friday, February 21, 2020 10:40 AM
> To: slurm-users@lists.schedmd.com
> Subject: Re:
I remember having issues when I set up X forwarding that had to do with
how the host names were set on the nodes. I had them set (CentOS
default) to the fully qualified hostname, and that didn't work - with an
error message very similar to what you're getting, if memory serves
right. 'Fixed' it
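by setting the short hostname, something like this if I recall
correctly (node name is an example), so that it matches the NodeName in
slurm.conf:

  hostnamectl set-hostname compute-0-5
  hostname -s    # should now match the NodeName entry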
Hi Bas,
I wish I'd known that two years ago, might've saved me some setting up
(if it was around two years ago). My SLURM configuration is also
CFEngine3 controlled. So I'm quite interested in sharing.
Having a look at it in a minute...
Tina
On 13/07/2020 22:15, Bas van der Vlies wrote:
Hi Peter,
is this an actual NFS server, or something exporting NFS (like a NetApp)?
This might be a silly question but - if it's an actual server, could you
run the slurmdb server on the NFS server? There would then be no need
for any clustered DB service or anything; it would simply make the
Hello,
This is something I've seen once on our systems & it took me a while to
figure out what was going on.
It turned out that the system topology was such that all GPUs were
connected to one CPU. There were no free cores on that particular CPU;
so SLURM did not schedule any more jobs to
G"
MaxCPUsPerNode=48
I have tried variations for gres.conf such as:
NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3
as well as trying Cores= (rather than CPUs=), with NO success.
I've battled this all week. Any suggestions?
ev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]
I also tried your suggestions of 0-13, 14-27, and a combo.
I still only get 2 jobs to run on gpus at a time. If I take off the “CPUs=“, I
do get 4 jobs running per node.
Jodie
On Aug 7, 2020, at 12:18 PM, Tina Friedrich wrote:
Hi Jodie,
what version of SLURM are you using?
Script. Not doing anything manually if it can at all be avoided - way too
error-prone.
We have a cron job that does all of that. Checks if there are users or
groups in LDAP that aren't in SLURM yet, and adds them - that's adding
accounts, adding users, I think it also removed users/accounts i
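The core of it is just sacctmgr in a loop; a stripped-down sketch
(group and account names are placeholders):

  #!/bin/bash
  # add any member of the LDAP group that slurm doesn't know about yet
  for u in $(getent group hpc-users | cut -d: -f4 | tr ',' ' '); do
      sacctmgr -n show user "$u" | grep -qw "$u" || \
          sacctmgr -i add user "$u" account=default
  done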
Hi List,
apologies if this has been asked before (or is obvious) - I did do some
reading & searching but can't quite figure out the best way to achieve this.
Background - we have two production clusters, both running SLURM. They
are not currently a multi-cluster setup; they are not running the s
Yeah, I had that problem as well (trying to set up a partition that
didn't have any nodes - they're not here yet).
I figured that one can have partitions with nodes that don't exist,
though. As in, not even in DNS.
I currently have this:
[arc-slurm ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES
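For nodes that genuinely don't exist yet, declaring them with
State=FUTURE keeps slurmctld from trying to contact them - a sketch,
names and sizes made up:

  NodeName=new[001-016] CPUs=48 RealMemory=190000 State=FUTURE
  PartitionName=incoming Nodes=new[001-016] State=UP Default=NO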
What crossed my mind is that I might have missed a compile option
or compile dependency (but I'm not sure which it would be if it were
that - it's not as if the binding doesn't work at all.)
In short - am a bit stumped; any help welcome!
Tina
.org/pub/epel/7/$basearch
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=$basearch&infra=$infra&content=$contentdir
failovermethod=prio
use to build our vast array
of RPMs and they work just fine on our GPU nodes.
All the best,
Chris
: cannot find auth plugin for auth/munge
slurmd: error: cannot create auth context for auth/munge
slurmd: error: slurmd initialization failed
command terminated with exit code 1
Any advice?
thank you
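The usual first checks for auth/munge errors, as a sketch ('somenode'
is a placeholder):

  systemctl status munge             # munged must be running on every node
  munge -n | unmunge                 # local round trip
  munge -n | ssh somenode unmunge    # verifies the key matches across nodes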
com/job_container.conf.html
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
- Tim
and wipe the sssd cache for the user.
- Kill all their processes on the login nodes
- Move the data
- Re-enable the user in the LDAP
- Remove any blocks/limits of the user to start new job
- Mail the user that he/she can continue working again.
The whole process went pretty smoothly.
Ward
se? I'm assuming it wouldn't, but I figured it
safe to ask questions first and shoot later.
ty error, then X11 support was definitely
compiled into Slurm. The most common cause of .Xauthority issues is the
user's home directory hitting their quota limit. Could that be the case
here?
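A quick way to check, roughly (as the affected user):

  quota -s                 # over quota means xauth can't update ~/.Xauthority
  ls -l ~/.Xauthority
  xauth list               # should show a cookie for the current display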
--
Prentice
eady have for PBS, and will
create for Slurm, if something doesn't already exist).
Thank you all,
David
s that used it in the database :)
Can you elaborate on what you mean by "renaming"?
--
Prentice
On 4/19/21 8:55 AM, Tina Friedrich wrote:
Hi Prentice,
I've just done that on one of my test systems - and it's not deleting
a no longer used QoS, but 'renaming' th
en the slurm controller... you are making a huge
issue of a very basic task...
Sid
On Tue, 4 May 2021, 22:28 Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:
Hello,
a lot of people already gave very good answers to how to tackle this.
Still, I thought it wort
@c005's password:
My assumption was that a user should be able to log into a node on which
that person has a running job without any further ado, i.e. without the
necessity to set up anything else or to enter any credentials.
Is this assumption correct?
If so, how can I best debug what I have done wrong?
Cheers,
Loris
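For reference, that behaviour normally comes from pam_slurm_adopt on
the compute nodes; a sketch of the usual setup (paths may differ):

  # /etc/pam.d/sshd on the compute nodes, after the other account entries:
  account    required     pam_slurm_adopt.so
  # slurm.conf also needs PrologFlags=contain for the adoption to work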
boot for changes to
take effect. Do we have to stop users submitting jobs by draining
all partitions and then restart the server - that is, slurmctld, slurmdbd
and mariadb? Or will restarting the slurm VM have no effect on
running/pending jobs?
Sincerely
Amjad
ipts, so having it avoid by default would
work.
Any ideas how to do that? Submit LUA perhaps?
Brian Andrus
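A job_submit.lua along these lines would do it - a sketch only; the
feature name 'general' is made up, and the ordinary nodes would need to
carry it in slurm.conf:

  -- job_submit.lua: constrain feature-less jobs to 'general' nodes,
  -- so the special nodes are avoided unless explicitly requested
  function slurm_job_submit(job_desc, part_list, submit_uid)
      if job_desc.features == nil or job_desc.features == '' then
          job_desc.features = 'general'
      end
      return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
  end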
seem to be any benefit to that, as far as
we could see.
Tina
On 02/07/2021 06:48, Loris Bennett wrote:
Hi Tina,
Tina Friedrich writes:
Hi Brian,
sometimes it would be nice if SLURM had what Grid Engine calls a 'forced
complex' (i.e. a feature that you *have* to request to land on a node
bug fixes.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
- Tim
With Thanks and regards
>
> so, without having checked your sacct/awk logic I would not expect the
results to be the same.
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Hr./Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
Thanks for your help
slurmd -V
slurm 20.02.6
slurm.conf
TaskPlugin=task/affinity,task/cgroup
ProctrackType=proctrack/cgroup
cgroup.conf
AllowedRAMSpace=100.0
AllowedSwapSpace=0.0
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MemorySwappiness=0
CgroupAutomount=yes
ConstrainCores=yes
lementation.
We wanted to set up 3 clusters and one Login Node to run the job using
-M cluster option.
does anybody have such a setup, and can you share some insight into how it
works and whether it is really a stable solution?
Regards
Navin.
on the login node, which slurmctld is the slurm.conf file pointed to?
Is it possible to share a sample slurm.conf for the login node?
Regards
Navin.
On Thu, Oct 28, 2021 at 7:06 PM Tina Friedrich
<tina.friedr...@it.ox.ac.uk> wrote:
Hi Navin,
well, I have two cl
PATH to the correct slurm
binaries (which we install in /usr/local/slurm// so that
they co-exist). So when -M won't work, users can use:
module load slurm/clusterA
squeue
module load slurm/clusterB
squeue
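Worth noting: '-M' only works once all clusters are registered with the
same slurmdbd. Then, from any login node (cluster names as examples):

  sacctmgr show clusters        # all clusters should be listed
  squeue -M clusterA,clusterB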
BR,
On Thu, Oct 28, 2021 at 7:39 PM na
Is this setting also necessary on the
compute nodes?
Best;
Jeremy.
[1]
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
edmd.com/show_bug.cgi?id=3094
Best,
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
ending on which partition gets selected by Slurm.
Can this be done?
An option similar to --ntasks=USE_ALL_CORES would be great.
Many thanks,
Richard
allocated gpu card? What is the requirement on nvidia gpu drivers,
CUDA toolkit or any other part to help slurm correctly restrict the gpu
usage?
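The usual mechanism for that is device restriction via cgroups, so a
job only sees the GPUs it was allocated - roughly:

  # cgroup.conf (sketch)
  ConstrainDevices=yes

together with the task/cgroup plugin and correct File=/dev/nvidia*
entries in gres.conf; nothing special is needed from the driver or CUDA
beyond the device files themselves.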
s seems to be overkill for using only
this feature.
Is there any other plugin that implements this feature?
Best,
Stefan
he
things that will enable them to get reports, logs, whatever an admin and a user
will need. Just not execution of the jobs.
Thanks in advance for your help.
RC.
AcctGatherFrequency=30
SlurmctldDebug=error
SlurmdDebug=error
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
ThreadsPerCore=1 State=UNKNOWN
PartitionName=short Nodes=node[01-08] Default=NO MaxTim
jobs which reboot nodes - With a for loop, I could
submit a reboot job for each node. But I'm not sure how to limit this so at
most N jobs are running simultaneously.
With a fake license called reboot?
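Roughly like this, as a sketch (token count and names made up):

  # slurm.conf: four tokens -> at most four reboot jobs run at once
  Licenses=reboot:4

  # then submit one job per node:
  sbatch -L reboot -w node001 reboot-job.sh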
...job dependencies are also an option, thinking about this. You could
carve it up into X 'sets' of N nodes, with node-specific reboot jobs
that depend on the previous job in the same 'N' to finish.
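A sketch of that idea (node and script names made up):

  jid=$(sbatch --parsable -w node001 reboot.sh)
  jid=$(sbatch --parsable --dependency=afterany:$jid -w node002 reboot.sh)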
Tina
On 04/08/2022 11:23, Tina Friedrich wrote:
I'm thinking some
...not sure I'm adding anything to this discussion, but we have used the
various comment fields to store extra information for internal purposes -
sometimes JSON-format strings so they can be parsed by scripts etc. I even once
managed to mod the Elasticsearch plugin so that the comment field made
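E.g., roughly like this - retrieving it later needs the comment stored
in accounting (AccountingStoreJobComment, or
AccountingStoreFlags=job_comment on newer versions):

  sbatch --comment='{"project":"demo","ticket":42}' job.sh
  sacct -j <jobid> -o JobID,Comment%40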
nd trying xclock on the
login node would clarify that
Sorry, yes running xterm, xclock, etc. on the login node works.
Thanks,
Allan
Or run your database server on something like VMWare ESXi (which is what
we do). Instant HA and I don't even need multiple servers for it :)
I don't mean to be flippant, and I realise it's not addressing the mysql
HA question (but that got answered). However, a lot of us will have some
sort of
Hi Will,
I don't, currently, although it's on my list.
However, we had a presentation on a recent Oxford HPC-SIG meeting from a
colleague, who implemented a simple job profiler that saves a lot of job
data (including efficiency) & creates plots of the efficiency of the job
run (in a nutshell)
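For a quick start, the stock 'seff' contrib script already reports
per-job CPU and memory efficiency from the accounting data:

  seff <jobid>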
Hi Mike,
I moved from Grid Engine to SLURM a couple of years ago & it took me a
while to get my head around this :)
Yes - and you could also just edit slurm.conf and restart the
controller. That will not affect running jobs. It's - both in my
experience and from all I read - absolutely safe
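In practice, roughly:

  # push the new slurm.conf to all nodes, then:
  scontrol reconfigure          # most parameters are re-read on the fly
  # the few settings that can't be applied that way need:
  systemctl restart slurmctld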
Hi Patrick,
we certainly use that information to set affinity, yes. Our gres.conf
files (node-specific, as our config management creates them locally from
'nvidia-smi topo -m') look like this:
Name=gpu Type=a100 File=/dev/nvidia0 CPUs=0-23
Name=gpu Type=a100 File=/dev/nvidia1 CPUs=0-23
Nam
We do the same as Josef - we run the database on a VM (single VM,
MariaDB) and leave it up to (in our case) VMWare to ensure its availability.
Tina
On 25/01/2024 11:34, Josef Dvoracek wrote:
To protect from HW failure, and to have more free hands when upgrading
underlying OS, we use virtualiza