Dear Xaver,
Could you clarify the function of what you call "master"?
If it's the Slurm controller, i.e. running slurmctld: Why do you need
slurmd running on it as well?
Best,
Stephan
On 24.06.24 13:54, Xaver Stiensmeier via slurm-users wrote:
Dear Slurm users,
in our project we exclude th
Markus, thanks for the heads-up.
I intend to either reserve specific nodes with GPUs or use features.
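In case it helps, a minimal sketch of the feature-based variant (node,
GPU type and feature names are made up):

  # slurm.conf: tag the GPU nodes with a feature describing the GPU type
  NodeName=gpu-node[01-02] Gres=gpu:rtx_3090:4 Feature=rtx_3090
  # submission: ask for a node carrying that feature plus one GPU
  sbatch --constraint=rtx_3090 --gres=gpu:1 job.sh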
Best,
Stephan
On 13.09.23 09:08, Markus Kötter wrote:
Hi,
Currently reservations do not work for GRES.
https://bugs.schedmd.com/show_bug.cgi?id=5771
23.11 might change this.
Kind regards
Thanks Chris, this completes what I was looking for.
I should have taken a closer look at the scontrol man page.
Best,
Stephan
On 13.09.23 02:24, Chris Samuel wrote:
On 12/9/23 9:22 am, Stephan Roth wrote:
Thanks Noam, this looks promising!
I would suggest that, as well as the "magnetic" flag, they specify the
name of the reservation. The reservation will only "attract" jobs that
meet the access control requirements.
(from https://slurm.schedmd.com/reservations.html)
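A minimal sketch of how such a magnetic reservation could be created
(reservation, account and node names as well as the duration are
placeholders; Flags=MAGNETIC needs Slurm 20.02 or newer):

  # reserve whole GPU nodes for the account, since reserving bare GRES
  # doesn't work yet (see bug 5771); jobs of that account are then
  # "attracted" without naming the reservation explicitly
  scontrol create reservation ReservationName=gpu_guarantee \
      Accounts=gpu_acct Nodes=gpu-node[01-02] \
      StartTime=now Duration=30-00:00:00 Flags=MAGNETIC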
On Sep 12, 2023, at 10:14 AM, Stephan Roth wrote:
Dear Slurm users,
I'm looking to fulfill the requirement of guaranteeing availability of
GPU resources to a Slurm account, while allowing this account to use
other available GPU resources as well.
The guaranteed GPU resources should be of at least 1 type, optionally up
to 3 types, as in:
Gr
ware of any job crashes. Your
mileage may vary depending on job types!
Question: Does anyone have bad experiences with upgrading slurmd while
the cluster is running production?
/Ole
--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG)
D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich
Hi Byron,
If you have the means to set up a test environment to try the upgrade
first, I recommend doing it.
The upgrade from 19.05 to 20.11 worked for two clusters I maintain with
a similar NFS-based setup, except we keep the Slurm configuration
separate from the Slurm software accessible
On 17.05.22 17:17, Timo Rothenpieler wrote:
On 17.05.2022 15:58, Brian Andrus wrote:
You are starting to understand a major issue with most containers.
I suggest you check out Singularity, which was built from the ground
up to address most issues. And it can run other container types (e.g. Docker).
--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG)
D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich
Phone +41 44 632 30 59
stephan.r...@ee.ethz.ch
www.isg.ee.ethz.ch
Hi Diego,
I don't know about MPICH, but in case you haven't done this already, you
might check on the Slurm side whether everything is ready:
Did you make sure your Slurm was built with PMI support (as in
`configure ... --with-pmix=/path/to/pmix`)?
Do you see MPI types:
srun --mpi=list
Does a tes
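For instance (the test program name is a placeholder):

  # list the MPI plugin types Slurm was built with; pmix should show up
  srun --mpi=list
  # then try a small test run with it
  srun --mpi=pmix -N 2 -n 2 ./mpi_hello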
On 02.02.22 18:32, Michael Di Domenico wrote:
On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote:
The problem is to identify the cards physically from the information we
have, like what's reported with nvidia-smi or available in
/proc/driver/nvidia/gpus/*/information
The serial number
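For reference, the serial number and bus ID can be queried together
with nvidia-smi (a sketch; consumer cards often report the serial as
[N/A]):

  nvidia-smi --query-gpu=index,name,serial,uuid,pci.bus_id --format=csv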
Not a solution, but some ideas & experiences concerning the same topic:
A few of our older GPUs used to show the error message "has fallen off
the bus" which was only resolved by a full power cycle as well.
Something changed; nowadays the error message is "GPU lost" and a
normal reboot resolves it.
, we want to use EGL backend for accessing OpenGL without the
need for Xorg. This approach requires access to devices
/dev/dri/card* and /dev/dri/renderD* . Is there a way to give
access to these devices along with /dev/nvidia* which we use for
CUDA? Ideally as a single generic resource that would g
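One direction that might work is to bundle the device files into a
single GRES (untested sketch; it assumes a Slurm version whose
gres.conf supports MultipleFiles, and the device paths are examples):

  # gres.conf: hand out the NVIDIA device together with its DRI nodes,
  # so a plain --gres=gpu:1 grants access to all three device files
  Name=gpu MultipleFiles=/dev/nvidia0,/dev/dri/card0,/dev/dri/renderD128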
On 03.06.21 07:11, Ahmad Khalifa wrote:
How to send a job to a particular GPU card using its ID (0, 1, 2, etc.)?
Why do you need to access a GPU based on its ID?
If it's to select a certain GPU type, there are other methods you can use.
You could create partitions for the same GPU types or add features to
the nodes, for example:
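A rough sketch of both (partition and GPU type names are made up):

  # submit to a partition that only contains the desired GPU type
  sbatch --partition=rtx2080ti --gres=gpu:1 job.sh
  # or request the type directly, or via a node feature
  sbatch --gres=gpu:rtx2080ti:1 job.sh
  sbatch --constraint=rtx2080ti --gres=gpu:1 job.sh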
>> TaskPlugin=task/cgroup
>> ProctrackType=proctrack/cgroup
>>
>> ## Nodes list
>> ## use native GPUs
>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=
> ...partition and will resume the job. I am not deleting the partition here.
>
> Regards
> Navin.
---
Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59 | ETF D 104 | Sternwartstrasse 7 | 8092 Zurich
---
Hi all,
Does anyone have ideas or suggestions on how to automatically cancel
jobs which don't utilize the GPUs allocated to them?
The Slurm version in use is 19.05.
I'm thinking about collecting GPU utilization per process on all nodes
with NVML/nvidia-smi and updating a mean value of the collected
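A very rough, untested sketch of the per-node collection step (paths
and the idle policy are placeholders; it assumes proctrack/cgroup so
that scontrol listpids can map PIDs to jobs):

  # one sample of per-process GPU utilization: gpu index, pid, sm %
  nvidia-smi pmon -c 1 -s u | awk '!/^#/ {print $1, $2, $4}' > /tmp/gpu_sample
  # map local PIDs to Slurm job IDs
  scontrol listpids | awk 'NR>1 {print $1, $2}' > /tmp/pid_to_job
  # join the two, keep a running mean per job, and scancel <jobid>
  # once the mean stays at zero for long enough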
In regard to Kota's initial question
... "Is there any way (commands, configurations, etc...) to see the
allocated GPU indices for completed jobs?" ...
I was in need of the same kind of information and found the following:
If
- ConstrainDevices is on
- SlurmdDebug is set to at least "debug"
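then the slurmd log on the node records which GPU devices each job step
was granted. Roughly (the log path is an example and the exact wording
of the log lines varies between Slurm versions):

  # filter the node's slurmd log by the job ID after the job finished
  grep 1234567 /var/log/slurm/slurmd.log | grep -i nvidia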
Dear all,
Does anybody know of a way to detect whether a job is submitted with
srun, preferably in job_submit.lua?
The goal is to allow interactive jobs only on specific partitions.
Any recommendation or best practice on how to handle interactive jobs is
welcome.
Thank you,
Stephan
Best,
Stephan
-------
Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59 | ETF D 104 | Sternwartstrasse 7 | 8092 Zurich
---
On 23.04.20 1
I just checked the .deb package that I build from source and there is
nothing in it that has nv or cuda in its name.
Are you sure that Slurm distributes NVIDIA binaries?
-----Original Message-----
From: slurm-users On Behalf Of Stephan Roth
Sent: Friday, February 7, 2020 2:23 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to use Autodetect=nvml in
On 05.02.20 21:06, Dean Schulze wrote:
> I need to dynamically configure gpus on my nodes. The gres.conf doc
> says to use
>
> Autodetect=nvml
That's all you need in gres.conf provided you don't configure any
Gres=... entries for your nodes in your slurm.conf.
If you do, make sure the string matches
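To illustrate the matching (node name, GPU type and count are made up):

  # gres.conf
  AutoDetect=nvml
  # slurm.conf: the Type string should match what NVML detects
  GresTypes=gpu
  NodeName=node01 Gres=gpu:geforce_rtx_2080_ti:4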