At some point when we were experimenting with MIG, I was entirely
frustrated trying to get it to work until I finally removed the autodetect
from gres.conf and explicitly listed the devices instead. THEN it worked. I
think you can find the list of files that are the device files using nvidia
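For anyone else stuck there, a minimal sketch of what "explicitly listed" can
look like in gres.conf (the hostname, type, and device paths below are
illustrative, not from my cluster; MIG instances each need their own entry
matching your actual layout):

    # gres.conf: explicit device list instead of AutoDetect=nvml
    NodeName=gpunode01 Name=gpu Type=a100 File=/dev/nvidia[0-3]

with a matching Gres=gpu:a100:4 on that node's line in slurm.conf.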
Dear Slurm Mailing List,
I am experiencing a problem which affects our cluster and for which I am
completely out of ideas by now, so I would like to ask the community for
hints or ideas.
We run a partition on our cluster containing multiple nodes with Nvidia
A100 GPUs (40GB), which we have s
I found that this is actually a known bug in Slurm so I'll note it here in case
anyone comes across this thread in the future:
https://bugs.schedmd.com/show_bug.cgi?id=10598
Steve
From: slurm-users on behalf of Wilson,
Steven M
Sent: Tuesday, July 18, 202
Hi Hermann,
Count doesn't make a difference, but I noticed that when I reconfigure
Slurm and do reloads afterwards, the error "gpu count lower than
configured" no longer appears - so maybe a reconfigure is simply needed
after reloading slurmctld - or maybe it doesn't show the error an
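For the record, the sequence I'm referring to is roughly this (standard
commands; the systemd unit name may differ on your distro):

    # after changing slurm.conf / gres.conf
    systemctl restart slurmctld   # on the controller
    scontrol reconfigure          # have the daemons re-read the config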
Hello everyone,
Has anyone here ever run an MCNP6.2 parallel job via the Slurm scheduler?
I am looking for a simple test job to test my software compilation.
Thank you,
Vlad Ozeryan
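Not an MCNP expert, but as a starting point, here is a minimal MPI-style
batch script sketch; mcnp6.mpi and test.inp are assumed names, so substitute
whatever your build and input deck are actually called:

    #!/bin/bash
    #SBATCH --job-name=mcnp-test
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8
    #SBATCH --time=00:30:00

    # one MPI rank per Slurm task; i= and o= are MCNP's input/output keywords
    srun mcnp6.mpi i=test.inp o=test.out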
On 19/07/2023 15:04, Jan Andersen wrote:
Hmm, OK - but that is the only nvml.h I can find, as shown by the find
command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and
ran it successfully; do I need to install something else besides? A
Google search for 'CUDA SDK' leads directly to NVIDIA's page:
https://docs.nvidia.co
In case you're developing the plugin in C and not Lua: behind the scenes, the
Lua mechanism concatenates all log_user() strings into a single variable
(user_msg). When the Lua code completes, the C code sets the *err_msg argument
of the job_submit()/job_modify() function to that string, then
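For the C route, a bare-bones sketch (assumes it is built in the Slurm source
tree like the bundled job_submit plugins; the time-limit check and the message
text are purely illustrative):

    /* job_submit_require_time.c - sketch, not a drop-in plugin */
    #include "slurm/slurm_errno.h"
    #include "src/common/xstring.h"
    #include "src/slurmctld/slurmctld.h"

    const char plugin_name[] = "require_time job submit plugin (sketch)";
    const char plugin_type[] = "job_submit/require_time";
    const uint32_t plugin_version = SLURM_VERSION_NUMBER;

    extern int job_submit(job_desc_msg_t *job_desc, uint32_t submit_uid,
                          char **err_msg)
    {
        if (job_desc->time_limit == NO_VAL) {
            /* whatever lands in *err_msg is printed on the user's
             * terminal by sbatch/srun/salloc when the job is rejected */
            *err_msg = xstrdup("jobs must request a time limit (--time=...)");
            return ESLURM_INVALID_TIME_LIMIT;
        }
        return SLURM_SUCCESS;
    }

    extern int job_modify(job_desc_msg_t *job_desc, job_record_t *job_ptr,
                          uint32_t submit_uid, char **err_msg)
    {
        return SLURM_SUCCESS;
    }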
Worth a try, but the documentation says that by default the count is the same
as the number of files specified... so it should automatically be 1.
If you want to stop the node from going to INVAL, you can always set
config_overrides in slurm.conf. That will tell the node what it has, instead
of w
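A sketch of what that looks like (node line illustrative):

    # slurm.conf: trust the configured values instead of what slurmd detects
    SlurmdParameters=config_overrides
    NodeName=gpunode01 CPUs=64 RealMemory=512000 Gres=gpu:1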
Hi Xaver,
I think you are missing the "Count=..." part in gres.conf
It should read
NodeName=NName Name=gpu File=/dev/tty0 Count=1
in your case.
Regards,
Hermann
On 7/19/23 14:19, Xaver Stiensmeier wrote:
Okay,
thanks to S. Zhang I was able to figure out why nothing changed. While I
did restart slurmctld at the beginning of my tests, I didn't do so later.
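As a quick way to verify a gres.conf line like the one Hermann suggested,
these standard commands show what Slurm actually picked up (NName as in the
example above):

    slurmd -G                                # GRES configuration slurmd parsed on the node
    scontrol show node NName | grep -i Gres  # what the controller believes the node has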
Hmm, OK - but that is the only nvml.h I can find, as shown by the find
command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and
ran it successfully; do I need to install something else besides? A
Google search for 'CUDA SDK' leads directly to NVIDIA's page:
https://docs.nvidia.co
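In case it saves someone a search: the .run driver installer provides
libnvidia-ml.so but not the nvml.h header; the header ships with the CUDA
toolkit. Something along these lines (package name and paths vary with the
CUDA version, so treat them as an example, not gospel):

    apt-get install cuda-nvml-dev-12-2        # from NVIDIA's CUDA repo; name tracks the CUDA version
    find /usr/local/cuda* -name nvml.h        # confirm where the header landed
    ./configure --with-nvml=/usr/local/cuda   # point Slurm's configure at that tree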
Oops, I found my error: I forgot to remove JobCompHost. I found it after
reading this:
https://bugs.schedmd.com/show_bug.cgi?id=2322#c5
Sorry for the noise.
On 19/07/2023 14:51, Gérard Henry (AMU) wrote:
Hello all,
Is it possible to have this configuration? I installed Slurm on Ubuntu
20 LTS, but slurmctld refuses to start.
Hello all,
Is it possible to have this configuration? I installed Slurm on Ubuntu
20 LTS, but slurmctld refuses to start with messages:
[2023-07-19T14:37:59.563] Job completion MYSQL plugin loaded
[2023-07-19T14:37:59.563] debug: /var/log/slurm/jobcomp doesn't look
like a database name using
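For reference, the two jobcomp setups that message mixes up, as a sketch
(DB and host names illustrative). JobCompLoc is a file path for
jobcomp/filetxt but a database name for jobcomp/mysql, which is why a
leftover JobCompHost from one mode confuses the other:

    # file-based job completion logging
    JobCompType=jobcomp/filetxt
    JobCompLoc=/var/log/slurm/jobcomp

    # ...or MySQL-based; JobCompLoc is a DB name here
    #JobCompType=jobcomp/mysql
    #JobCompLoc=slurm_jobcomp_db
    #JobCompHost=dbserver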
Hello Lorenzo,
Lorenzo Bosio writes:
> I'm developing a job submit plugin to check if some conditions are met before
> a job runs.
> I'd need a way to notify the user about the plugin actions (i.e. why its job
> was killed and what to do), but after a lot of research I could only write to
>
Hi Lorenzo,
On 7/19/23 14:22, Lorenzo Bosio wrote:
> I'm developing a job submit plugin to check if some conditions are met
> before a job runs.
> I'd need a way to notify the user about the plugin actions (i.e. why its
> job was killed and what to do), but after a lot of research I could only
Hello everyone,
I'm developing a job submit plugin to check if some conditions are met
before a job runs.
I'd need a way to notify the user about the plugin actions (i.e. why its
job was killed and what to do), but after a lot of research I could
only write to logs and not the user shell.
The
Okay,
thanks to S. Zhang I was able to figure out why nothing changed. While I
did restart slurmctld at the beginning of my tests, I didn't do so
later, because I felt like it was unnecessary, but it is right there in
the fourth line of the log that this is needed. Somehow I misread it and
thoug
On 19/07/2023 11:47, Jan Andersen wrote:
I'm trying to build slurm with nvml support, but configure doesn't find it:
root@zorn:~/slurm-23.02.3# ./configure --with-nvml
...
checking for hwloc installation... /usr
checking for nvml.h... no
checking for nvmlInit in -lnvidia-ml... yes
configure: error: unable to locate libnvidia-ml.so and/or nvml.h
I'm trying to build slurm with nvml support, but configure doesn't find it:
root@zorn:~/slurm-23.02.3# ./configure --with-nvml
...
checking for hwloc installation... /usr
checking for nvml.h... no
checking for nvmlInit in -lnvidia-ml... yes
configure: error: unable to locate libnvidia-ml.so and/or nvml.h
Alright,
I tried a few more things, but I still wasn't able to get past: srun:
error: Unable to allocate resources: Invalid generic resource (gres)
specification.
I should mention that the node I am trying to test GPU with doesn't
really have a GPU, but Rob was so kind as to find out that you do n