> On Fri, Jul 5, 2024 at 12:19 PM Ward Poelmans
> via slurm-users wrote:
> Hi Ricardo,
>
> It should show up like this:
>
> Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
>
What's the meaning of (S:0-1) here?
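The (S:0-1) suffix is the socket affinity of that GRES, i.e. the sockets the
devices are local to, which Slurm derives from the Cores= binding in
gres.conf. As a hedged illustration only (device files, core ranges and
shard count below are assumptions, not taken from this thread), a gres.conf
along these lines would produce a similar Gres string:

  # gres.conf (hypothetical layout: 2 sockets, 8 cores each)
  Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Cores=0-7    # socket 0
  Name=gpu Type=gtx_1080_ti File=/dev/nvidia1 Cores=0-7
  Name=gpu Type=gtx_1080_ti File=/dev/nvidia2 Cores=8-15   # socket 1
  Name=gpu Type=gtx_1080_ti File=/dev/nvidia3 Cores=8-15
  Name=shard Count=16                                      # 4 shards per GPU

  # slurm.conf: matching node definition
  GresTypes=gpu,shard
  NodeName=gpunode Gres=gpu:gtx_1080_ti:4,shard:16 Sockets=2 CoresPerSocket=8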
I found the problem. It was not that this node was trying to reach some
other machine; it was the other way around: another machine (running a
controller) had this node in its config, and hence that controller was
trying to reach this one. It was a different Slurm cluster. I removed the
config from
Alright, understood.
On Sat, Jun 22, 2024 at 12:47 AM Christopher Samuel via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
>
> > I have 3500+ GPU cores available. You mean each GPU job requires at
> > least one CPU?
on the machine, one would expect a max of 4 jobs to run.
>
> Brian Andrus
>
> On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote:
> > I have a machine with a quad-core CPU and an Nvidia GPU with 3500+
> > cores. I want to run around 10 jobs in parallel on the GPU (mostly
>
I have a machine with a quad-core CPU and an Nvidia GPU with 3500+ cores.
I want to run around 10 jobs in parallel on the GPU (most of them are
CUDA-based jobs).
PROBLEM: Each job asks for only 100 shards (and usually runs for a minute or
so), so I should be able to run 3500/100 = 35 jobs in parallel, but
I don't know much about Slurm, but if you want to start troubleshooting then
you need to isolate the step where the error appears. From the output you
have posted, it looks like you are using some automated script to download,
extract and build Slurm. Look here:
"/bin/sh -c cd /tmp && wget
https://d
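One way to isolate the failing step is to run the script's stages by hand,
for example (the URL and version below are assumptions; substitute whatever
the script actually downloads):

  cd /tmp
  wget https://download.schedmd.com/slurm/slurm-23.11.7.tar.bz2
  tar -xaf slurm-23.11.7.tar.bz2
  cd slurm-23.11.7
  ./configure && make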
I enabled "debug3" logging and saw this in the node log:
error: mpi_conf_send_stepd: unable to resolve MPI plugin offset from
plugin_id=106. This error usually results from a job being submitted
against an MPI plugin which was not compiled into slurmd but was for job
submission command.
error: _se
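A hedged way to narrow a mismatch like this down is to compare the MPI
plugins each side knows about, e.g.:

  srun --mpi=list                       # plugins available to the submission side
  scontrol show config | grep -i mpi    # MpiDefault the cluster is running with

If the plugin named at submission time is not one slurmd was built with,
either rebuild with that plugin enabled or submit with an --mpi value that
both sides support.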
I have two machines. When I run "srun hostname" on one machine (it's both a
controller and a node) I get the hostname fine, but I get a "socket timed
out" error in these two situations:
1) "srun hostname" on 2nd machine (it's a node)
2) "srun -N 2 hostname" on controller
"scontrol show node" show
I have built Slurm 23.11.7 on two machines. Both are running Ubuntu 22.04.
While Slurm runs fine on one machine, it does not on the second. The first
machine is both a controller and a node, while the second machine is just a
node. On both machines, I built the Slurm Debian package as per the Slurm
do
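For reference, the Debian package route in the Slurm quick-start admin guide
is roughly the following (a hedged sketch; the tarball name is an
assumption):

  sudo apt-get install build-essential fakeroot devscripts equivs
  tar -xaf slurm-23.11.7.tar.bz2
  cd slurm-23.11.7
  sudo mk-build-deps -i debian/control   # install the build dependencies
  debuild -b -uc -us                     # build unsigned binary packages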
Getting this error when I run "make install":
echo >>"lib_ref.lo"
/bin/bash ../../libtool --tag=CC --mode=link gcc
-DNUMA_VERSION1_COMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread
-ggdb3 -Wall -g -O1 -fno-strict-aliasing -o lib_ref.la lib_ref.lo
-lpthread -lm -lresolv
libtool: link: a
h environment the job's executables are
> built against. It probably needs a couple of "similar" nodes to allow users
> benefiting from the job queue to send their jobs to the places where
> resources are available.
>
> Good luck with your setup
>
> Sincerely,
>
> S. Zhang
>
those nodes and with for the
> ones that have them.
>
> Generally, so long as versions are compatible, they can work together.
> You will need to be aware of differences for jobs and configs, but it is
> possible.
>
> Brian Andrus
>
> On 5/22/2024 12:45 AM, Arnuld via slurm-us
We have several nodes, most of which have different Linux distributions
(distro for short). The controller has a different distro as well. The only
common thing between the controller and all the nodes is that all of them
are x86_64.
I can install Slurm using the package manager on all the machines but this wil
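Whichever install method is used, the main thing to pin down on a
mixed-distro cluster is version skew, since slurmctld has to be at least as
new as every slurmd it talks to. A quick hedged check:

  slurmctld -V        # on the controller
  slurmd -V           # on each node
  sinfo --version     # from any client once things are up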
he prolog:
>
> MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
> mkdir -p $MY_XDG_RUNTIME_DIR
> echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"
>
> (in combination with private tmpfs per job).
>
> Ward
>
> On 15/05/2024 10:14, Arnuld via slurm-users wrote:
> > I am using the l
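For anyone wiring this up: the snippet quoted above is meant to run as a
TaskProlog, where any line printed as "export NAME=value" is injected into
the task's environment. A hedged sketch of the slurm.conf side (the script
path is an assumption):

  TaskProlog=/etc/slurm/task_prolog.sh   # executable script containing the lines above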
I am using the latest Slurm. It runs fine for scripts, but if I give it a
container then it kills it as soon as I submit the job. Is Slurm cleaning up
$XDG_RUNTIME_DIR before it should? This is the log:
[2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0
TaskId=-1
[2024-
I have installed Slurm and Podman. I have set Podman's default runtime to
"slurm" as per the documentation. The documentation says I need to choose
one of the oci.conf examples:
https://slurm.schedmd.com/containers.html#example
Which one should I use? runc? crun? nvidia?
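For what it's worth, the crun variant is often used for a rootless
scrun/Podman setup; its oci.conf looks roughly like the sketch below, but
treat the exact command patterns as assumptions and copy the lines for your
Slurm version from containers.html:

  EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
  RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
  RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
  RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
  RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
  RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t"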
I am using Slurm integrated with Podman. It runs the container fine and the
controller daemon log always says "WEXITSTATUS 0". The container also runs
successfully (it runs the Python test program with no errors).
But there are two things that I noticed:
- slurmd.log says: "error: _get_container_
I have integrated Podman with Slurm as per the docs (
https://slurm.schedmd.com/containers.html#podman-scrun) and when I do a
test run:
"podman run hello-world" (this runs fine)
$ podman run alpine hostname
executable file `/usr/bin/hostname` not found in $PATH: No such file or
directory
sru
I am trying to integrate Rootless Docker with Slurm. I have set up Rootless
Docker as per the docs "https://slurm.schedmd.com/containers.html". I have
scrun.lua, oci.conf (for crun) and slurm.conf in place. Then
"~/.config/docker/daemon.json" and
"~/.config/systemd/user/docker.service.d/override.