[slurm-users] Re: Using sharding

2024-07-05 Thread Arnuld via slurm-users
> On Fri, Jul 5, 2024 at 12:19 PM Ward Poelmans via slurm-users wrote:
> Hi Ricardo,
>
> It should show up like this:
>
> Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)

What's the meaning of (S:0-1) here?
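As far as I can tell from the GRES docs, the (S:0-1) suffix is the socket list the GRES is bound to (sockets 0 through 1), which Slurm derives from the core bindings in gres.conf. A minimal sketch of a gres.conf that could produce such a line; the device paths, core ranges, and shard count are assumptions:

  # gres.conf -- 4 GPUs across two sockets, 16 shards total (values assumed)
  Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Cores=0-7    # socket 0
  Name=gpu Type=gtx_1080_ti File=/dev/nvidia1 Cores=0-7    # socket 0
  Name=gpu Type=gtx_1080_ti File=/dev/nvidia2 Cores=8-15   # socket 1
  Name=gpu Type=gtx_1080_ti File=/dev/nvidia3 Cores=8-15   # socket 1
  Name=shard Count=16            # shards inherit the GPUs' socket affinity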

[slurm-users] Re: error: unpack_header: protocol_version 9472 not supported

2024-06-23 Thread Arnuld via slurm-users
I found the problem. It was not that this node was trying to reach some machine; it was the other way around: some other machine (running a controller) had this node in its config, and hence that controller was trying to reach this node. It was a different Slurm cluster. I removed the config from
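A sketch of the cleanup described above, assuming the default config path on the other cluster's controller:

  # on the other cluster's controller: find the stale node entry
  grep -n 'NodeName=' /etc/slurm/slurm.conf    # path assumed
  # after deleting the line naming this node, restart the controller
  sudo systemctl restart slurmctld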

[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-23 Thread Arnuld via slurm-users
Alright, understood.

On Sat, Jun 22, 2024 at 12:47 AM Christopher Samuel via slurm-users <slurm-users@lists.schedmd.com> wrote:
> On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
>
> > I have 3500+ GPU cores available. You mean each GPU job requires at
> > least one CPU?

[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-21 Thread Arnuld via slurm-users
> on the machine, one would expect a max of 4 jobs to run.
>
> Brian Andrus
>
> On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote:
> > I have a machine with a quad-core CPU and an Nvidia GPU with 3500+
> > cores. I want to run around 10 jobs in parallel on the GPU (mostly

[slurm-users] Can Not Use A Single GPU for Multiple Jobs

2024-06-20 Thread Arnuld via slurm-users
I have a machine with a quad-core CPU and an Nvidia GPU with 3500+ cores. I want to run around 10 jobs in parallel on the GPU (mostly CUDA-based jobs). PROBLEM: Each job asks for only 100 shards (and usually runs for a minute or so), so I should be able to run 3500/100 = 35 jobs in parallel, but
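As the replies upthread point out, the binding constraint here is CPU cores, not shards: every job needs at least one CPU, so a quad-core machine tops out at 4 concurrent jobs. A minimal sketch of such a submission; the script body and executable name are assumptions:

  #!/bin/bash
  #SBATCH --gres=shard:100     # 100 of the GPU's 3500+ shards
  #SBATCH --cpus-per-task=1    # each job still occupies one CPU core
  ./cuda_job                   # hypothetical CUDA executable

  # With 4 CPU cores, at most 4 of these run at once,
  # regardless of the 3500/100 = 35 shard arithmetic.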

[slurm-users] Re: Debian RPM build for arm64?

2024-06-13 Thread Arnuld via slurm-users
I don't know much about Slurm, but if you want to start troubleshooting then you need to isolate the step where the error appears. From the output you have posted, it looks like you are using some automated script to download, extract, and build Slurm. Look here: "/bin/sh -c cd /tmp && wget https://d

[slurm-users] Re: srun hostname - Socket timed out on send/recv operation

2024-06-11 Thread Arnuld via slurm-users
I enabled "debug3" logging and saw this in the node log: error: mpi_conf_send_stepd: unable to resolve MPI plugin offset from plugin_id=106. This error usually results from a job being submitted against an MPI plugin which was not compiled into slurmd but was for job submission command. error: _se

[slurm-users] srun hostname - Socket timed out on send/recv operation

2024-06-10 Thread Arnuld via slurm-users
I have two machines. When I run "srun hostname" on one machine (it's both a controller and a node) then I get the hostname fine, but I get a socket timed out error in these two situations: 1) "srun hostname" on the 2nd machine (it's a node) 2) "srun -N 2 hostname" on the controller. "scontrol show node" show
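Common first checks for this symptom, assuming munge authentication and the default ports (node name is a placeholder):

  munge -n | ssh node2 unmunge   # munge keys must match on both machines
  date; ssh node2 date           # clocks must agree or credentials are rejected
  nc -zv controller 6817         # slurmctld port reachable from the node (default port)
  nc -zv node2 6818              # slurmd port reachable from the controller (default port)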

[slurm-users] error: unpack_header: protocol_version 9472 not supported

2024-06-05 Thread Arnuld via slurm-users
I have built Slurm 23.11.7 on two machines. Both are running Ubuntu 22.04. While Slurm runs fine on one machine, on the 2nd machine it does not. The first machine is both a controller and a node while the 2nd machine is just a node. On both machines, I built the Slurm debian package as per the Slurm do
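An unsupported protocol_version generally means two different Slurm releases are talking to each other, so a quick sanity check (assuming the daemons are on $PATH):

  slurmd -V       # on each node
  slurmctld -V    # on the controller; all should print the same release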

[slurm-users] Slurm Build Error

2024-05-23 Thread Arnuld via slurm-users
Getting this error when I run "make install":

  echo >>"lib_ref.lo"
  /bin/bash ../../libtool --tag=CC --mode=link gcc -DNUMA_VERSION1_COMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -o lib_ref.la lib_ref.lo -lpthread -lm -lresolv
  libtool: link: a

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-22 Thread Arnuld via slurm-users
> h environment the job's executables are built against. It probably
> needs a couple of "similar" nodes to allow users benefiting from the
> job queue to send their job to the place where available.
>
> Good luck with your setup
>
> Sincerely,
>
> S. Zhang

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-22 Thread Arnuld via slurm-users
> those nodes and with for the ones that have them.
>
> Generally, so long as versions are compatible, they can work together.
> You will need to be aware of differences for jobs and configs, but it is
> possible.
>
> Brian Andrus
>
> On 5/22/2024 12:45 AM, Arnuld via slurm-us

[slurm-users] Building Slurm debian package vs building from source

2024-05-22 Thread Arnuld via slurm-users
We have several nodes, most of which have different Linux distributions (distro for short). The controller has a different distro as well. The only common thing between the controller and all the nodes is that all of them are x86_64. I can install Slurm using the package manager on all the machines but this wil
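For reference, roughly the documented Debian package build from the Slurm source tree (paraphrased from the Slurm docs; verify against your version):

  # sketch per the Slurm admin docs; version number assumed
  apt-get install build-essential fakeroot devscripts equivs
  tar -xaf slurm-23.11.7.tar.bz2 && cd slurm-23.11.7
  mk-build-deps -i debian/control   # pull in the build dependencies
  debuild -b -uc -us                # produce the unsigned .deb packages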

[slurm-users] Re: Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?

2024-05-15 Thread Arnuld via slurm-users
> he prolog:
>
> MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
> mkdir -p $MY_XDG_RUNTIME_DIR
> echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"
>
> (in combination with private tmpfs per job).
>
> Ward
>
> On 15/05/2024 10:14, Arnuld via slurm-users wrote:
> > I am using the l
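Worth noting: the echo "export ..." convention only reaches the job's environment from a TaskProlog (a plain Prolog runs outside the task). A minimal sketch; the script path and use of TaskProlog are assumptions:

  #!/bin/bash
  # /etc/slurm/task_prolog.sh (path assumed) -- stdout lines starting
  # with "export" become environment variables in the task
  MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
  mkdir -p "$MY_XDG_RUNTIME_DIR"
  echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"

  # slurm.conf
  TaskProlog=/etc/slurm/task_prolog.sh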

[slurm-users] Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?

2024-05-15 Thread Arnuld via slurm-users
I am using the latest Slurm. It runs fine for scripts, but if I give it a container then it kills it as soon as I submit the job. Is Slurm cleaning up $XDG_RUNTIME_DIR before it should? This is the log:

  [2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0 TaskId=-1
  [2024-

[slurm-users] Which "oci.conf" to use?

2024-05-13 Thread Arnuld via slurm-users
I have installed Slurm and Podman. I have set Podman's default runtime to "slurm" as per the documentation. The documentation says I need to choose one oci.conf: https://slurm.schedmd.com/containers.html#example Which one should I use? runc? crun? nvidia?
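The usual rule of thumb is to pick whichever runtime is actually installed on the compute nodes (runc or crun), and the nvidia variant only when GPU passthrough via nvidia-container-runtime is needed. For crun, a sketch of the shape of the example on that page (paraphrased from memory of containers.html; check it against your Slurm version):

  # oci.conf -- rootless crun; paraphrased sketch, verify against the docs
  EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
  RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
  RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
  RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill %n.%u.%j.%s.%t"
  RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
  RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t"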

[slurm-users] RunTimeQuery never configured in oci.conf

2024-05-10 Thread Arnuld via slurm-users
I am using Slurm integrated with Podman. It runs the container fine and the controller daemon log always says "WEXITSTATUS 0". The container also runs successfully (it runs the Python test program with no errors). But there are two things that I noticed: - slurmd.log says: "error: _get_container_

[slurm-users] Slurm With Podman - No child processes error

2024-05-08 Thread ARNULD via slurm-users
I have integrated Podman with Slurm as per the docs (https://slurm.schedmd.com/containers.html#podman-scrun) and when I do a test run:

  "podman run hello-world"   (this runs fine)

  $ podman run alpine hostname
  executable file `/usr/bin/hostname` not found in $PATH: No such file or directory

  sru
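For reference, the Podman-side wiring that doc section describes is roughly this (paraphrased from memory; the scrun path is an assumption):

  # ~/.config/containers/containers.conf (scrun path assumed)
  [engine]
  runtime = "slurm"

  [engine.runtimes]
  slurm = ["/usr/local/bin/scrun"]

One thing the containers guide stresses is that the image bundle must be visible on the node where the job actually executes, so a path failure like the one above may simply mean the rootfs is not on storage shared with the compute node.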

[slurm-users] Rootless Docker Errors with Slurm

2024-05-06 Thread ARNULD via slurm-users
I am trying to integrate Rootless Docker with Slurm. I have set up Rootless Docker as per the docs "https://slurm.schedmd.com/containers.html". I have scrun.lua, oci.conf (for crun) and slurm.conf in place. Then "~/.config/docker/daemon.json" and "~/.config/systemd/user/docker.service.d/override.
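For comparison, the daemon.json shape the containers guide gives for rootless Docker is roughly the following (paraphrased from memory; the scrun path is an assumption, and JSON allows no inline comments):

  ~/.config/docker/daemon.json (scrun path assumed):
  {
    "default-runtime": "slurm",
    "runtimes": {
      "slurm": { "path": "/usr/local/bin/scrun" }
    },
    "experimental": true,
    "iptables": false,
    "bridge": "none",
    "no-new-privileges": true,
    "rootless": true,
    "selinux-enabled": false
  }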