Dear Hermann,
Sorry to come back to you, but just so that I understand: if I run the
following script:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --time=24:00:00
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --job-name="test_job"
#SBATCH -o stdout_%j
#SBATCH -e stderr_%j
touch test.txt
# Print the hostname of the allocated node
echo "Running on host: $(hostname)"
# Print the start time
echo "Job started at: $(date)"
# Perform a simple task that takes a few minutes
echo "Starting the task..."
sleep 60
echo "GPU UUIDs:"
nvidia-smi --query-gpu=uuid --format=csv,noheader
echo $CUDA_VISIBLE_DEVICES
echo "Task completed."
# Print the end time
echo "Job finished at: $(date)"
I'm getting the following results:
Starting the task...
GPU UUIDs:
GPU UUIDs:
GPU-d4e002a9-409f-79bb-70e1-56c1a473a188
GPU-33b728e2-0396-368b-b9c3-8f828ca145b1
GPU-7d90f7d8-aadf-ba95-2409-8c57bd40d24b
GPU-30faa03a-0782-4b6c-dda2-e108159ba953
GPU-37d09257-2582-8080-223a-dd5a646fba43
GPU-c71cbb10-4368-d327-e0e5-56372aa4f10f
GPU-a413a75a-15b2-063e-638f-bde063af5c8e
GPU-bf12181a-e615-dcd4-5da2-9a518ae1af5d
GPU-dfec21c4-e30d-5a36-599d-eef2fd354809
GPU-15a11fe2-33f2-cd65-09f0-9897ba057a0c
GPU-2d971e69-8147-8221-a055-e26573950f91
GPU-22ee3c89-fed1-891f-96bb-6bbf27a2cc4b
0,1,2,3
0,1,2,3
Task completed.
But for the command echo $CUDA_VISIBLE_DEVICES I would expect to get:
0,1,2,3
0,1,2,3,4,5,6,7
Is this happening for the same reason that I had problems with hostname?
Thank you,
Mihai
On 2024-05-28 13:31, Hermann Schwärzler wrote:
Dear Mihai,
you are not asking Slurm to provide you with any GPUs:
#####SBATCH --gpus=12
So it doesn't reserve any for you and as a consequence also does not
set CUDA_VISIBLE_DEVICES for you.
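If you want it to be set, you have to actually request GPUs, e.g.
something like this (just a sketch, adjust the numbers to what you
really need):
#SBATCH --gpus=12
or, to ask for a fixed number of GPUs on every allocated node:
#SBATCH --gres=gpu:4
With such a request Slurm reserves the GPUs for your job and should
set CUDA_VISIBLE_DEVICES for it.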
nvidia-smi works because it looks like you are not using cgroups at
all, or at least not with "ConstrainDevices=yes" in e.g. cgroup.conf.
So it "sees" all the GPUs installed in the node it's running on, even
if none of them is reserved for you by Slurm.
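In case you want that constraining: it is switched on in cgroup.conf
with something like (again just a sketch, the rest of your cgroup.conf
stays as it is):
ConstrainDevices=yes
With that active, nvidia-smi inside a job only "sees" the GPUs that
Slurm has actually allocated to that job.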
Regards,
Hermann
On 5/28/24 12:07, Mihai Ciubancan wrote:
Dear Hermann,
Dear James,
Thank you both for your answers!
I tried using "bash -c" as you suggested and it worked.
But when I try the following script, the "bash -c" trick doesn't
work:
#!/bin/bash
#SBATCH --partition=eli
#SBATCH --time=24:00:00
#SBATCH --nodelist=mihaigpu2,mihai-x8640
#####SBATCH --gpus=12
#SBATCH --exclusive
#SBATCH --job-name="test_job"
#SBATCH -o /data/mihai/stdout_%j
#SBATCH -e /data/mihai/stderr_%j
touch test.txt
# Print the hostname of the allocated node
srun bash -c 'echo Running on host: $(hostname)'
# Print the start time
echo "Job started at: $(date)"
# Perform a simple task that takes a few minutes
echo "Starting the task..."
sleep 20
srun echo "GPU UUIDs:"
srun nvidia-smi --query-gpu=uuid --format=csv,noheader
srun bash -c 'echo $CUDA_VISIBLE_DEVICES'
##echo "Task completed."
# Print the end time
echo "Job finished at: $(date)"
I don't get any output from the command srun bash -c 'echo
$CUDA_VISIBLE_DEVICES':
Running on host: mihaigpu2
Running on host: mihai-x8640
Job started at: Tue May 28 13:02:59 EEST 2024
Starting the task...
GPU UUIDs:
GPU UUIDs:
GPU-d4e002a9-409f-79bb-70e1-56c1a473a188
GPU-33b728e2-0396-368b-b9c3-8f828ca145b1
GPU-7d90f7d8-aadf-ba95-2409-8c57bd40d24b
GPU-30faa03a-0782-4b6c-dda2-e108159ba953
GPU-37d09257-2582-8080-223a-dd5a646fba43
GPU-c71cbb10-4368-d327-e0e5-56372aa4f10f
GPU-a413a75a-15b2-063e-638f-bde063af5c8e
GPU-bf12181a-e615-dcd4-5da2-9a518ae1af5d
GPU-dfec21c4-e30d-5a36-599d-eef2fd354809
GPU-15a11fe2-33f2-cd65-09f0-9897ba057a0c
GPU-2d971e69-8147-8221-a055-e26573950f91
GPU-22ee3c89-fed1-891f-96bb-6bbf27a2cc4b
Job finished at: Tue May 28 13:03:20 EEST 2024
...I'm not interested in the output of the other 'echo' commands,
besides the one with the hostname; that's why I didn't change them.
Best,
Mihai
On 2024-05-28 12:23, Hermann Schwärzler via slurm-users wrote:
Hi Mihai,
this is a problem that is not Slurm related. It's rather about:
"when does command substitution happen?"
When you write
srun echo Running on host: $(hostname)
$(hostname) is replaced by the output of the hostname command *before*
the line is "submitted" to srun. Which means that srun will happily
run it on any (remote) node, but with the name of the host the
substitution happened on already filled in.
If you want to avoid this, one possible solution is
srun bash -c 'echo Running on host: $(hostname)'
In this case the command substitution happens only after srun has
started the process on a (potentially remote) node.
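To make the difference explicit, here is a small sketch of what I
would expect inside an sbatch script running on two nodes:
srun echo Running on host: $(hostname)
# $(hostname) is expanded by the shell that runs the batch script,
# so every task prints the name of that one node.
srun bash -c 'echo Running on host: $(hostname)'
# the single quotes keep $(hostname) untouched until bash runs on
# each allocated node, so every task prints its own node's name.
The same holds for variables like $CUDA_VISIBLE_DEVICES: unquoted they
are expanded by the submitting shell, inside single quotes only on the
node the step runs on.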
Regards,
Hermann
On 5/28/24 10:54, Mihai Ciubancan via slurm-users wrote:
Hello,
My name is Mihai and I have an issue with a small GPU cluster managed
with Slurm 22.05.11. I get two different outputs when I try to
find out the names of the nodes (one correct and one wrong). The
script is:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/data/mihai/res.txt
#SBATCH --partition=eli
#SBATCH --nodes=2
srun echo Running on host: $(hostname)
srun hostname
srun sleep 15
And the output look like this:
cat res.txt
Running on host: mihai-x8640
Running on host: mihai-x8640
mihaigpu2
mihai-x8640
As you can see, the output of the command 'srun echo Running on host:
$(hostname)' is the same, as if the job was running twice on the same
node, while the command 'srun hostname' gives me the correct
output.
Do you have any idea why the outputs of the two commands are
different?
Thank you,
Mihai
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com