I'm using this TensorRT tutorial <https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleMovieLensMPS> with MPS on Slurm 20.02 on a Bright Cluster 8.2 system.
Here are the contents of my mpsmovietest sbatch file:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm openmpi/cuda/64 cm-ml-python3deps/3.2.3 cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3 tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2

When I run this through Slurm I get the errors below, so perhaps there is a pathing issue that does not occur when I run it with srun alone:

Could not find movielens_ratings.txt in data directories:
        data/samples/movielens/
        data/movielens/
&&&& FAILED

I'm trying to use srun to test this, but it always fails because it appears to be trying all the nodes. We only have 3 compute nodes. As I'm writing this, node002 and node003 are in use by other users, so I just want to use node001.

srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out

Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   67C    P0   241W / 250W |  32167MiB / 32510MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    428996      C   python3.6                                  32151MiB |
+-----------------------------------------------------------------------------+

Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20 keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6

&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[03/14/2020-16:45:10] [E] [TRT] CUDA initialization failure with error 999. Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[03/14/2020-16:45:10] [E] Could not create builder.
&&&& FAILED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1

So is my syntax wrong with srun?
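My guess is that the srun options need to come before the script path so that srun parses them itself rather than passing them through to my script as arguments, i.e. something like:

srun --nodes=1 --nodelist=node001 --gres=gpu:1 --job-name=MPSMovieTest --output=mpstest.out /home/mydir/mpsmovietest

(I'm also not sure the -Z flag should be in there at all.) And for the sbatch failure above, I'm wondering whether I just need to cd into whichever directory holds the sample data before calling the binary, something like the lines below, where the cd target is only a placeholder since I still need to find where movielens_ratings.txt actually lives on our system:

cd /path/to/dir/containing/data/movielens    # placeholder, not a real path on our cluster
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2

But I'd appreciate confirmation before I keep guessing.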
MPS is running:

$ ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:54 /cm/local/apps/cuda-

When node002 is available the program runs correctly, albeit with a warning about failing to write the log files:

srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out

Thu Apr 16 10:08:52 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |     41MiB / 32510MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    420596      C   nvidia-cuda-mps-server                      29MiB   |
+-----------------------------------------------------------------------------+

Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs will be available.
An instance of this daemon is already running
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs will be available.

Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20 keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6

&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt
[03/16/2020-10:08:53] [I] Begin parsing model...
[03/16/2020-10:08:53] [I] End parsing model...
[03/16/2020-10:08:57] [W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [I] [TRT] Detected 2 inputs and 3 output network tensors.
[03/16/2020-10:08:57] [W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [I] End building engine...
[03/16/2020-10:09:01] [W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [I] Done execution in process: 99395 . Duration : 315.744 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 0 | Expected Item : 128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 1 | Expected Item : 133 | Predicted Item : 133 |
[03/16/2020-10:09:01] [W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [I] Done execution in process: 99396 . Duration : 306.944 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 0 | Expected Item : 128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 1 | Expected Item : 133 | Predicted Item : 133 |
[03/16/2020-10:09:02] [I] Number of processes executed : 2. Total MPS Run Duration : 4361.73 milliseconds.
&&&& PASSED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2

Is something incorrect in the sbatch file?

Thanks!
Rob