Dear all,

I have a cluster with 9 nodes (cmbc[1530-1538]). Each node has 2 CPUs, and each CPU has 32 cores. When I submitted a heterogeneous job twice, the second job terminated unexpectedly. This problem has been bothering me all day. The Slurm version is 18.08.5, and here is the job script:

**************
#!/bin/bash
#SBATCH -J FIRE
#SBATCH -o log.heter.%j
#SBATCH -e log.heter.%j
#SBATCH --comment=WRF
#SBATCH --mem=20G
#SBATCH -p largemem
#SBATCH -n 64 -N 2
#SBATCH packjob
#SBATCH -J HAHA1
#SBATCH -p largemem
#SBATCH -n 16 -N 1
#SBATCH --mem=20G
#SBATCH packjob
#SBATCH -J HAHA2
#SBATCH -w cmbc1533
#SBATCH -p largemem
#SBATCH -n 8 -N 1
#SBATCH --mem=20G

module load compiler/intel/composer_xe_2018.1.163
module load mpi/intelmpi/2018.1
export I_MPI_PMI_LIBRARY=/opt/slurm18/lib/libpmi.so
time srun --mpi=pmi2 ./inter_fire 960000000 : ./intel_fire 960000000 : ./intel_fire 960000000
date
*********************

Here is the error of the terminated job:

Appreciatively,
Menglong

Best wishes for your work!
Name: Hu Menglong (胡梦龙)
Mobile: 135-6164-9610
Department: HPC
中科曙光国际信息产业有限公司 (Sugon)
Sugon Building (Building 3), 78 Zhuzhou Road, Laoshan District, Qingdao 266000