On Sat, May 25, 2024 at 12:02 AM Hermann Schwärzler via slurm-users
<slurm-users@lists.schedmd.com> wrote:
>
> Hi Zhao,
>
> my guess is that in your faster case you are using hyperthreading
> whereas in the Slurm case you don't.
>
> Can you check what performance you get when you add
>
> #SBATCH --hint=multithread
>
> to your slurm script?

I added the above directive to the Slurm script, only to find that the
job gets stuck there indefinitely. Here are the results 10 minutes
after the job was submitted:


werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
cat sub.sh.o6
#######################################################
date                    = Sat May 25 07:31:31 CST 2024
hostname                = x13dai-t
pwd                     =
/home/werner/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV
sbatch                  = /usr/bin/sbatch

WORK_DIR                =
SLURM_SUBMIT_DIR        =
/home/werner/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV
SLURM_JOB_NUM_NODES     = 1
SLURM_NTASKS            = 36
SLURM_NTASKS_PER_NODE   =
SLURM_CPUS_PER_TASK     =
SLURM_JOBID             = 6
SLURM_JOB_NODELIST      = localhost
SLURM_NNODES            = 1
SLURMTMPDIR             =
#######################################################

 running   36 mpi-ranks, on    1 nodes
 distrk:  each k-point on   36 cores,    1 groups
 distr:  one band on    4 cores,    9 groups
 vasp.6.4.3 19Mar24 (build May 17 2024 09:27:19) complex

 POSCAR found type information on POSCAR Cr
 POSCAR found :  1 types and      72 ions
 Reading from existing POTCAR
 scaLAPACK will be used
 Reading from existing POTCAR
 -----------------------------------------------------------------------------
|                                                                             |
|               ----> ADVICE to this user running VASP <----                  |
|                                                                             |
|     You have a (more or less) 'large supercell' and for larger cells it     |
|     might be more efficient to use real-space projection operators.         |
|     Therefore, try LREAL= Auto in the INCAR file.                           |
|     Mind: For very accurate calculation, you might also keep the            |
|     reciprocal projection scheme (i.e. LREAL=.FALSE.).                      |
|                                                                             |
 -----------------------------------------------------------------------------

 LDA part: xc-table for (Slater+PW92), standard interpolation
 POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ... GRIDC
 FFT: planning ... GRID_SOFT
 FFT: planning ... GRID
 WAVECAR not read


> Another difference between the two might be
> a) the communication channel/interface that is used.

I tried launching with `mpirun', `mpiexec', and `srun --mpi=pmi2', and
they all show the same behavior as described above.
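
For reference, the launcher line in sub.sh in these attempts looks roughly
like one of the following (a sketch only; the explicit --cpu-bind=cores /
--bind-to core flags are not from my original script, just one way to pin
one rank per physical core, and the mpirun form assumes Open MPI):

  # in place of the plain "mpirun vasp_std"
  srun --mpi=pmi2 --cpu-bind=cores vasp_std
  # or, if the MPI library is Open MPI:
  mpirun -np $SLURM_NTASKS --bind-to core vasp_std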

> b) the number of nodes involved: when using mpirun you might run things
> on more than one node.

This is a single-node cluster with two sockets.
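
To be concrete about the node, this is how I can compare the hardware
layout with what Slurm is configured to see (standard tools only; in the
slurm.conf line below, Sockets=2 matches what I wrote above, while the
CoresPerSocket/ThreadsPerCore numbers are placeholders, not verified values):

  lscpu | grep -E '^Socket|^Core|^Thread|^CPU\(s\)'
  slurmd -C     # prints the node layout that slurmd itself detects
  # the corresponding slurm.conf definition would look something like:
  # NodeName=localhost Sockets=2 CoresPerSocket=18 ThreadsPerCore=2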

> Regards,
> Hermann

Regards,
Zhao

> On 5/24/24 15:32, Hongyi Zhao via slurm-users wrote:
> > Dear Slurm Users,
> >
> > I am experiencing a significant performance discrepancy when running
> > the same VASP job through the Slurm scheduler compared to running it
> > directly with mpirun. I am hoping for some insights or advice on how
> > to resolve this issue.
> >
> > System Information:
> >
> > Slurm Version: 21.08.5
> > OS: Ubuntu 22.04.4 LTS (Jammy)
> >
> >
> > Job Submission Script:
> >
> > #!/usr/bin/env bash
> > #SBATCH -N 1
> > #SBATCH -D .
> > #SBATCH --output=%j.out
> > #SBATCH --error=%j.err
> > ##SBATCH --time=2-00:00:00
> > #SBATCH --ntasks=36
> > #SBATCH --mem=64G
> >
> > echo '#######################################################'
> > echo "date                    = $(date)"
> > echo "hostname                = $(hostname -s)"
> > echo "pwd                     = $(pwd)"
> > echo "sbatch                  = $(which sbatch | xargs realpath -e)"
> > echo ""
> > echo "WORK_DIR                = $WORK_DIR"
> > echo "SLURM_SUBMIT_DIR        = $SLURM_SUBMIT_DIR"
> > echo "SLURM_JOB_NUM_NODES     = $SLURM_JOB_NUM_NODES"
> > echo "SLURM_NTASKS            = $SLURM_NTASKS"
> > echo "SLURM_NTASKS_PER_NODE   = $SLURM_NTASKS_PER_NODE"
> > echo "SLURM_CPUS_PER_TASK     = $SLURM_CPUS_PER_TASK"
> > echo "SLURM_JOBID             = $SLURM_JOBID"
> > echo "SLURM_JOB_NODELIST      = $SLURM_JOB_NODELIST"
> > echo "SLURM_NNODES            = $SLURM_NNODES"
> > echo "SLURMTMPDIR             = $SLURMTMPDIR"
> > echo '#######################################################'
> > echo ""
> >
> > module purge > /dev/null 2>&1
> > module load vasp
> > ulimit -s unlimited
> > mpirun vasp_std
> >
> >
> > Performance Observation:
> >
> > When running the job through Slurm:
> >
> > werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
> > grep LOOP OUTCAR
> >        LOOP:  cpu time     14.4893: real time     14.5049
> >        LOOP:  cpu time     14.3538: real time     14.3621
> >        LOOP:  cpu time     14.3870: real time     14.3568
> >        LOOP:  cpu time     15.9722: real time     15.9018
> >        LOOP:  cpu time     16.4527: real time     16.4370
> >        LOOP:  cpu time     16.7918: real time     16.7781
> >        LOOP:  cpu time     16.9797: real time     16.9961
> >        LOOP:  cpu time     15.9762: real time     16.0124
> >        LOOP:  cpu time     16.8835: real time     16.9008
> >        LOOP:  cpu time     15.2828: real time     15.2921
> >       LOOP+:  cpu time    176.0917: real time    176.0755
> >
> > When running the job directly with mpirun:
> >
> >
> > werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
> > mpirun -n 36 vasp_std
> > werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
> > grep LOOP OUTCAR
> >        LOOP:  cpu time      9.0072: real time      9.0074
> >        LOOP:  cpu time      9.0515: real time      9.0524
> >        LOOP:  cpu time      9.1896: real time      9.1907
> >        LOOP:  cpu time     10.1467: real time     10.1479
> >        LOOP:  cpu time     10.2691: real time     10.2705
> >        LOOP:  cpu time     10.4330: real time     10.4340
> >        LOOP:  cpu time     10.9049: real time     10.9055
> >        LOOP:  cpu time      9.9718: real time      9.9714
> >        LOOP:  cpu time     10.4511: real time     10.4470
> >        LOOP:  cpu time      9.4621: real time      9.4584
> >       LOOP+:  cpu time    110.0790: real time    110.0739
> >
> >
> > Could you provide any insights or suggestions on what might be causing
> > this performance issue? Are there any specific configurations or
> > settings in Slurm that I should check or adjust to align the
> > performance more closely with the direct mpirun execution?
> >
> > Thank you for your time and assistance.
> >
> > Best regards,
> > Zhao

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
