On Sat, May 25, 2024 at 7:50 AM Hongyi Zhao <hongyi.z...@gmail.com> wrote:
>
> On Sat, May 25, 2024 at 12:02 AM Hermann Schwärzler via slurm-users
> <slurm-users@lists.schedmd.com> wrote:
> >
> > Hi Zhao,
> >
> > my guess is that in your faster case you are using hyperthreading,
> > whereas in the Slurm case you aren't.
> >
> > Can you check what performance you get when you add
> >
> > #SBATCH --hint=multithread
> >
> > to your Slurm script?
>
> I tried adding the above directive to the Slurm script, only to find
> that the job gets stuck there forever. Here is the output 10 minutes
> after the job was submitted:
>
>
> werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
> cat sub.sh.o6
> #######################################################
> date                    = Sat May 25 07:31:31 CST 2024
> hostname                = x13dai-t
> pwd                     =
> /home/werner/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV
> sbatch                  = /usr/bin/sbatch
>
> WORK_DIR                =
> SLURM_SUBMIT_DIR        =
> /home/werner/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV
> SLURM_JOB_NUM_NODES     = 1
> SLURM_NTASKS            = 36
> SLURM_NTASKS_PER_NODE   =
> SLURM_CPUS_PER_TASK     =
> SLURM_JOBID             = 6
> SLURM_JOB_NODELIST      = localhost
> SLURM_NNODES            = 1
> SLURMTMPDIR             =
> #######################################################
>
>  running   36 mpi-ranks, on    1 nodes
>  distrk:  each k-point on   36 cores,    1 groups
>  distr:  one band on    4 cores,    9 groups
>  vasp.6.4.3 19Mar24 (build May 17 2024 09:27:19) complex
>
>  POSCAR found type information on POSCAR Cr
>  POSCAR found :  1 types and      72 ions
>  Reading from existing POTCAR
>  scaLAPACK will be used
>  Reading from existing POTCAR
>  -----------------------------------------------------------------------------
> |                                                                             |
> |               ----> ADVICE to this user running VASP <----                  |
> |                                                                             |
> |     You have a (more or less) 'large supercell' and for larger cells it     |
> |     might be more efficient to use real-space projection operators.         |
> |     Therefore, try LREAL= Auto in the INCAR file.                           |
> |     Mind: For very accurate calculation, you might also keep the            |
> |     reciprocal projection scheme (i.e. LREAL=.FALSE.).                      |
> |                                                                             |
>  -----------------------------------------------------------------------------
>
>  LDA part: xc-table for (Slater+PW92), standard interpolation
>  POSCAR, INCAR and KPOINTS ok, starting setup
>  FFT: planning ... GRIDC
>  FFT: planning ... GRID_SOFT
>  FFT: planning ... GRID
>  WAVECAR not read

Ultimately, I found that the cause of the problem was that
hyper-threading was enabled by default in the BIOS. After disabling
hyper-threading, the computational efficiency is consistent between
running through Slurm and running mpirun directly. Therefore, it
appears that hyper-threading should not be enabled in the BIOS when
using Slurm.
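
In case it helps anyone hitting the same symptom, here is a minimal
sketch of how to check and declare this; the slurm.conf values below
are only assumptions for a dual-socket, 36-core box and should be
taken from the output of `slurmd -C' on the actual node:

    # Is SMT/hyper-threading currently active?
    lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

    # What Slurm itself detects on this node (usable for slurm.conf)
    slurmd -C

    # If hyper-threading stays enabled in the BIOS, declare it in the
    # node definition, e.g. (illustrative values only):
    #   NodeName=x13dai-t Sockets=2 CoresPerSocket=18 ThreadsPerCore=2
    # and request one task per physical core in the job script with
    #   #SBATCH --hint=nomultithread

My guess (not verified) is that with hyper-threading enabled but not
declared, Slurm's CPU binding packed the 36 MPI ranks onto hyper-thread
siblings of the same physical cores, whereas bare mpirun spread them
across all physical cores, which would match the roughly 1.6x gap in
the LOOP+ timings (176 s vs. 110 s).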

>
> > Another difference between the two might be
> > a) the communication channel/interface that is used.
>
> I tried `mpirun', `mpiexec', and `srun --mpi pmi2', and they all
> behave similarly, as described above.
>
> > b) the number of nodes involved: when using mpirun you might run things
> > on more than one node.
>
> This is a single-node cluster with two sockets.
>
> > Regards,
> > Hermann
>
> Regards,
> Zhao
>
> > On 5/24/24 15:32, Hongyi Zhao via slurm-users wrote:
> > > Dear Slurm Users,
> > >
> > > I am experiencing a significant performance discrepancy when running
> > > the same VASP job through the Slurm scheduler compared to running it
> > > directly with mpirun. I am hoping for some insights or advice on how
> > > to resolve this issue.
> > >
> > > System Information:
> > >
> > > Slurm Version: 21.08.5
> > > OS: Ubuntu 22.04.4 LTS (Jammy)
> > >
> > >
> > > Job Submission Script:
> > >
> > > #!/usr/bin/env bash
> > > #SBATCH -N 1
> > > #SBATCH -D .
> > > #SBATCH --output=%j.out
> > > #SBATCH --error=%j.err
> > > ##SBATCH --time=2-00:00:00
> > > #SBATCH --ntasks=36
> > > #SBATCH --mem=64G
> > >
> > > echo '#######################################################'
> > > echo "date                    = $(date)"
> > > echo "hostname                = $(hostname -s)"
> > > echo "pwd                     = $(pwd)"
> > > echo "sbatch                  = $(which sbatch | xargs realpath -e)"
> > > echo ""
> > > echo "WORK_DIR                = $WORK_DIR"
> > > echo "SLURM_SUBMIT_DIR        = $SLURM_SUBMIT_DIR"
> > > echo "SLURM_JOB_NUM_NODES     = $SLURM_JOB_NUM_NODES"
> > > echo "SLURM_NTASKS            = $SLURM_NTASKS"
> > > echo "SLURM_NTASKS_PER_NODE   = $SLURM_NTASKS_PER_NODE"
> > > echo "SLURM_CPUS_PER_TASK     = $SLURM_CPUS_PER_TASK"
> > > echo "SLURM_JOBID             = $SLURM_JOBID"
> > > echo "SLURM_JOB_NODELIST      = $SLURM_JOB_NODELIST"
> > > echo "SLURM_NNODES            = $SLURM_NNODES"
> > > echo "SLURMTMPDIR             = $SLURMTMPDIR"
> > > echo '#######################################################'
> > > echo ""
> > >
> > > module purge > /dev/null 2>&1
> > > module load vasp
> > > ulimit -s unlimited
> > > mpirun vasp_std
> > >
> > >
> > > Performance Observation:
> > >
> > > When running the job through Slurm:
> > >
> > > werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
> > > grep LOOP OUTCAR
> > >        LOOP:  cpu time     14.4893: real time     14.5049
> > >        LOOP:  cpu time     14.3538: real time     14.3621
> > >        LOOP:  cpu time     14.3870: real time     14.3568
> > >        LOOP:  cpu time     15.9722: real time     15.9018
> > >        LOOP:  cpu time     16.4527: real time     16.4370
> > >        LOOP:  cpu time     16.7918: real time     16.7781
> > >        LOOP:  cpu time     16.9797: real time     16.9961
> > >        LOOP:  cpu time     15.9762: real time     16.0124
> > >        LOOP:  cpu time     16.8835: real time     16.9008
> > >        LOOP:  cpu time     15.2828: real time     15.2921
> > >       LOOP+:  cpu time    176.0917: real time    176.0755
> > >
> > > When running the job directly with mpirun:
> > >
> > >
> > > werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
> > > mpirun -n 36 vasp_std
> > > werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
> > > grep LOOP OUTCAR
> > >        LOOP:  cpu time      9.0072: real time      9.0074
> > >        LOOP:  cpu time      9.0515: real time      9.0524
> > >        LOOP:  cpu time      9.1896: real time      9.1907
> > >        LOOP:  cpu time     10.1467: real time     10.1479
> > >        LOOP:  cpu time     10.2691: real time     10.2705
> > >        LOOP:  cpu time     10.4330: real time     10.4340
> > >        LOOP:  cpu time     10.9049: real time     10.9055
> > >        LOOP:  cpu time      9.9718: real time      9.9714
> > >        LOOP:  cpu time     10.4511: real time     10.4470
> > >        LOOP:  cpu time      9.4621: real time      9.4584
> > >       LOOP+:  cpu time    110.0790: real time    110.0739
> > >
> > >
> > > Could you provide any insights or suggestions on what might be causing
> > > this performance issue? Are there any specific configurations or
> > > settings in Slurm that I should check or adjust to align the
> > > performance more closely with the direct mpirun execution?
> > >
> > > Thank you for your time and assistance.
> > >
> > > Best regards,
> > > Zhao
> >

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
