Well, does it also crash when you run it with two nodes in the normal way (not using heterogeneous jobs)?
#!/bin/bash
#SBATCH --job-name=myQE_2Nx2MPI
#SBATCH --output=big-mem
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=16g
#SBATCH --partition=QUARTZ
#SBATCH --account=z5
#
srun pw.x -i mos2.rlx.in

On Thu, Mar 28, 2019 at 16:57, Mahmood Naderan <mahmood...@gmail.com> wrote:

> BTW, when I manually run on a node, e.g. compute-0-2, I get this output
>
> ]$ mpirun -np 4 pw.x -i mos2.rlx.in
>
> Program PWSCF v.6.2 starts on 28Mar2019 at 11:40:36
>
> This program is part of the open-source Quantum ESPRESSO suite
> for quantum simulation of materials; please cite
> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
> URL http://www.quantum-espresso.org",
> in publications or presentations arising from this work. More details at
> http://www.quantum-espresso.org/quote
>
> Parallel version (MPI), running on 4 processors
>
> MPI processes distributed on 1 nodes
> R & G space division: proc/nbgrp/npool/nimage = 4
> Reading input from mos2.rlx.in
> Warning: card &CELL ignored
> Warning: card CELL_DYNAMICS = "BFGS" ignored
> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
> Warning: card / ignored
>
> Current dimensions of program PWSCF are:
> Max number of different atomic species (ntypx) = 10
> Max number of k-points (npk) = 40000
> Max angular momentum in pseudopotentials (lmaxx) = 3
> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s) 4S renormalized
>
> Subspace diagonalization in iterative solution of the eigenvalue problem:
> a serial algorithm will be used
>
> Found symmetry operation: I + ( 0.0000 0.1667 0.0000)
> ...
> ...
> ...
>
> Regards,
> Mahmood
>
>
> On Thu, Mar 28, 2019 at 8:23 PM Mahmood Naderan <mahmood...@gmail.com> wrote:
>
>> The run is not consistent. I have manually tested "mpirun -np 4 pw.x -i mos2.rlx.in" on the compute-0-2 and rocks7 nodes and it is fine.
>> However, with the script "srun --pack-group=0 --ntasks=2 : --pack-group=1 --ntasks=4 pw.x -i mos2.rlx.in" I see some errors in the output file, which cause the job to abort after 60 seconds.
>>
>> The errors are about not finding some files. Although the config file uses absolute paths for the intermediate files and the files exist, the errors sound bizarre.
>>
>> compute-0-2
>>   PID USER    PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
>>  3387 ghatee  20   0 1930488 129684  8336 R 100.0  0.2  0:09.71 pw.x
>>  3388 ghatee  20   0 1930476 129700  8336 R  99.7  0.2  0:09.68 pw.x
>>
>> rocks7
>>   PID USER    PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
>>  5592 ghatee  20   0 1930568 127764  8336 R 100.0  0.2  0:17.29 pw.x
>>   549 ghatee  20   0  116844   3652  1804 S   0.0  0.0  0:00.14 bash
>>
>> As you can see, 2 tasks are fine on compute-0-2, but there should be 4 tasks on rocks7.
>> The input file contains
>>     outdir = "/home/ghatee/job/2h-unitcell" ,
>>     pseudo_dir = "/home/ghatee/q-e-qe-5.4/pseudo/" ,
>>
>> The output file says
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work.
>> More details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work.
>> More details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More details at
>> http://www.quantum-espresso.org/quote
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s) 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s) 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s) 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s) 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s) 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s) 4S renormalized
>> ERROR(FoX)
>> Cannot open file
>> ERROR(FoX)
>> Cannot open file
>>
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>> Error in routine read_ncpp (2):
>> pseudo file is empty or wrong
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>
>> stopping ...
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>> ...
>> ...
>> ...
>>
>>
>> Verifying that there are 6 "Parallel version (MPI), running on 1 processors" lines, it seems that it starts normally, as I specified in the Slurm script. However, I suspect that the program is NOT running as a multicore MPI job: it is 6 instances of a serial run, and there may be some races during the run.
>> Any thoughts?
>>
>> Regards,
>> Mahmood
>>
>>
>> On Thu, Mar 28, 2019 at 3:59 PM Frava <fravad...@gmail.com> wrote:
>>
>>> I didn't receive the last mail from Mahmood, but Marcus is right: Mahmood's heterogeneous job submission seems to be working now.
>>> Well, separating each pack in the srun command and asking for the correct number of tasks to be launched for each pack is the way I figured heterogeneous jobs worked with SLURM v18.08.0 (I didn't test it with more recent SLURM versions).
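
To make the layout described above concrete, here is a minimal sketch of such a heterogeneous submission. It assumes the Slurm 17.11/18.08-era syntax (the "packjob" separator and srun's --pack-group option, renamed "hetjob"/"--het-group" in later releases); the partition, account, memory request, pseudo_dir path and the pw.x launch line are taken from the scripts and input file quoted in this thread, while the job name, output file name and the sanity-check srun line are illustrative placeholders only:

#!/bin/bash
# Sketch of a heterogeneous job with the 2 + 4 task split discussed above.
# "packjob" separates the two components (Slurm 17.11/18.08 syntax).
# Job name and output file below are placeholders.
#SBATCH --job-name=myQE_hetjob
#SBATCH --output=qe-hetjob.out
#SBATCH --partition=QUARTZ
#SBATCH --account=z5
#SBATCH --nodes=1 --ntasks=2 --mem-per-cpu=16g
#SBATCH packjob
#SBATCH --nodes=1 --ntasks=4 --mem-per-cpu=16g

# Optional sanity check: print which node each task lands on and verify that
# the pseudo_dir from the input file is visible there.
srun --pack-group=0,1 bash -c \
  'echo "$(hostname): task ${SLURM_PROCID}"; ls /home/ghatee/q-e-qe-5.4/pseudo/ >/dev/null || echo "pseudo_dir not visible on $(hostname)"'

# Launch pw.x across both pack groups, as in the command quoted earlier.
srun --pack-group=0 --ntasks=2 : --pack-group=1 --ntasks=4 pw.x -i mos2.rlx.in

If the two pack groups really join one MPI run, pw.x would be expected to print a single "Parallel version (MPI), running on 6 processors" banner; six separate "running on 1 processors" banners, as in the output above, mean the tasks started as independent serial copies, which is exactly what the repeated startup messages suggest.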