Hi all,

I'm getting errors in MPI_Allreduce when I restart a REMD simulation. It has happened every time I have attempted a REMD restart. I'm posting here to check that there isn't something obviously wrong with the way I'm doing the restart that is causing it.
I restart a REMD run using:
-----------------------------------------------------------------------
basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
status=${basedir}/pshsp_andva_run1_.status
deffnm=${basedir}/pshsp_andva_run1_
cpt=${basedir}/pshsp_andva_run0_.cpt
tpr=${basedir}/pshsp_andva_run0_.tpr
log=${basedir}/pshsp_andva_run1_0.log
n_procs=32

# Only start if this run's log doesn't already exist.
echo "about to check if log file exists"
if [ ! -e "$log" ]; then
    echo "RUNNING" > "$status"
    source /usr/share/modules/init/bash
    module load intel-mpi
    module load intel-mkl
    module load gromacs
    echo "Calling mdrun"
    # Restart run1 from run0's checkpoint across 16 replicas.
    mpirun -np $n_procs mdrun-mpi -maxh 24 -multi 16 -replex 1000 \
        -s "$tpr" -cpi "$cpt" -deffnm "$deffnm"
    retval=$?
    if [ $retval != 0 ]; then
        echo "ERROR" > "$status"
        exit 1
    fi
    echo "FINISHED" > "$status"
fi
exit 0
-----------------------------------------------------------------------
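Since -multi makes mdrun append the replica index to the -s and -cpi base names (the output below shows it opening pshsp_andva_run0_0.tpr through pshsp_andva_run0_15.tpr), one quick sanity check before launching is that all the suffixed files are actually present. A minimal sketch, assuming the run0 checkpoints follow the same _0 to _15 naming as the .tpr files:
-----------------------------------------------------------------------
# Verify every per-replica .tpr and .cpt exists before calling mdrun.
# Assumption: mdrun suffixes the -cpi base the same way it suffixes -s.
prefix=${basedir}/pshsp_andva_run0_
for i in $(seq 0 15); do
    for ext in tpr cpt; do
        if [ ! -e "${prefix}${i}.${ext}" ]; then
            echo "missing ${prefix}${i}.${ext}" >&2
            exit 1
        fi
    done
done
-----------------------------------------------------------------------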
mdrun then hangs, producing no output, until it is terminated by the queuing system. Upon termination the following is written to stderr:
-----------------------------------------------------------------------
[cli_5]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x2b379c00b770, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator

[every odd-numbered rank, cli_1 through cli_31, prints the same "Invalid communicator / Null communicator" error, interleaved; only the rbuf address differs]

Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_3.tpr, VERSION 4.5.4 (single precision)

[matching "Reading file" lines appear for all 16 replicas, pshsp_andva_run0_0.tpr through pshsp_andva_run0_15.tpr, all VERSION 4.5.4 (single precision)]

Terminated
-----------------------------------------------------------------------
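One thing I notice in the trace: the failing Allreduce has count=16, which matches the number of replicas, and it is executed on MPI_COMM_NULL by exactly the ranks that are not the first rank of their simulation. So it looks as though some inter-simulation consistency check is being reached by ranks that have no inter-simulation communicator, but I don't know whether that is a symptom of my restart procedure or of something else.

In case it's useful, this is how I would inspect one of the checkpoints to confirm the restart point looks sane -- a minimal sketch, assuming gmxdump from the same 4.5.4 install accepts -cp for checkpoint files (I believe it does) and the run0_0.cpt naming from above:
-----------------------------------------------------------------------
# Dump the header of one replica's checkpoint (step, time, etc.) so the
# restart point can be eyeballed before relaunching the whole multi-run.
source /usr/share/modules/init/bash
module load gromacs
basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
gmxdump -cp ${basedir}/pshsp_andva_run0_0.cpt 2>&1 | head -n 30
-----------------------------------------------------------------------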
Cheers,
Ben