Hi,

Firstly, you're not using the latest version; there may well be a fix for your issue in the 4.5.5 patch release, so upgrading is the first thing to try.
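For reference, here is a quick way you could double-check which GROMACS version the "gromacs" module on your cluster actually provides (just a sketch reusing the module names from your script below; the log file name at the end is a placeholder, since the header of any mdrun log also reports the version):

    # load the same environment your job script uses
    source /usr/share/modules/init/bash
    module load gromacs
    # most 4.5.x builds print their version and build info with -version
    mdrun-mpi -version 2>&1 | grep -i "version"
    # alternatively, read it from the header of a log written by an earlier run
    grep -m 1 "VERSION" <previous_mdrun_log>.log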
Secondly, you should check the http://redmine.gromacs.org bug tracker to see which bugs have been fixed in 4.5.5 (the "Target version" field on each issue should tell you whether the fix made it into 4.5.5). You can also simply search for REMD and see which matching bugs, open or closed, are in the database:
http://redmine.gromacs.org/search/index/gromacs?issues=1&q=REMD

Cheers,
--
Szilárd


On Tue, Oct 25, 2011 at 8:04 PM, Ben Reynwar <b...@reynwar.net> wrote:
> Hi all,
>
> I'm getting errors in MPI_Allreduce when I restart an REMD simulation.
> It has occurred every time I have attempted an REMD restart.
> I'm posting here to check that there isn't something obviously wrong with
> the way I'm doing the restart that is causing it.
>
> I restart an REMD run using:
>
> ---------------------------------------------------------------------------
> basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
> status=${basedir}/pshsp_andva_run1_.status
> deffnm=${basedir}/pshsp_andva_run1_
> cpt=${basedir}/pshsp_andva_run0_.cpt
> tpr=${basedir}/pshsp_andva_run0_.tpr
> log=${basedir}/pshsp_andva_run1_0.log
> n_procs=32
>
> echo "about to check if log file exists"
> if [ ! -e $log ]; then
>     echo "RUNNING" > $status
>     source /usr/share/modules/init/bash
>     module load intel-mpi
>     module load intel-mkl
>     module load gromacs
>     echo "Calling mdrun"
>     mpirun -np 32 mdrun-mpi -maxh 24 -multi 16 -replex 1000 -s $tpr -cpi $cpt -deffnm $deffnm
>     retval=$?
>     if [ $retval != 0 ]; then
>         echo "ERROR" > $status
>         exit 1
>     fi
>     echo "FINISHED" > $status
> fi
> exit 0
> ---------------------------------------------------------------------------
>
> mdrun then gets stuck and doesn't output anything until it is
> terminated by the queuing system.
> Upon termination, the following output is written to stderr.
>
> [cli_5]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x2b379c00b770, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_31]: [cli_11]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f489806bf60, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fd960002fc0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_7]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754400, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_9]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x757230, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_27]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fb3cc02a450, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_23]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x750970, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_21]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7007b0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_3]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754360, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_29]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x756460, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_19]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f60a0066850, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_17]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f4bdc07b690, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_1]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_15]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fc31407c830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_25]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6e1830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_13]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6c2430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_3.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_0.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_7.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_6.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_1.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_4.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_5.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_2.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_11.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_9.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_8.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_10.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_15.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_13.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_12.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_14.tpr, VERSION 4.5.4 (single precision)
> Terminated
>
> Cheers,
> Ben