On 26/10/2011 6:06 AM, Szilárd Páll wrote:
Hi,

Firstly, you're not using the latest version and there might have been
a fix for your issue in the 4.5.5 patch release.

There was a bug in 4.5.5, not present in 4.5.4, that could have produced such symptoms, but it was fixed without a Redmine issue being created.

Secondly, you should check the http://redmine.gromacs.org bug tracker
to see which bugs have been fixed in 4.5.5 (ideally the target-version
field of each issue should tell you). You can also just search for REMD
and see what matching bugs (open or closed) are in the database:
http://redmine.gromacs.org/search/index/gromacs?issues=1&q=REMD

If the OP is right and this was with 4.5.4 and can be reproduced with 4.5.5, please do some testing (e.g. Do different parallel regimes produce the same symptoms? Can the individual replicas run in a non-REMD simulation?) and file a Redmine issue with your observations and a small sample case.
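For example, something along these lines (just a sketch; the exact file names follow your own naming scheme, and the per-replica checkpoint names are assumed to match the per-replica .tpr names shown in your output):

# Same restart, but with a different parallel regime (one rank per replica):
mpirun -np 16 mdrun-mpi -maxh 24 -multi 16 -replex 1000 \
    -s pshsp_andva_run0_.tpr -cpi pshsp_andva_run0_.cpt -deffnm pshsp_andva_run1_

# Restart a single replica on its own, without -multi/-replex:
mpirun -np 2 mdrun-mpi -maxh 24 \
    -s pshsp_andva_run0_3.tpr -cpi pshsp_andva_run0_3.cpt -deffnm pshsp_andva_run1_3

If those behave normally but the 32-rank REMD restart still hangs, that is useful information to include in the Redmine issue.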

Mark


Cheers,
--
Szilárd



On Tue, Oct 25, 2011 at 8:04 PM, Ben Reynwar <b...@reynwar.net> wrote:
Hi all,

I'm getting errors in MPI_Allreduce when I restart an REMD simulation.
It has occurred every time I have attempted an REMD restart.
I'm posting here to check that there's nothing obviously wrong with
the way I'm doing the restart that could be causing it.

I restart an REMD run using:

-----------------------------------------------------------------------------------------------------------------------------------------
basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
status=${basedir}/pshsp_andva_run1_.status
deffnm=${basedir}/pshsp_andva_run1_
cpt=${basedir}/pshsp_andva_run0_.cpt
tpr=${basedir}/pshsp_andva_run0_.tpr
log=${basedir}/pshsp_andva_run1_0.log
n_procs=32

echo "about to check if log file exists"
# Only launch if the run has not already produced a log file.
if [ ! -e $log ]; then
    echo "RUNNING" > $status
    source /usr/share/modules/init/bash
    module load intel-mpi
    module load intel-mkl
    module load gromacs
    echo "Calling mdrun"
    # Continue the 16-replica REMD run from the run0 checkpoints.
    mpirun -np $n_procs mdrun-mpi -maxh 24 -multi 16 -replex 1000 \
        -s $tpr -cpi $cpt -deffnm $deffnm
    retval=$?
    if [ $retval -ne 0 ]; then
        echo "ERROR" > $status
        exit 1
    fi
    echo "FINISHED" > $status
fi
exit 0
------------------------------------------------------------------------------------------------------------------------------------------
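(With -multi 16, mdrun expands the -s/-cpi/-deffnm base names per replica, as the "Reading file ..._0.tpr" lines below show. One quick sanity check before launching, assuming -cpi is expanded per replica the same way -s is, would be:

for i in $(seq 0 15); do
    # Warn if any per-replica checkpoint file is missing.
    [ -e ${basedir}/pshsp_andva_run0_${i}.cpt ] || echo "checkpoint $i missing"
done
)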

mdrun then gets stuck and doesn't output anything until it is
terminated by the queuing system.
Upon termination the following output is written to stderr.

[cli_5]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x2b379c00b770, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_31]: [cli_11]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f489806bf60, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fd960002fc0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_7]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754400, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_9]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x757230, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_27]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fb3cc02a450, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_23]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x750970, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_21]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7007b0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_3]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754360, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_29]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x756460, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_19]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f60a0066850, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_17]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f4bdc07b690, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_1]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_15]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fc31407c830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_25]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6e1830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_13]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6c2430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_3.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_0.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_7.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_6.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_1.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_4.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_5.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_2.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_11.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_9.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_8.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_10.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_15.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_13.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_12.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_14.tpr, VERSION 4.5.4 (single precision)
Terminated

Cheers,
Ben