On 30/01/2011 10:26 AM, [email protected] wrote:
-----Original Message-----
From: [email protected] [mailto:gmx-users-
[email protected]] On Behalf Of Mark Abraham
Sent: 29 January 2011 08:24
To: Discussion list for GROMACS users
Subject: Re: [gmx-users] Simulation time losses with REMD
On 28/01/2011 4:46 PM, Mark Abraham wrote:
Hi,
I compared the .log file time accounting for the same .tpr file run
alone in serial and as part of an REMD simulation (with each replica on
a single processor). It ran about 5-10% slower in the latter. The effect
was somewhat larger when comparing the same .tpr on 8 processors against
REMD with 8 processors per replica, and it seems fairly independent of
whether I compare the lowest or highest replica.
OK, I found the issue by binary-searching the code for the offending
line. It's in compute_globals() in src/kernel/md.c: the call to
gmx_sum_sim consumes all the extra time. This code synchronizes signals
between the simulations, e.g. so that checkpointing can be coordinated.
    if (MULTISIM(cr) && bInterSimGS)
    {
        if (MASTER(cr))
        {
            /* Communicate the signals between the simulations */
            gmx_sum_sim(eglsNR, gs_buf, cr->ms);
        }
        /* Communicate the signals from the master to the others */
        gmx_bcast(eglsNR*sizeof(gs_buf[0]), gs_buf, cr);
    }
This eventually calls
void gmx_sumf_comm(int nr, float r[], MPI_Comm mpi_comm)
{
#if defined(MPI_IN_PLACE_EXISTS) || defined(GMX_THREADS)
    MPI_Allreduce(MPI_IN_PLACE, r, nr, MPI_FLOAT, MPI_SUM, mpi_comm);
#else
    /* this function is only used in code that is not performance
       critical (during setup, when comm_rec is not the appropriate
       communication structure), so this isn't as bad as it looks. */
    float *buf;
    int    i;

    snew(buf, nr);
    MPI_Allreduce(r, buf, nr, MPI_FLOAT, MPI_SUM, mpi_comm);
    for (i = 0; i < nr; i++)
        r[i] = buf[i];
    sfree(buf);
#endif
}
Clearly that comment is out of date. My settings are nstlist=5,
repl_ex_nst=2500 and nstcalcenergy=-1, which gives gs.nstms=5, so
bInterSimGS is TRUE every 5 steps. I'm not sure whether the problem lies
with nstlist, with the multi-simulation checkpointing machinery, or
elsewhere.
Mark
So are you saying that this code itself is slow (and called frequently),
or that it is exposing the latency of synchronizing the replicas? If the
latter, then presumably commenting it out (or adjusting nstlist, or
whatever) would just shift that latency to the REMD exchange call
itself?
(I'll check my own example in due course, but our systems happen to be down
this weekend.)
I've already controlled for the REMD cost and latency; the question is
what is causing the extra delay.
I've worked out what the issue is, and I'll move this thread to a
Redmine issue - http://redmine.gromacs.org/issues/691
Mark
--
gmx-users mailing list [email protected]
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to [email protected].
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists