[...] pingpong latencies, no discernible bandwidth improvements).
- Jonathan
--
Jonathan Dursi SciNet, Compute/Calcul Canada
i)
end do
[...]
When I am reading the data in again and print them out, I always have:
buf(0)=0
If you compile your code with -check bounds and run, you'll get an error
pointing out that buf(0) is an illegal access; in Fortran, arrays start at 1 by default.
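Something like this minimal sketch (my own, not the poster's code) is the
sort of access that -check bounds catches at runtime:

program bounds_demo
    implicit none
    integer, parameter :: n = 10
    double precision :: buf(n)   ! valid indices are buf(1)..buf(n)
    integer :: i
    buf = 0.0d0
    i = 0
    print *, buf(i)              ! out-of-bounds read; flagged by -check bounds
end program bounds_demo

Compiling with ifort -check bounds and running aborts with a
subscript-out-of-range error at the buf(i) reference.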
- Jonathan
--
Jonathan Dursi | SciNet, Compute/Calcul Canada
For what it's worth, 1.4.4 built with the Intel 12.1.0.233 compilers has been
the default MPI at our centre for over a month, and we haven't had any
problems...
- Jonathan
--
Jonathan Dursi; SciNet, Compute/Calcul Canada
-----Original Message-----
From: Richard Walsh
[...] or replace the allgatherv with an allgather.
- Jonathan
--
Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca
On 23 May 9:37PM, Jonathan Dursi wrote:
On the other hand, it works everywhere if I pad the rcounts array with
an extra valid value (0 or 1, or for that matter 783), or replace the
allgatherv with an allgather.
... and it fails with 7 even where it worked (but succeeds with 8) if I
pad [...]
It seems like this might also be an issue for gatherv and reduce_scatter.
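For reference, a minimal sketch (my reconstruction, not the original test
case) of the shape of the call in question: each rank contributes one
integer, with rcounts and displs sized exactly to the communicator:

program allgatherv_demo
    use mpi
    implicit none
    integer :: ierr, rank, nprocs, i
    integer :: sendbuf(1)
    integer, allocatable :: recvbuf(:), rcounts(:), displs(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    allocate(recvbuf(nprocs), rcounts(nprocs), displs(nprocs))
    sendbuf(1) = rank
    rcounts = 1                          ! one element from each rank
    displs  = (/ (i-1, i=1,nprocs) /)    ! contiguous placement

    call MPI_Allgatherv(sendbuf, 1, MPI_INTEGER, &
                        recvbuf, rcounts, displs, MPI_INTEGER, &
                        MPI_COMM_WORLD, ierr)

    if (rank == 0) print *, recvbuf
    call MPI_Finalize(ierr)
end program allgatherv_demo

The workaround described above amounts to allocating rcounts (and displs)
one element longer than nprocs, or switching to MPI_Allgather, since every
count here is the same anyway.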
- Jonathan
--
Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca
[...] some other kind
Chunk 1/2: Trying 524288 of 256xdouble, chunked, 1073741824 bytes: successfully read 524288
Chunk 2/2: Trying 524289 of 256xdouble, chunked, 1073743872 bytes: successfully read 524289
- Jonathan
--
Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca
[...] mpirun -np 6 -mca btl self,tcp ./diffusion-mpi
never gives any problems.
Any suggestions? I notice a mention of `improved flow control in sm' in
the ChangeLog for 1.3.3; is that a significant bugfix?
- Jonathan
--
Jonathan Dursi
program diffuse
implicit none
[...]
[...] the program 10 times, it will be successful 9 or so times. But the
hangs still occur.
- Jonathan
--
Jonathan Dursi
[...] significantly less frequently; it hangs one time out of every ten or
so. But this is obviously still far too often to deploy in a production
environment.
Where should we be looking to track down this problem?
- Jonathan
--
Jonathan Dursi
config.log.gz
Description: GNU Zip compressed data
[...] randomly) with 1.3.3,
mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
or
mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
always succeeds, with (as one might guess) the second being much faster...
Jonathan
--
Jonathan Dursi
[...]
if (rightneighbour .eq. nprocs) then
   rightneighbour = 0
endif

to

if (leftneighbour .eq. -1) then
   leftneighbour = MPI_PROC_NULL
endif
if (rightneighbour .eq. nprocs) then
   rightneighbour = MPI_PROC_NULL
endif
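The reason this helps: sends and receives involving MPI_PROC_NULL return
immediately with no effect, so the boundary ranks need no special-casing.
A minimal sketch (hypothetical names, not the actual diffusion-mpi source):

program proc_null_demo
    use mpi
    implicit none
    integer :: ierr, rank, nprocs, leftneighbour, rightneighbour
    double precision :: sendval, recvval

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    leftneighbour  = rank - 1
    rightneighbour = rank + 1
    if (leftneighbour .eq. -1)      leftneighbour  = MPI_PROC_NULL
    if (rightneighbour .eq. nprocs) rightneighbour = MPI_PROC_NULL

    sendval = dble(rank)
    ! communication with MPI_PROC_NULL is a no-op, so this same call
    ! is safe on the edge ranks as well
    call MPI_Sendrecv(sendval, 1, MPI_DOUBLE_PRECISION, rightneighbour, 0, &
                      recvval, 1, MPI_DOUBLE_PRECISION, leftneighbour, 0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    call MPI_Finalize(ierr)
end program proc_null_demo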
On Sep 21, 2009, at 5:09 PM, Jonathan Dursi wrote:
Continuing the conversation with myself [...]
[...] ~1% of certain single-node jobs hang; turning off sm or setting
num_fifos to NP-1 eliminates this.
- Jonathan
--
Jonathan Dursi
[...] OpenMPI isn't ready for
real production use on our system.
- Jonathan
On 2009-09-24, at 4:16PM, Eugene Loh wrote:
Jonathan Dursi wrote:
So to summarize:
OpenMPI 1.3.2 + gcc 4.4.0
Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:
Default always [...]
[...] versions of OpenMPI (1.3.2 and
1.3.3). It was working correctly with OpenMPI version 1.2.7.
[...]
GCC version:
$ mpicc --version
gcc (Ubuntu 4.4.1-4ubuntu7) 4.4.1
Does it work if you turn off the shared memory transport layer; that is,
mpirun -n 6 -mca btl ^sm ./testmpi
?
- Jonathan
--
Jonathan Dursi
[...] OpenMPI shared memory transport with
gcc 4.4.x.
Jonathan
--
Jonathan Dursi
[...] performance by packing the little messages into fewer, larger messages.
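The idea, as a minimal sketch (hypothetical names and sizes, assuming the
small messages share one destination): one send of many values instead of
many sends of one value each, so the per-message latency is paid once:

program pack_demo
    use mpi
    implicit none
    integer, parameter :: nmsg = 100
    integer :: ierr, rank, i
    double precision :: vals(nmsg)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    if (rank == 0) then
        do i = 1, nmsg
            vals(i) = dble(i)
        end do
        ! one message of nmsg doubles instead of nmsg one-double messages
        call MPI_Send(vals, nmsg, MPI_DOUBLE_PRECISION, 1, 0, &
                      MPI_COMM_WORLD, ierr)
    else if (rank == 1) then
        call MPI_Recv(vals, nmsg, MPI_DOUBLE_PRECISION, 0, 0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    end if
    call MPI_Finalize(ierr)
end program pack_demo

Run with at least two ranks; for non-contiguous data, a derived datatype or
MPI_Pack accomplishes the same aggregation without the manual copy.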
Jonathan
--
Jonathan Dursi
- Jonathan
--
Jonathan Dursi
[...] the default set of
parameters. It's also unclear whether this issue occurred with earlier
OpenMPI versions.
Where should I start looking to find out what is going on? Are there
parameters that can be adjusted to play with timeouts to see if the issue can
be localized, or worked around?
- Jonathan
--
Jonathan Dursi
> [...]ct requests.
I'm certainly willing to try it.
- Jonathan
--
Jonathan Dursi