[OMPI users] Bug in OMPI 1.0.1 using MPI_Recv with indexed datatypes
Hello,

I seem to have encountered a bug in Open MPI 1.0.1 using indexed datatypes with MPI_Recv (which seems to be of the "off by one" sort). I have attached a test case, which is briefly explained below (as well as in the source file). This case should run on two processes.

I observed the bug on 2 different Linux systems (single-processor Centrino under Suse 10.0 with gcc 4.0.2, dual-processor Xeon under Debian Sarge with gcc 3.4) with Open MPI 1.0.1, and do not observe it using LAM 7.1.1 or MPICH2.

Here is a summary of the case:

--

Each processor reads a file ("data_p0" or "data_p1") giving a list of global element ids. Some elements (vertices from a partitioned mesh) may belong to both processors, so their ids may appear on both processors: we have 7178 global vertices, 3654 and 3688 of them being known by ranks 0 and 1 respectively.

In this simplified version, we assign coordinates {x, y, z} to each vertex equal to its global id number for rank 1, and the negative of that for rank 0 (assigning the same values to x, y, and z). After finishing the "ordered gather", rank 0 prints the global id and coordinates of each vertex.

Lines should print (for example) as:

  6456 ; 6455.0 6455.0 6456.0
  6457 ; -6457.0 -6457.0 -6457.0

depending on whether a vertex belongs only to rank 0 (negative coordinates) or belongs to rank 1 (positive coordinates).

With the Open MPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on Debian Sarge with gcc 3.4), we have for example for the last vertices:

  7176 ; 7175.0 7175.0 7176.0
  7177 ; 7176.0 7176.0 7177.0

seeming to indicate an "off by one" type bug in datatype handling.

Not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE in the gather_test.c file), the bug disappears. Using the indexed datatype with LAM MPI 7.1.1 or MPICH2, we do not reproduce the bug either, so it does seem to be an Open MPI issue.

--

Best regards,

  Yvan Fournier

Attachment: ompi_datatype_bug.tar.gz (application/compressed-tar)
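For reference, the receive side of such an "ordered gather" with an indexed datatype looks roughly like the sketch below. This is only a minimal illustration of the technique, not the attached gather_test.c: the helper name, the 0-based ids and the 3-doubles-per-vertex layout are assumptions.

/* Rank 0 receives each remote vertex's {x, y, z} directly into its slot
 * of a global coordinate array through an MPI indexed datatype. */

#include <stdlib.h>
#include <mpi.h>

static void
recv_coords_indexed(double     *g_coords,   /* size 3 * n_global_vertices */
                    const int  *recv_ids,   /* 0-based global ids sent by src */
                    int         n_recv,
                    int         src_rank)
{
  int i;
  MPI_Datatype recv_type;
  MPI_Status status;

  int *blocklengths  = malloc(n_recv * sizeof(int));
  int *displacements = malloc(n_recv * sizeof(int));

  for (i = 0; i < n_recv; i++) {
    blocklengths[i]  = 3;                 /* x, y, z */
    displacements[i] = recv_ids[i] * 3;   /* in units of MPI_DOUBLE */
  }

  MPI_Type_indexed(n_recv, blocklengths, displacements, MPI_DOUBLE, &recv_type);
  MPI_Type_commit(&recv_type);

  /* A single receive scatters the sender's contiguous buffer into place. */
  MPI_Recv(g_coords, 1, recv_type, src_rank, 0, MPI_COMM_WORLD, &status);

  MPI_Type_free(&recv_type);
  free(blocklengths);
  free(displacements);
}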
[OMPI users] False positives and even failure with Open MPI and memchecker
Hello,

I have observed what seem to be false positives running under Valgrind when Open MPI is built with --enable-memchecker (at least with versions 1.10.4 and 2.0.1).

Attached is a simple test case (extracted from a larger code) that sends one int to rank r+1, and receives from rank r-1 (using MPI_PROC_NULL to handle ranks below 0 or above comm size).

Using:

  ~/opt/openmpi-2.0/bin/mpicc -DVARIANT_1 vg_mpi.c
  ~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out

I get the following Valgrind error for rank 1:

  ==8382== Invalid read of size 4
  ==8382==    at 0x400A00: main (in /home/yvan/test/a.out)
  ==8382==  Address 0xffefffe70 is on thread 1's stack
  ==8382==  in frame #0, created by main (???:)

Using:

  ~/opt/openmpi-2.0/bin/mpicc -DVARIANT_2 vg_mpi.c
  ~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out

I get the following Valgrind error for rank 1:

  ==8322== Invalid read of size 4
  ==8322==    at 0x400A6C: main (in /home/yvan/test/a.out)
  ==8322==  Address 0xcb6f9a0 is 0 bytes inside a block of size 4 alloc'd
  ==8322==    at 0x4C29BBE: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==8322==    by 0x400998: main (in /home/yvan/test/a.out)

I get no error for the default variant (no -DVARIANT_...) with either Open MPI 2.0.1 or 1.10.4, but do get an error similar to variant 1 in the parent code from which the example (given below) was extracted. Running under Valgrind's gdb server, for the parent code of variant 1, it even seems the value received on rank 1 is uninitialized, and then Valgrind complains with the given message.

The code fails to work as intended when run under Valgrind when Open MPI is built with --enable-memchecker, while it works fine when run with the same build but not under Valgrind, or when run under Valgrind with Open MPI built without memchecker.

I'm running under Arch Linux (whose packaged Open MPI 1.10.4 is built with memchecker enabled, rendering it unusable under Valgrind).

Did anybody else encounter this type of issue, or does my code contain an obvious mistake that I am missing? I initially thought of possible alignment issues, but saw nothing in the standard that requires that, and the "malloc"-based variant exhibits the same behavior, while I assume alignment to 64 bits for allocated arrays is the default.
Best regards,

  Yvan Fournier

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
  MPI_Status status;

  int l = 5, l_prev = 0;
  int rank_next = MPI_PROC_NULL, rank_prev = MPI_PROC_NULL;
  int rank_id = 0, n_ranks = 1, tag = 1;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

  if (rank_id > 0)
    rank_prev = rank_id - 1;
  if (rank_id + 1 < n_ranks)
    rank_next = rank_id + 1;

#if defined(VARIANT_1)

  int sendbuf[1] = {l};
  int recvbuf[1] = {0};

  if (rank_id % 2 == 0) {
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
  }
  else {
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
  }

  l_prev = recvbuf[0];

#elif defined(VARIANT_2)

  int *sendbuf = malloc(sizeof(int));
  int *recvbuf = malloc(sizeof(int));

  sendbuf[0] = l;

  if (rank_id % 2 == 0) {
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
  }
  else {
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
  }

  l_prev = recvbuf[0];

#else

  if (rank_id % 2 == 0) {
    MPI_Send(&l, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(&l_prev, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
  }
  else {
    MPI_Recv(&l_prev, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&l, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
  }

#endif

  printf("r%d, l=%d\n");

  MPI_Finalize();

  exit(0);
}
Re: [OMPI users] False positives and even failure with OpenMPI and memchecker
Hello,

Yes, as I had hinted in my message, I observed the bug in an irregular manner. Glad to see it could be fixed so quickly (it affects 2.0 too). I had observed it for some time, but only recently took the time to make a proper simplified case and investigate. Guess I should have submitted the issue sooner...

Best regards,

  Yvan Fournier

> Message: 5
> Date: Sat, 5 Nov 2016 22:08:32 +0900
> From: Gilles Gouaillardet
> To: Open MPI Users
> Subject: Re: [OMPI users] False positives and even failure with Open
>         MPI and memchecker
> Message-ID:
> Content-Type: text/plain; charset=UTF-8
>
> that really looks like a bug
>
> if you rewrite your program with
>
>   MPI_Sendrecv(&l, 1, MPI_INT, rank_next, tag, &l_prev, 1, MPI_INT,
>                rank_prev, tag, MPI_COMM_WORLD, &status);
>
> or even
>
>   MPI_Irecv(&l_prev, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &req);
>   MPI_Send(&l, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
>   MPI_Wait(&req, &status);
>
> then there is no more valgrind warning
>
> iirc, Open MPI marks the receive buffer as invalid memory, so it can
> check that only the MPI subroutine updates it. it looks like a step is
> missing in the case of MPI_Recv()
>
> Cheers,
>
> Gilles
>
> On Sat, Nov 5, 2016 at 9:48 PM, Gilles Gouaillardet wrote:
> > Hi,
> >
> > note your printf line is missing.
> > if you printf l_prev, then the valgrind error occurs in all variants
> >
> > at first glance, it looks like a false positive, and i will investigate it
> >
> > Cheers,
> >
> > Gilles
> >
> > On Sat, Nov 5, 2016 at 7:59 PM, Yvan Fournier wrote:
> > > Hello,
> > >
> > > I have observed what seem to be false positives running under Valgrind
> > > when Open MPI is built with --enable-memchecker
> > > (at least with versions 1.10.4 and 2.0.1).
> > >
> > > Attached is a simple test case (extracted from a larger code) that sends
> > > one int to rank r+1, and receives from rank r-1
> > > (using MPI_PROC_NULL to handle ranks below 0 or above comm size).
> > >
> > > Using:
> > >
> > >   ~/opt/openmpi-2.0/bin/mpicc -DVARIANT_1 vg_mpi.c
> > >   ~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out
> > >
> > > I get the following Valgrind error for rank 1:
> > >
> > >   ==8382== Invalid read of size 4
> > >   ==8382==    at 0x400A00: main (in /home/yvan/test/a.out)
> > >   ==8382==  Address 0xffefffe70 is on thread 1's stack
> > >   ==8382==  in frame #0, created by main (???:)
> > >
> > > Using:
> > >
> > >   ~/opt/openmpi-2.0/bin/mpicc -DVARIANT_2 vg_mpi.c
> > >   ~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out
> > >
> > > I get the following Valgrind error for rank 1:
> > >
> > >   ==8322== Invalid read of size 4
> > >   ==8322==    at 0x400A6C: main (in /home/yvan/test/a.out)
> > >   ==8322==  Address 0xcb6f9a0 is 0 bytes inside a block of size 4 alloc'd
> > >   ==8322==    at 0x4C29BBE: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> > >   ==8322==    by 0x400998: main (in /home/yvan/test/a.out)
> > >
> > > I get no error for the default variant (no -DVARIANT_...) with either Open
> > > MPI 2.0.1 or 1.10.4, but do get an error similar to variant 1 in the parent
> > > code from which the example was extracted, which is given below.
> > > Running under Valgrind's gdb server, for the parent code of variant 1,
> > > it even seems the value received on rank 1 is uninitialized, and then
> > > Valgrind complains with the given message.
> > >
> > > The code fails to work as intended when run under Valgrind when Open MPI is
> > > built with --enable-memchecker, while it works fine when run with the same
> > > build but not under Valgrind, or when run under Valgrind with Open MPI built
> > > without memchecker.
> > >
> > > I'm running under Arch Linux (whose packaged Open MPI 1.10.4 is built
> > > with memchecker enabled, rendering it unusable under Valgrind).
> > >
> > > Did anybody else encounter this type of issue, or does my code contain
> > > an obvious mistake that I am missing?
> > > I initially thought of possible alignment issues, but saw nothing in the
> > > standard that requires that [...]
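For reference, a self-contained variant of the test using the MPI_Sendrecv rewrite suggested above could look like the sketch below (same neighbour setup as the original vg_mpi.c; this is an illustrative sketch, not the original attachment):

#include <stdio.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
  MPI_Status status;

  int l = 5, l_prev = 0;
  int rank_next = MPI_PROC_NULL, rank_prev = MPI_PROC_NULL;
  int rank_id = 0, n_ranks = 1, tag = 1;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

  if (rank_id > 0)
    rank_prev = rank_id - 1;
  if (rank_id + 1 < n_ranks)
    rank_next = rank_id + 1;

  /* Combined send/receive avoids the manual even/odd ordering and,
     per the reply above, does not trigger the Valgrind warning. */
  MPI_Sendrecv(&l, 1, MPI_INT, rank_next, tag,
               &l_prev, 1, MPI_INT, rank_prev, tag,
               MPI_COMM_WORLD, &status);

  printf("r%d, l=%d\n", rank_id, l_prev);

  MPI_Finalize();

  return 0;
}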
Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues
Hello,

I am not sure your issues are related, and I have not tested this version of ICS, but I have actually had issues with an Intel compiler build of Open MPI 1.4.3 on a cluster using Westmere processors and Infiniband (Qlogic), using a Debian distribution, with our in-house code (www.code-saturne.org). I am not sure which version of the Intel compiler was used by the administrators though, as both versions 11.? and 12.0 are available.

On the same machine, using environment modules, I can run the code compiled with Intel compilers 11 and Open MPI compiled with GCC 4.4 without issues, but if I switch to the Intel-compiled Open MPI build, I have issues in some functions of the code, including MPI-IO. I did use the wrappers, and they probably have some interaction with environment modules...

I have not investigated further so far: the code is quite complex, environment modules are used, and I am not sure of all the details of the machine. But as you seem to have issues with an Intel compiler, it may be useful to bring this up. (The code is open-source, and I could provide a small test case to anyone interested in testing, but the only thing I am relatively confident of in this case is the code itself: we have had our share of bugs and experience in debugging, and the few times we have had issues similar to this, either bugs in the MPI libraries or conflicts between multiple compilers or libraries have been the origin.)

Best regards,

> Message: 1
> Date: Tue, 3 Jan 2012 16:23:02 +
> From: Richard Walsh
> Subject: Re: [OMPI users] Latest Intel Compilers (ICS, version
>         12.1.0.233 Build 20110811) issues ...
> To: "ljdu...@scinet.utoronto.ca" , Open MPI Users
> Message-ID:
>         <762096c11c5a044a9d92961c2e1a7ce8192a4...@mbox1.flas.csi.cuny.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> Jonathan/All,
>
> Thanks for the information, but I continue to have problems. I dropped the
> 'openib' option to simplify things and focused my attention only on OpenMPI
> version 1.4.4 because you suggested it works.
>
> On the strength of the fact that the PGI 11.10 compiler works fine (all
> systems and all versions of OpenMPI), I ran a PGI build of 1.4.4 with the
> '-showme' option (Intel fails immediately, even with '-showme' ... ). I then
> substituted all the PGI-related strings with Intel-related strings to compile
> directly and explicitly outside the 'opal' wrapper using code and libraries
> in the Intel build tree of 1.4.4, as follows:
>
>   pgcc -o ./hw2.exe hw2.c -I/share/apps/openmpi-pgi/1.4.4/include
>     -L/share/apps/openmpi-pgi/1.4.4/lib -lmpi -lopen-rte -lopen-pal -ldl
>     -Wl,--export-dynamic -lnsl -lutil -ldl
>
> becomes ...
>
>   icc -o ./hw2.exe hw2.c -I/share/apps/openmpi-intel/1.4.4/include
>     -L/share/apps/openmpi-intel/1.4.4/lib -lmpi -lopen-rte -lopen-pal -ldl
>     -Wl,--export-dynamic -lnsl -lutil -ldl
>
> Interestingly, this direct-explicit Intel compile >>WORKS FINE<< (no segment
> fault like with the wrapped version) and the executable produced also
> >>RUNS FINE<<. So ... it looks to me like there is something wrong with using
> the 'opal' wrapper generated-used in the Intel build.
>
> Can someone make a suggestion ... ?? I would like to use the wrappers of
> course.
>
> Thanks,
>
> rbw
>
> Richard Walsh
> Parallel Applications and Systems Manager
> CUNY HPC Center, Staten Island, NY
> W: 718-982-3319
> M: 612-382-4620
>
> Right, as the world goes, is only in question between equals in power, while
> the strong do what they can and the weak suffer what they must.
>    -- Thucydides, 400 BC
>
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of
>       Jonathan Dursi [ljdu...@scinet.utoronto.ca]
> Sent: Tuesday, December 20, 2011 4:48 PM
> To: Open Users
> Subject: Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233
>         Build 20110811) issues ...
>
> For what it's worth, 1.4.4 built with the intel 12.1.0.233 compilers has been
> the default mpi at our centre for over a month and we haven't had any
> problems...
>
>    - jonathan
> --
> Jonathan Dursi; SciNet, Compute/Calcul Canada
>
> -----Original Message-----
> From: Richard Walsh
> Sender: users-boun...@open-mpi.org
> Date: Tue, 20 Dec 2011 21:14:44
> To: Open MPI Users
> Reply-To: Open MPI Users
> Subject: Re: [OMPI users] Latest Intel Compilers (ICS,
>         version 12.1.0.233 Build 20110811) issues ...
>
> All,
>
> I have not heard anything back on the inquiry below, so I take it
> that no one has had any issues with Intel's latest compiler release,
> or perhaps has not tried it yet.
>
> Thanks,
>
> rbw
>
> Richard Walsh
> Parallel Applications and Systems Manager
> CUNY HPC Center, Staten Island, NY
> W: 718-982-3319
> M: 612-382-4620
>
> Right, as the world goes, is only in question between equals in power, while
> the strong do what they can and the weak suffer what they must.
Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
On Jul 10, 2012, at 7:31 AM, Dugenoux Albert wrote:
>
> Hi.
>
> I have recently built a cluster upon a Dell PowerEdge Server with a Debian
> 6.0 OS. This server is composed of 4 system boards of 2 processors of
> hexacores. So it gives 12 cores per system board.
> The boards are linked with a local Gbit switch.
>
> In order to parallelize the software Code Saturne, which is a CFD solver, I
> have configured the cluster such that there are a pbs server/mom on 1 system
> board and 3 mom on the 3 other cards. So this leads to 48 cores dispatched
> on 4 nodes of 12 CPUs. Code Saturne is compiled with the openmpi 1.6 version.
>
> When I launch a simulation using 2 nodes with 12 cores, elapsed time is good
> and network traffic is not full. But when I launch the same simulation using
> 3 nodes with 8 cores, elapsed time is 5 times the previous one. In both
> cases, I use 24 cores and the network seems not to be saturated.
>
> I have tested several configurations: binaries in a local file system or on
> an NFS. But results are the same. I have visited several forums (in particular
> http://www.open-mpi.org/community/lists/users/2009/08/10394.php)
> and read lots of threads, but as I am not an expert at clusters, I presently
> do not see where it is wrong!
>
> Is it a problem in the configuration of PBS (I have installed it from the
> deb packages), a subtle compilation option of openMPI, or a bad network
> configuration?
>
> Regards.
>
> B. S.

Hello,

I am a Code_Saturne developer, so I can confirm a few comments from others on this list:

- Most of the communication of the code is latency-bound: we use iterative linear solvers, which make heavy use of MPI_Allreduce, with only 1 to 3 double precision values per reduction. I do not know if modern "fast" Ethernet variants on a small number of switches make a big difference, but tests made a few years ago on a cluster using a SCALI network (fast/low latency at the time) led to the conclusion that the code performance was divided by 2 on an Ethernet network. These tests need to be updated, but your results seem consistent. (A sketch of this communication pattern is given after this message.)

- Actually, on an Infiniband cluster using Open MPI 1.4.3 (such as the one described here: http://i.top500.org/system/177030), performance tends to be better in some cases when spreading a constant number of cores on more nodes, as the code is quite memory-bandwidth intensive. Depending on the data size on each node, this may be significant or lead to only minor performance differences. The network topology may also affect performance (tests using SLURM's --switches option confirm this), as well as binding processes to cores.

- In recent years, the code has been used and tested mainly on workstations (shared memory), Infiniband clusters, or IBM Blue Gene (L, P, and Q) or a Cray XT (5 and 6) then XE-6 machine. I am interested in trying (or at least trying) to improve performance on Ethernet clusters, and I may have a few suggestions for options you can test, but this conversation should probably move to the Code_Saturne forum (http://code-saturne.org), as we will go into some options of our linear solvers which are specific to that code, not to Open MPI.

Best regards,

  Yvan Fournier
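To make the first point concrete, the pattern in question is essentially the following (an illustrative sketch, not actual Code_Saturne source): each solver iteration reduces only one or a few scalars across all ranks, so on Ethernet the cost is dominated by per-message latency rather than bandwidth.

#include <mpi.h>

/* Global dot product of two distributed vectors (local size n),
   as used at each iteration of an iterative linear solver. */
static double
dot_product_global(const double *x, const double *y, int n)
{
  int i;
  double s_local = 0.0, s_global = 0.0;

  for (i = 0; i < n; i++)
    s_local += x[i] * y[i];

  /* Only one double per reduction: the message is tiny, so the cost is
     almost entirely network latency, repeated at every solver iteration. */
  MPI_Allreduce(&s_local, &s_global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  return s_global;
}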
[OMPI users] Bug in Open MPI 1.2.3 using MPI_Recv with an indexed datatype
Hello,

I seem to have encountered a new bug in Open MPI 1.2.3 using indexed datatypes with MPI_Recv (which seems to be of the "off by one" sort), different from the bug I submitted in 2006, which was corrected since. This bug leads to a segfault, and I have only encountered it on one data set (a relatively large set for a 2-processor run). I have reproduced the segfault on 2 different Linux systems (Debian Sarge on a dual-processor Intel Xeon, Kubuntu 7.04 on a single-processor Centrino system).

A means to reproduce it on 2 ranks can be found at:

  http://yvan.fournier.free.fr/OpenMPI/ompi_datatype_bug_2.tar.gz

(the program is very simple, but the displacements array required to reproduce it is too large for the mailing list).

The program does not print any output, but does not segfault when functioning properly, or when USE_INDEXED_DATATYPE is unset (lines 57-58). It works with LAM 7.1.1 and MPICH2, but fails under Open MPI.

This is a (much) simplified extract from a part of Code_Saturne's FVM library (http://rd.edf.com/code_saturne/), which otherwise works fine on most data using Open MPI.

Best regards,

  Yvan Fournier
[OMPI users] bug in MPI_File_get_position_shared ?
Hello,

I seem to have encountered a bug in MPI-IO, in which MPI_File_get_position_shared hangs when called by multiple processes in a communicator. It can be illustrated by the following simple test case, in which a file is simply created with C I/O, and opened with MPI-IO (defining or undefining MY_MPI_IO_BUG on line 5 enables/disables the bug). From the MPI-2 documentation, it seems that all processes should be able to call MPI_File_get_position_shared, but if more than one process uses it, it fails. Setting the shared pointer helps, but this should not be necessary, and the code still hangs (in more complete code, after writing data).

I encounter the same problem with Open MPI 1.2.6 and MPICH2 1.0.7, so I may have misread the documentation, but I suspect a ROMIO bug.

Best regards,

  Yvan Fournier

/*
 * Parallel file I/O shared pointer bug test
 */

#define MY_MPI_IO_BUG 1

/*
 * Standard C library headers
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <mpi.h>

/*---------------------------------------------------------------------------*/

#ifdef __cplusplus
extern "C" {
#if 0
} /* Fake brace to force Emacs auto-indentation back to column 0 */
#endif
#endif /* __cplusplus */

/*
 * Private function definitions
 */

/*
 * Output MPI error message.
 *
 * This supposes that the default MPI errorhandler is not used
 *
 * parameters:
 *   error_code <-- associated MPI error code
 *
 * returns:
 *   0 in case of success, system error code in case of failure
 */

static void
_mpi_io_error_message(int error_code)
{
  char buffer[MPI_MAX_ERROR_STRING];
  int  buffer_len;

  MPI_Error_string(error_code, buffer, &buffer_len);

  printf("MPI IO error %d: %s", error_code, buffer);
}

/*
 * Return the position of the file pointer.
 *
 * When using MPI-IO with individual file pointers, we consider the file
 * pointer to be equal to the highest value of the individual file pointers.
 *
 * parameters:
 *   fh <-- MPI IO file descriptor
 *
 * returns:
 *   current position of the file pointer
 */

MPI_Offset
_mpi_file_tell(MPI_File fh)
{
  int errcode = MPI_SUCCESS;
  MPI_Offset offset = 0, disp = 0, retval = 0;
  int rank;

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#if defined(MY_MPI_IO_BUG)

  printf("rank %d: will call MPI_File_get_position_shared\n", rank);

  errcode = MPI_File_get_position_shared(fh, &offset);

  if (errcode == MPI_SUCCESS) {
    MPI_File_get_byte_offset(fh, offset, &disp);
    retval = disp;
  }

  printf("rank %d: offsets: %ld %ld\n", rank, (long)offset, (long)disp);

#else

  long aux[2];

  if (rank == 0) {
    printf("root rank will call MPI_File_get_position_shared\n");
    errcode = MPI_File_get_position_shared(fh, &offset);
    if (errcode == MPI_SUCCESS) {
      MPI_File_get_byte_offset(fh, offset, &disp);
      retval = disp;
    }
    aux[0] = disp;
    aux[1] = retval;
  }

  MPI_Bcast(aux, 2, MPI_LONG, 0, MPI_COMM_WORLD);

  disp = aux[0];
  retval = aux[1];

  printf("rank %d: offsets: %ld %ld\n", rank, (long)offset, (long)disp);

#endif

  if (errcode != MPI_SUCCESS)
    _mpi_io_error_message(errcode);

  return retval;
}

/*
 * Unit test
 */

static void
_create_test_data(void)
{
  int i;
  FILE *f;

  char header[80];
  char footer[80];

  sprintf(header, "fvm test file");
  for (i = strlen(header); i < 80; i++)
    header[i] = '\0';

  sprintf(footer, "fvm test file end");
  for (i = strlen(footer); i < 80; i++)
    footer[i] = '\0';

  f = fopen("file_test_data", "w+");

  fwrite(header, 1, 80, f);
  fwrite(footer, 1, 80, f);

  fclose(f);
}

/*---------------------------------------------------------------------------*/

int
main(int argc, char *argv[])
{
  int rank = 0;
  int retval = MPI_SUCCESS;

  MPI_Offset offset;
  MPI_File fh = MPI_FILE_NULL;

  /* Initialization */

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)
    _create_test_data();

  /* Open file */
Re: [OMPI users] bug in MPI_File_get_position_shared ?
Thanks. I had also posted the bug on the MPICH2 list, and received an answer from the ROMIO maintainers: the issue seems to be related to NFS file locking bugs.

I had been testing on an NFS file system, and when I re-tested under a local (ext3) file system, I did not reproduce the bug. I had been experimenting with MPI-IO using explicit offsets, individual pointers, and shared pointers, and have workarounds, so I'll just avoid shared pointers on NFS.

Best regards,

  Yvan Fournier
  EDF R&D

On Sat, 2008-08-16 at 08:19 -0400, users-requ...@open-mpi.org wrote:
> Date: Sat, 16 Aug 2008 08:05:14 -0400
> From: Jeff Squyres
> Subject: Re: [OMPI users] bug in MPI_File_get_position_shared ?
> To: Open MPI Users
> Cc: mpich2-ma...@mcs.anl.gov
> Message-ID: <023f1db0-8e8d-4c8c-8156-80ae52ff0...@cisco.com>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> On Aug 13, 2008, at 7:06 PM, Yvan Fournier wrote:
>
> > I seem to have encountered a bug in MPI-IO, in which
> > MPI_File_get_position_shared hangs when called by multiple processes in
> > a communicator. It can be illustrated by the following simple test case,
> > in which a file is simply created with C IO, and opened with MPI-IO
> > (defining or undefining MY_MPI_IO_BUG on line 5 enables/disables the
> > bug). From the MPI2 documentation, it seems that all processes should be
> > able to call MPI_File_get_position_shared, but if more than one process
> > uses it, it fails. Setting the shared pointer helps, but this should not
> > be necessary, and the code still hangs (in more complete code, after
> > writing data).
> >
> > I encounter the same problem with Open MPI 1.2.6 and MPICH2 1.0.7, so
> > I may have misread the documentation, but I suspect a ROMIO bug.
>
> Bummer. :-(
>
> It would be best to report this directly to the ROMIO maintainers via
> romio-ma...@mcs.anl.gov. They lurk on this list, but they may not be
> paying attention to every mail.
>
> If you wouldn't mind, please CC me on the mail to romio-maint. Thanks!
>
> --
> Jeff Squyres
> Cisco Systems
[OMPI users] MPI IO bug test case for OpenMPI 1.3
Hello,

Some weeks ago, I reported a problem using MPI IO in OpenMPI 1.3, which did not occur with OpenMPI 1.2 or MPICH2. The bug was encountered with the Code_Saturne CFD tool (http://www.code-saturne.org), and seemed to be an issue with individual file pointers, as another mode using explicit offsets worked fine.

I have finally extracted the read pattern from the complete case, so as to generate the simple test case attached. Further testing showed that the bug could be reproduced easily using only part of the read pattern, so I commented most of the patterns from the original case using #if 0 / #endif.

The test should be run with an MPI_COMM_WORLD size of 2. Initially, rank 0 generates a simple binary file using POSIX I/O, containing the values 0, 1, 2, ... (300 blocks of 1024 ints). The file is then opened for reading using MPI IO, and as the values expected at a given offset are easily determined, read values are compared to expected values, and MPI_Abort is called in case of an error.

I also added a USE_FILE_TYPE macro definition, which can be undefined to "turn off" the bug. Basically, I have:

  #ifdef USE_FILE_TYPE
    MPI_Type_hindexed(1, lengths, disps, MPI_BYTE, &file_type);
    MPI_Type_commit(&file_type);
    MPI_File_set_view(fh, offset, MPI_BYTE, file_type, datarep, MPI_INFO_NULL);
  #else
    MPI_File_set_view(fh, offset+disps[0], MPI_BYTE, MPI_BYTE, datarep, MPI_INFO_NULL);
  #endif

    retval = MPI_File_read_all(fh, buf, (int)(lengths[0]), MPI_BYTE, &status);

  #if USE_FILE_TYPE
    MPI_Type_free(&file_type);
  #endif

Using the file type indexed datatype, I exhibit the bug with both versions 1.3.0 and 1.3.2 of OpenMPI.

Best regards,

  Yvan Fournier

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#include <mpi.h>

#define USE_FILE_TYPE 1
/* #undef USE_FILE_TYPE */

static void
_create_test_data(void)
{
  int i, j;
  FILE *f;
  int buf[1024];

  f = fopen("test_data", "w");

  for (i = 0; i < 300; i++) {
    for (j = 0; j < 1024; j++)
      buf[j] = i*1024 + j;
    fwrite(buf, sizeof(int), 1024, f);
  }

  fclose(f);
}

static void
_mpi_io_error_message(int error_code)
{
  char buffer[MPI_MAX_ERROR_STRING];
  int buffer_len;

  MPI_Error_string(error_code, buffer, &buffer_len);

  fprintf(stderr, "MPI IO error: %s\n", buffer);
}

static void
_test_for_corruption(int buf[], int base_offset, int rank_offset, int ni)
{
  int i;
  int n_ints = ni / sizeof(int);
  int int_shift = (base_offset + rank_offset) / sizeof(int);

  for (i = 0; i < n_ints; i++) {
    if (buf[i] != int_shift + i) {
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("i = %d, buf = %d, ref = %d\n", i, buf[i], int_shift + i);
      fprintf(stderr,
              "rank %d, base offset %d, rank offset %d, size %d: corruption\n",
              rank, base_offset, rank_offset, ni);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
  }
}

static void
_read_global_block(MPI_File  fh,
                   int       offset,
                   int       ni)
{
  MPI_Datatype file_type;
  MPI_Aint disps[1];
  MPI_Status status;
  int *buf;
  int lengths[1];
  char datarep[] = "native";
  int retval = 0;

  lengths[0] = ni;
  disps[0] = 0;

  buf = malloc(ni);
  assert(buf != NULL);

  MPI_Type_hindexed(1, lengths, disps, MPI_BYTE, &file_type);
  MPI_Type_commit(&file_type);

  MPI_File_set_view(fh, offset, MPI_BYTE, file_type, datarep, MPI_INFO_NULL);

  retval = MPI_File_read_all(fh, buf, ni, MPI_BYTE, &status);

  MPI_Type_free(&file_type);

  if (retval != MPI_SUCCESS)
    _mpi_io_error_message(retval);

  _test_for_corruption(buf, offset, 0, ni);

  free(buf);
}

static void
_read_block_ip(MPI_File  fh,
               int       offset,
               int       displ,
               int       ni)
{
  int errcode;
  int *buf;
  int lengths[1];
  MPI_Aint disps[1];
  MPI_Status status;
  MPI_Datatype file_type;
  char datarep[] = "native";
  int retval = 0;

  buf = malloc(ni);
  assert(buf != NULL);

  lengths[0] = ni;
  disps[0] = displ;

#ifdef USE_FILE_TYPE
  MPI_Type_hindexed(1, lengths, disps, MPI_BYTE, &file_type);
  MPI_Type_commit(&file_type);
  MPI_File_set_view(fh, offset, MPI_BYTE, file_type, datarep, MPI_INFO_NULL);
#else
  MPI_File_set_view(fh, offset+displ, MPI_BYTE, MPI_BYTE, datarep, MPI_INFO_NULL);
#endif

  retval = MPI_File_read_all(fh, buf, (int)(lengths[0]), MPI_BYTE, &status);

  if (retval != MPI_SUCCESS)
    _mpi_io_error_message(retval);

#if USE_FILE_TYPE
  MPI_Type_free(&file_type);
#endif

  _test_for_corruption(buf, offset, displ, ni);

  free(buf);
}

int
main(int argc, char **argv)
{
  int rank;
  int retval;
  MPI_File fh;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    _create_test_data();
  }
[OMPI users] False positives with OpenMPI and memchecker
Hello,

I obtain false positives with OpenMPI when memchecker is enabled, using OpenMPI 3.0.0. This is similar to an issue I had reported and which had been fixed in Nov. 2016, but it affects MPI_Isend/MPI_Irecv instead of MPI_Send/MPI_Recv. I had not done much additional testing of my application using memchecker since, so I probably missed remaining issues at the time.

In the attached test (which has 2 optional variants relating to whether the send and receive buffers are allocated on the stack or heap, but which exhibit the same basic issue), running

  mpicc -g vg_ompi_isend_irecv.c && mpiexec -n 2 valgrind ./a.out

I have:

  ==19651== Memcheck, a memory error detector
  ==19651== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==19651== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
  ==19651== Command: ./a.out
  ==19651==
  ==19650== Thread 3:
  ==19650== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19650==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19650==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19650==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19650==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19650==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19650==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19650==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19650==
  ==19651== Thread 3:
  ==19651== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19651==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19651==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19651==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19651==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19651==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19651==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19651==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19651==
  ==19650== Thread 1:
  ==19650== Invalid read of size 2
  ==19650==    at 0x4C33BA0: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==19650==    by 0x5A27C85: opal_convertor_pack (in /home/yvan/opt/openmpi-3.0/lib/libopen-pal.so.40.0.0)
  ==19650==    by 0xD177EF1: mca_btl_vader_sendi (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_btl_vader.so)
  ==19650==    by 0xE1A7F31: mca_pml_ob1_send_inline.constprop.4 (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0xE1A8711: mca_pml_ob1_isend (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0x4EB4C83: PMPI_Isend (in /home/yvan/opt/openmpi-3.0/lib/libmpi.so.40.0.0)
  ==19650==    by 0x108B24: main (vg_ompi_isend_irecv.c:63)
  ==19650==  Address 0x1ffefffcc4 is on thread 1's stack
  ==19650==  in frame #6, created by main (vg_ompi_isend_irecv.c:7)

The first 2 warnings seem to relate to initialization, so they are not a big issue, but the last one occurs whenever I use MPI_Isend, so it is a more important issue.

Using a version built without --enable-memchecker, I also have the two initialization warnings, but not the warning from MPI_Isend...

Best regards,

  Yvan Fournier
Re: [OMPI users] False positives with OpenMPI and memchecker (with attachment)
Hello,

Sorry, I forgot the attached test case in my previous message... :(

Best regards,

  Yvan Fournier

----- Mail transferred -----
From: "yvan fournier"
To: users@lists.open-mpi.org
Sent: Sunday January 7 2018 01:43:16
Object: False positives with OpenMPI and memchecker

Hello,

I obtain false positives with OpenMPI when memchecker is enabled, using OpenMPI 3.0.0. This is similar to an issue I had reported and which had been fixed in Nov. 2016, but it affects MPI_Isend/MPI_Irecv instead of MPI_Send/MPI_Recv. I had not done much additional testing of my application using memchecker since, so I probably missed remaining issues at the time.

In the attached test (which has 2 optional variants relating to whether the send and receive buffers are allocated on the stack or heap, but which exhibit the same basic issue), running

  mpicc -g vg_ompi_isend_irecv.c && mpiexec -n 2 valgrind ./a.out

I have:

  ==19651== Memcheck, a memory error detector
  ==19651== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==19651== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
  ==19651== Command: ./a.out
  ==19651==
  ==19650== Thread 3:
  ==19650== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19650==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19650==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19650==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19650==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19650==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19650==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19650==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19650==
  ==19651== Thread 3:
  ==19651== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19651==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19651==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19651==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19651==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19651==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19651==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19651==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19651==
  ==19650== Thread 1:
  ==19650== Invalid read of size 2
  ==19650==    at 0x4C33BA0: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==19650==    by 0x5A27C85: opal_convertor_pack (in /home/yvan/opt/openmpi-3.0/lib/libopen-pal.so.40.0.0)
  ==19650==    by 0xD177EF1: mca_btl_vader_sendi (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_btl_vader.so)
  ==19650==    by 0xE1A7F31: mca_pml_ob1_send_inline.constprop.4 (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0xE1A8711: mca_pml_ob1_isend (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0x4EB4C83: PMPI_Isend (in /home/yvan/opt/openmpi-3.0/lib/libmpi.so.40.0.0)
  ==19650==    by 0x108B24: main (vg_ompi_isend_irecv.c:63)
  ==19650==  Address 0x1ffefffcc4 is on thread 1's stack
  ==19650==  in frame #6, created by main (vg_ompi_isend_irecv.c:7)

The first 2 warnings seem to relate to initialization, so they are not a big issue, but the last one occurs whenever I use MPI_Isend, so it is a more important issue.

Using a version built without --enable-memchecker, I also have the two initialization warnings, but not the warning from MPI_Isend...

Best regards,

  Yvan Fournier

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
  MPI_Request request[2];
  MPI_Status status[2];

  int l = 5, l_prev = 0;
  int rank_next = MPI_PROC_NULL, rank_prev = MPI_PROC_NULL;
  int rank_id = 0, n_ranks = 1, tag = 1;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

  if (rank_id > 0)
    rank_prev = rank_id - 1;
  if (rank_id + 1 < n_ranks)
    rank_next = rank_id + 1;

#if defined(VARIANT_1)

  int sendbuf[1] = {l};
  int recvbuf[1] = {0};

  if (rank_id % 2 == 0) {
    MPI_Isend(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD, &(request[0]));
    MPI_Irecv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &(request[1]));
  }
  else {
    MPI_Irecv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &(request[0]));
    MPI_Isend(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD, &(request[1]));
  }

  MPI_Waitall(2, request, status);

  l_prev = recvbuf[0];

#elif defined(VARIANT_2)

  int *sendbuf = malloc(sizeof(int));
  int *recvbuf = malloc(sizeof(int));

  sendbuf[0] = l;

  if (rank_id % 2 == 0) {
    MPI_Isend(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD, &(request[0]));
    MPI_Irecv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &(request[1]));
  }
  else {
    MPI_Irecv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &(request[0]));
Re: [OMPI users] False positives with OpenMPI and memchecker (seems fixed between 3.0.0 and 3.0.1-rc1)
Hello,

Answering myself here: checking the revision history, commits 3b8b8c52c519f64cb3ff147db49fcac7cbd0e7d7 or 66c9485e77f7da9a212ae67c88a21f95f13e6652 (in master) seem to relate to this, so I checked using the latest downloadable 3.0.x nightly release, and do not reproduce the issue anymore...

Sorry for the (too-late) report...

  Yvan

----- Mail original -----
From: "yvan fournier"
To: users@lists.open-mpi.org
Sent: Sunday January 7 2018 01:52:04
Object: Re: False positives with OpenMPI and memchecker (with attachment)

Hello,

Sorry, I forgot the attached test case in my previous message... :(

Best regards,

  Yvan Fournier

----- Mail transferred -----
From: "yvan fournier"
To: users@lists.open-mpi.org
Sent: Sunday January 7 2018 01:43:16
Object: False positives with OpenMPI and memchecker

Hello,

I obtain false positives with OpenMPI when memchecker is enabled, using OpenMPI 3.0.0. This is similar to an issue I had reported and which had been fixed in Nov. 2016, but it affects MPI_Isend/MPI_Irecv instead of MPI_Send/MPI_Recv. I had not done much additional testing of my application using memchecker since, so I probably missed remaining issues at the time.

In the attached test (which has 2 optional variants relating to whether the send and receive buffers are allocated on the stack or heap, but which exhibit the same basic issue), running

  mpicc -g vg_ompi_isend_irecv.c && mpiexec -n 2 valgrind ./a.out

I have:

  ==19651== Memcheck, a memory error detector
  ==19651== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==19651== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
  ==19651== Command: ./a.out
  ==19651==
  ==19650== Thread 3:
  ==19650== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19650==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19650==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19650==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19650==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19650==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19650==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19650==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19650==
  ==19651== Thread 3:
  ==19651== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19651==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19651==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19651==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19651==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19651==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19651==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19651==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19651==
  ==19650== Thread 1:
  ==19650== Invalid read of size 2
  ==19650==    at 0x4C33BA0: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==19650==    by 0x5A27C85: opal_convertor_pack (in /home/yvan/opt/openmpi-3.0/lib/libopen-pal.so.40.0.0)
  ==19650==    by 0xD177EF1: mca_btl_vader_sendi (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_btl_vader.so)
  ==19650==    by 0xE1A7F31: mca_pml_ob1_send_inline.constprop.4 (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0xE1A8711: mca_pml_ob1_isend (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0x4EB4C83: PMPI_Isend (in /home/yvan/opt/openmpi-3.0/lib/libmpi.so.40.0.0)
  ==19650==    by 0x108B24: main (vg_ompi_isend_irecv.c:63)
  ==19650==  Address 0x1ffefffcc4 is on thread 1's stack
  ==19650==  in frame #6, created by main (vg_ompi_isend_irecv.c:7)

The first 2 warnings seem to relate to initialization, so they are not a big issue, but the last one occurs whenever I use MPI_Isend, so it is a more important issue.

Using a version built without --enable-memchecker, I also have the two initialization warnings, but not the warning from MPI_Isend...

Best regards,

  Yvan Fournier
Re: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1
Hello to all,

I have also encountered a similar bug with MPI-IO with Open MPI 1.3.1, reading a Code_Saturne preprocessed mesh file (www.code-saturne.org). Reading the file can be done using 2 MPI-IO modes, or one non-MPI-IO mode.

The first MPI-IO mode uses individual file pointers, and involves a series of MPI_File_read_all calls with all ranks using the same view (for record headers), interlaced with MPI_File_read_all calls with ranks using different views (for record data, successive blocks being read by each rank). The second MPI-IO mode uses explicit file offsets, with MPI_File_read_at_all instead of MPI_File_read_all.

Both MPI-IO modes seem to work fine with OpenMPI 1.2, MPICH2, and variants on IBM Blue Gene/L and P, as well as Bull Novascale, but with OpenMPI 1.3.1, data read seems to be corrupt on at least one file using the individual file pointers approach (though it works well using explicit offsets).

The bug does not appear in unit tests, and it only appears after several records are read on the case that does fail (on 2 ranks), so to reproduce it with a simple program, I would have to extract the exact file access patterns from the exact case which fails, which would require a few extra hours of work. If the bug is not reproduced in a simpler manner first, I will try to build a simple program reproducing the bug within a week or 2, but in the meantime, I just want to confirm Scott's observation (hoping it is the same bug).

Best regards,

  Yvan Fournier

On Mon, 2009-04-06 at 16:03 -0400, users-requ...@open-mpi.org wrote:
> Date: Mon, 06 Apr 2009 12:16:18 -0600
> From: Scott Collis
> Subject: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1
> To: us...@open-mpi.org
> Message-ID:
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> I have been a user of MPI-IO for 4+ years and have a code that has run
> correctly with MPICH, MPICH2, and OpenMPI 1.2.*
>
> I recently upgraded to OpenMPI 1.3.1 and immediately noticed that my
> MPI-IO generated output files are corrupted. I have not yet had a
> chance to debug this in detail, but it appears that
> MPI_File_write_all() commands are not placing information correctly on
> their file_view when running with more than 1 processor (everything is
> okay with -np 1).
>
> Note that I have observed the same incorrect behavior on both Linux
> and OS-X. I have also gone back and made sure that the same code
> works with MPICH, MPICH2, and OpenMPI 1.2.* so I'm fairly confident
> that something has been changed or broken as of OpenMPI 1.3.*. Just
> today, I checked out the SVN repository version of OpenMPI and built
> and tested my code with that and the results are incorrect just as for
> the 1.3.1 tarball.
>
> While I plan to continue to debug this and will try to put together a
> small test that demonstrates the issue, I thought that I would first
> send out this message to see if this might trigger a thought within
> the OpenMPI development team as to where this issue might be.
>
> Please let me know if you have any ideas as I would very much
> appreciate it!
>
> Thanks in advance,
>
> Scott
> --
> Scott Collis
> sscol...@me.com
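For reference, the individual-file-pointer mode described above follows roughly the pattern sketched here. The record header layout, sizes, and block partitioning are simplified assumptions for illustration; this is not the actual FVM reader code.

#include <stdlib.h>
#include <mpi.h>

/* Read one record: all ranks collectively read the header through a
   common view, then each rank reads its own block of the record data
   through a per-rank view (individual file pointers). */
static MPI_Offset
read_record(MPI_File fh, MPI_Offset offset, int rank, int n_ranks)
{
  MPI_Status status;
  long header[2];            /* hypothetical: {record id, data size in bytes} */
  long block_size, block_start;
  char *buf;

  /* Phase 1: same view on all ranks, collective read of the header. */
  MPI_File_set_view(fh, offset, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
  MPI_File_read_all(fh, header, 2, MPI_LONG, &status);
  offset += 2 * sizeof(long);

  /* Phase 2: record data split into successive per-rank blocks;
     each rank shifts its view to the start of its own block. */
  block_size  = (header[1] + n_ranks - 1) / n_ranks;
  block_start = (long)rank * block_size;
  if (block_start + block_size > header[1])
    block_size = header[1] - block_start;
  if (block_size < 0)
    block_size = 0;

  buf = malloc(block_size > 0 ? block_size : 1);

  MPI_File_set_view(fh, offset + block_start, MPI_BYTE, MPI_BYTE,
                    "native", MPI_INFO_NULL);
  MPI_File_read_all(fh, buf, (int)block_size, MPI_BYTE, &status);

  free(buf);

  return offset + header[1];   /* offset of the next record header */
}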
[OMPI users] Datatype bug regression from Open MPI 1.0.2 to Open MPI 1.1
Hello,

I had encountered a bug in Open MPI 1.0.1 using indexed datatypes with MPI_Recv (which seems to be of the "off by one" sort), which was corrected in Open MPI 1.0.2.

It seems to have resurfaced in Open MPI 1.1 (I encountered it using different data and did not recognize it immediately, but it seems it can be reproduced using the same simplified test I had sent the first time, which I re-attach here just in case).

Here is a summary of the case:

--

Each processor reads a file ("data_p0" or "data_p1") giving a list of global element ids. Some elements (vertices from a partitioned mesh) may belong to both processors, so their ids may appear on both processors: we have 7178 global vertices, 3654 and 3688 of them being known by ranks 0 and 1 respectively.

In this simplified version, we assign coordinates {x, y, z} to each vertex equal to its global id number for rank 1, and the negative of that for rank 0 (assigning the same values to x, y, and z). After finishing the "ordered gather", rank 0 prints the global id and coordinates of each vertex.

Lines should print (for example) as:

  6456 ; 6455.0 6455.0 6456.0
  6457 ; -6457.0 -6457.0 -6457.0

depending on whether a vertex belongs only to rank 0 (negative coordinates) or belongs to rank 1 (positive coordinates).

With the OMPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on Debian Sarge with gcc 3.4), we have for example for the last vertices:

  7176 ; 7175.0 7175.0 7176.0
  7177 ; 7176.0 7176.0 7177.0

seeming to indicate an "off by one" type bug in datatype handling.

Not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE in the gather_test.c file), the bug disappears.

--

Best regards,

  Yvan Fournier

Attachment: ompi_datatype_bug.tar.gz (application/compressed-tar)
Re: [OMPI users] users Digest, Vol 328, Issue 1
Hello,

I just retried replicating the datatype bug on a SUSE Linux 10.1 system (on a 32-bit Pentium-M system). Actually, I even get a segmentation fault at some point. I attach the logfile for the test case compiled in debug mode, run once directly, then again with valgrind, as well as my ompi_info output.

I have also encountered the bug on the "parent" case (similar, but more complex) on my work machine (dual Xeon under Debian Sarge), but I'll check this simpler test on it just in case.

Best regards,

  Yvan Fournier

On Sun, 2006-07-09 at 12:00 -0400, users-requ...@open-mpi.org wrote:
> Send users mailing list submissions to
>         us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>         users-requ...@open-mpi.org
>
> You can reach the person managing the list at
>         users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
> Today's Topics:
>
>    1. Re: Datatype bug regression from Open MPI 1.0.2 to Open MPI
>       1.1 (George Bosilca)
>
> --
>
> Message: 1
> Date: Sat, 8 Jul 2006 13:47:05 -0400 (Eastern Daylight Time)
> From: George Bosilca
> Subject: Re: [OMPI users] Datatype bug regression from Open MPI 1.0.2
>         to Open MPI 1.1
> To: Open MPI Users
> Message-ID:
> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>
> Yvan,
>
> I'm unable to replicate this one with the latest Open MPI trunk version.
> As there is no difference between the trunk and the latest 1.1 version on
> the datatype, I think the bug cannot be reproduced using the 1.1 either. I
> compiled the test twice, once using the indexed datatype and once without,
> and the output is exactly the same. I ran it on my Apple G5 desktop as
> well as on a cluster of AMD 64, over shared memory and TCP. Can you please
> recheck that your error is coming from the indexed type.
>
>    Thanks,
>      george.
>
> On Sat, 1 Jul 2006, Yvan Fournier wrote:
>
> > Hello,
> >
> > I had encountered a bug in Open MPI 1.0.1 using indexed datatypes
> > with MPI_Recv (which seems to be of the "off by one" sort), which
> > was corrected in Open MPI 1.0.2.
> >
> > It seems to have resurfaced in Open MPI 1.1 (I encountered it using
> > different data and did not recognize it immediately, but it seems
> > it can be reproduced using the same simplified test I had sent
> > the first time, which I re-attach here just in case).
> >
> > Here is a summary of the case:
> >
> > --
> >
> > Each processor reads a file ("data_p0" or "data_p1") giving a list of
> > global element ids. Some elements (vertices from a partitioned mesh)
> > may belong to both processors, so their ids may appear on both
> > processors: we have 7178 global vertices, 3654 and 3688 of them being
> > known by ranks 0 and 1 respectively.
> >
> > In this simplified version, we assign coordinates {x, y, z} to each
> > vertex equal to its global id number for rank 1, and the negative of
> > that for rank 0 (assigning the same values to x, y, and z). After
> > finishing the "ordered gather", rank 0 prints the global id and
> > coordinates of each vertex.
> >
> > Lines should print (for example) as:
> >   6456 ; 6455.0 6455.0 6456.0
> >   6457 ; -6457.0 -6457.0 -6457.0
> > depending on whether a vertex belongs only to rank 0 (negative
> > coordinates) or belongs to rank 1 (positive coordinates).
> >
> > With the OMPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on
> > Debian Sarge with gcc 3.4), we have for example for the last vertices:
> >   7176 ; 7175.0 7175.0 7176.0
> >   7177 ; 7176.0 7176.0 7177.0
> > seeming to indicate an "off by one" type bug in datatype handling.
> >
> > Not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE
> > in the gather_test.c file), the bug disappears.
> >
> > --
> >
> > Best regards,
> >
> >   Yvan Fournier
>
> "We must accept finite disappointment, but we must never lose infinite
> hope."
>   Martin Luther King