[OMPI users] Bug in OMPI 1.0.1 using MPI_Recv with indexed datatypes
Hello,

I seem to have encountered a bug in Open MPI 1.0.1 using indexed datatypes with MPI_Recv (which seems to be of the "off by one" sort). I have attached a test case, which is briefly explained below (as well as in the source file). This case should run on two processes.

I observed the bug on 2 different Linux systems (single-processor Centrino under Suse 10.0 with gcc 4.0.2, dual-processor Xeon under Debian Sarge with gcc 3.4) with Open MPI 1.0.1, and do not observe it using LAM 7.1.1 or MPICH2.

Here is a summary of the case:

--

Each processor reads a file ("data_p0" or "data_p1") giving a list of global element ids. Some elements (vertices from a partitioned mesh) may belong to both processors, so their ids may appear on both processors: we have 7178 global vertices, 3654 and 3688 of them being known by ranks 0 and 1 respectively.

In this simplified version, we assign coordinates {x, y, z} to each vertex equal to its global id number for rank 1, and the negative of that for rank 0 (assigning the same values to x, y, and z). After finishing the "ordered gather", rank 0 prints the global id and coordinates of each vertex.

Lines should print (for example) as:

  6456 ; 6455.0 6455.0 6456.0
  6457 ; -6457.0 -6457.0 -6457.0

depending on whether a vertex belongs only to rank 0 (negative coordinates) or belongs to rank 1 (positive coordinates).

With the Open MPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on Debian Sarge with gcc 3.4), we have for example for the last vertices:

  7176 ; 7175.0 7175.0 7176.0
  7177 ; 7176.0 7176.0 7177.0

seeming to indicate an "off by one" type bug in datatype handling.

Not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE in the gather_test.c file), the bug disappears. Using the indexed datatype with LAM MPI 7.1.1 or MPICH2, we do not reproduce the bug either, so it does seem to be an Open MPI issue.

--

Best regards,

  Yvan Fournier

Attachment: ompi_datatype_bug.tar.gz (application/compressed-tar)
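For reference, the receive side of such an "ordered gather" with an indexed datatype looks roughly like the sketch below. This is only a minimal illustration of the technique, not the attached gather_test.c: the helper name, the 0-based ids and the 3-doubles-per-vertex layout are assumptions.

/* Rank 0 receives each remote vertex's {x, y, z} directly into its slot
 * of a global coordinate array through an MPI indexed datatype. */

#include <stdlib.h>
#include <mpi.h>

static void
recv_coords_indexed(double     *g_coords,   /* size 3 * n_global_vertices */
                    const int  *recv_ids,   /* 0-based global ids sent by src */
                    int         n_recv,
                    int         src_rank)
{
  int i;
  MPI_Datatype recv_type;
  MPI_Status status;

  int *blocklengths  = malloc(n_recv * sizeof(int));
  int *displacements = malloc(n_recv * sizeof(int));

  for (i = 0; i < n_recv; i++) {
    blocklengths[i]  = 3;                 /* x, y, z */
    displacements[i] = recv_ids[i] * 3;   /* in units of MPI_DOUBLE */
  }

  MPI_Type_indexed(n_recv, blocklengths, displacements, MPI_DOUBLE, &recv_type);
  MPI_Type_commit(&recv_type);

  /* A single receive scatters the sender's contiguous buffer into place. */
  MPI_Recv(g_coords, 1, recv_type, src_rank, 0, MPI_COMM_WORLD, &status);

  MPI_Type_free(&recv_type);
  free(blocklengths);
  free(displacements);
}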
[OMPI users] False positives and even failure with Open MPI and memchecker
Hello,

I have observed what seem to be false positives running under Valgrind when Open MPI is built with --enable-memchecker (at least with versions 1.10.4 and 2.0.1).

Attached is a simple test case (extracted from a larger code) that sends one int to rank r+1, and receives from rank r-1 (using MPI_PROC_NULL to handle ranks below 0 or above comm size).

Using:

  ~/opt/openmpi-2.0/bin/mpicc -DVARIANT_1 vg_mpi.c
  ~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out

I get the following Valgrind error for rank 1:

  ==8382== Invalid read of size 4
  ==8382==    at 0x400A00: main (in /home/yvan/test/a.out)
  ==8382==  Address 0xffefffe70 is on thread 1's stack
  ==8382==  in frame #0, created by main (???:)

Using:

  ~/opt/openmpi-2.0/bin/mpicc -DVARIANT_2 vg_mpi.c
  ~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out

I get the following Valgrind error for rank 1:

  ==8322== Invalid read of size 4
  ==8322==    at 0x400A6C: main (in /home/yvan/test/a.out)
  ==8322==  Address 0xcb6f9a0 is 0 bytes inside a block of size 4 alloc'd
  ==8322==    at 0x4C29BBE: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==8322==    by 0x400998: main (in /home/yvan/test/a.out)

I get no error for the default variant (no -DVARIANT_...) with either Open MPI 2.0.1 or 1.10.4, but do get an error similar to variant 1 in the parent code from which the example (given below) was extracted. Running under Valgrind's gdb server, for the parent code of variant 1, it even seems the value received on rank 1 is uninitialized, and then Valgrind complains with the given message.

The code fails to work as intended when run under Valgrind when Open MPI is built with --enable-memchecker, while it works fine when run with the same build but not under Valgrind, or when run under Valgrind with Open MPI built without memchecker.

I'm running under Arch Linux (whose packaged Open MPI 1.10.4 is built with memchecker enabled, rendering it unusable under Valgrind).

Did anybody else encounter this type of issue, or does my code contain an obvious mistake that I am missing? I initially thought of possible alignment issues, but saw nothing in the standard that requires that, and the "malloc"-based variant exhibits the same behavior, while I assume alignment to 64 bits for allocated arrays is the default.
Best regards,

  Yvan Fournier

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
  MPI_Status status;

  int l = 5, l_prev = 0;
  int rank_next = MPI_PROC_NULL, rank_prev = MPI_PROC_NULL;
  int rank_id = 0, n_ranks = 1, tag = 1;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

  if (rank_id > 0)
    rank_prev = rank_id - 1;
  if (rank_id + 1 < n_ranks)
    rank_next = rank_id + 1;

#if defined(VARIANT_1)

  int sendbuf[1] = {l};
  int recvbuf[1] = {0};

  if (rank_id % 2 == 0) {
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
  }
  else {
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
  }

  l_prev = recvbuf[0];

#elif defined(VARIANT_2)

  int *sendbuf = malloc(sizeof(int));
  int *recvbuf = malloc(sizeof(int));

  sendbuf[0] = l;

  if (rank_id % 2 == 0) {
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
  }
  else {
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
  }

  l_prev = recvbuf[0];

#else

  if (rank_id % 2 == 0) {
    MPI_Send(&l, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(&l_prev, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
  }
  else {
    MPI_Recv(&l_prev, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&l, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
  }

#endif

  printf("r%d, l=%d\n");

  MPI_Finalize();

  exit(0);
}
Re: [OMPI users] False positives and even failure with OpenMPI and memchecker
Hello,

Yes, as I had hinted in my message, I observed the bug in an irregular manner. Glad to see it could be fixed so quickly (it affects 2.0 too). I had observed it for some time, but only recently took the time to make a proper simplified case and investigate. Guess I should have submitted the issue sooner...

Best regards,

  Yvan Fournier

> Message: 5
> Date: Sat, 5 Nov 2016 22:08:32 +0900
> From: Gilles Gouaillardet
> To: Open MPI Users
> Subject: Re: [OMPI users] False positives and even failure with Open
>         MPI and memchecker
> Message-ID:
> Content-Type: text/plain; charset=UTF-8
>
> that really looks like a bug
>
> if you rewrite your program with
>
>   MPI_Sendrecv(&l, 1, MPI_INT, rank_next, tag, &l_prev, 1, MPI_INT,
>                rank_prev, tag, MPI_COMM_WORLD, &status);
>
> or even
>
>   MPI_Irecv(&l_prev, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &req);
>   MPI_Send(&l, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
>   MPI_Wait(&req, &status);
>
> then there is no more valgrind warning
>
> iirc, Open MPI marks the receive buffer as invalid memory, so it can
> check that only the MPI subroutine updates it. it looks like a step is
> missing in the case of MPI_Recv()
>
> Cheers,
>
> Gilles
>
> On Sat, Nov 5, 2016 at 9:48 PM, Gilles Gouaillardet wrote:
> > Hi,
> >
> > note your printf line is missing.
> > if you printf l_prev, then the valgrind error occurs in all variants
> >
> > at first glance, it looks like a false positive, and i will investigate it
> >
> > Cheers,
> >
> > Gilles
> >
> > On Sat, Nov 5, 2016 at 7:59 PM, Yvan Fournier wrote:
> > > Hello,
> > >
> > > I have observed what seem to be false positives running under Valgrind
> > > when Open MPI is built with --enable-memchecker
> > > (at least with versions 1.10.4 and 2.0.1).
> > >
> > > Attached is a simple test case (extracted from a larger code) that sends
> > > one int to rank r+1, and receives from rank r-1
> > > (using MPI_PROC_NULL to handle ranks below 0 or above comm size).
> > >
> > > Using:
> > >
> > >   ~/opt/openmpi-2.0/bin/mpicc -DVARIANT_1 vg_mpi.c
> > >   ~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out
> > >
> > > I get the following Valgrind error for rank 1:
> > >
> > >   ==8382== Invalid read of size 4
> > >   ==8382==    at 0x400A00: main (in /home/yvan/test/a.out)
> > >   ==8382==  Address 0xffefffe70 is on thread 1's stack
> > >   ==8382==  in frame #0, created by main (???:)
> > >
> > > Using:
> > >
> > >   ~/opt/openmpi-2.0/bin/mpicc -DVARIANT_2 vg_mpi.c
> > >   ~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out
> > >
> > > I get the following Valgrind error for rank 1:
> > >
> > >   ==8322== Invalid read of size 4
> > >   ==8322==    at 0x400A6C: main (in /home/yvan/test/a.out)
> > >   ==8322==  Address 0xcb6f9a0 is 0 bytes inside a block of size 4 alloc'd
> > >   ==8322==    at 0x4C29BBE: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> > >   ==8322==    by 0x400998: main (in /home/yvan/test/a.out)
> > >
> > > I get no error for the default variant (no -DVARIANT_...) with either Open
> > > MPI 2.0.1 or 1.10.4, but do get an error similar to variant 1 in the parent
> > > code from which the example was extracted, which is given below.
> > > Running under Valgrind's gdb server, for the parent code of variant 1,
> > > it even seems the value received on rank 1 is uninitialized, and then
> > > Valgrind complains with the given message.
> > >
> > > The code fails to work as intended when run under Valgrind when Open MPI is
> > > built with --enable-memchecker, while it works fine when run with the same
> > > build but not under Valgrind, or when run under Valgrind with Open MPI built
> > > without memchecker.
> > >
> > > I'm running under Arch Linux (whose packaged Open MPI 1.10.4 is built
> > > with memchecker enabled, rendering it unusable under Valgrind).
> > >
> > > Did anybody else encounter this type of issue, or does my code contain
> > > an obvious mistake that I am missing?
> > > I initially thought of possible alignment issues, but saw nothing in the
> > > standard that requires that [...]
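For reference, a self-contained variant of the test using the MPI_Sendrecv rewrite suggested above could look like the sketch below (same neighbour setup as the original vg_mpi.c; this is an illustrative sketch, not the original attachment):

#include <stdio.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
  MPI_Status status;

  int l = 5, l_prev = 0;
  int rank_next = MPI_PROC_NULL, rank_prev = MPI_PROC_NULL;
  int rank_id = 0, n_ranks = 1, tag = 1;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

  if (rank_id > 0)
    rank_prev = rank_id - 1;
  if (rank_id + 1 < n_ranks)
    rank_next = rank_id + 1;

  /* Combined send/receive avoids the manual even/odd ordering and,
     per the reply above, does not trigger the Valgrind warning. */
  MPI_Sendrecv(&l, 1, MPI_INT, rank_next, tag,
               &l_prev, 1, MPI_INT, rank_prev, tag,
               MPI_COMM_WORLD, &status);

  printf("r%d, l=%d\n", rank_id, l_prev);

  MPI_Finalize();

  return 0;
}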
Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues
Hello,

I am not sure your issues are related, and I have not tested this version of ICS, but I have actually had issues with an Intel compiler build of Open MPI 1.4.3 on a cluster using Westmere processors and Infiniband (Qlogic), using a Debian distribution, with our in-house code (www.code-saturne.org). I am not sure which version of the Intel compiler was used by the administrators though, as both versions 11.? and 12.0 are available.

On the same machine, using environment modules, I can run the code compiled with Intel compilers 11 and Open MPI compiled with GCC 4.4 without issues, but if I switch to the Intel-compiled Open MPI build, I have issues in some functions of the code, including MPI-IO. I did use the wrappers, and they probably have some interaction with environment modules...

I have not investigated further so far: the code is quite complex, environment modules are used, and I am not sure of all the details of the machine. But as you seem to have issues with an Intel compiler, it may be useful to bring this up. (The code is open-source, and I could provide a small test case to anyone interested in testing, but the only thing I am relatively confident of in this case is the code itself: we have had our share of bugs and experience in debugging, and the few times we have had issues similar to this, either bugs in the MPI libraries or conflicts between multiple compilers or libraries have been the origin.)

Best regards,

> Message: 1
> Date: Tue, 3 Jan 2012 16:23:02 +
> From: Richard Walsh
> Subject: Re: [OMPI users] Latest Intel Compilers (ICS, version
>         12.1.0.233 Build 20110811) issues ...
> To: "ljdu...@scinet.utoronto.ca" , Open MPI Users
> Message-ID:
>         <762096c11c5a044a9d92961c2e1a7ce8192a4...@mbox1.flas.csi.cuny.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> Jonathan/All,
>
> Thanks for the information, but I continue to have problems. I dropped the
> 'openib' option to simplify things and focused my attention only on OpenMPI
> version 1.4.4 because you suggested it works.
>
> On the strength of the fact that the PGI 11.10 compiler works fine (all
> systems and all versions of OpenMPI), I ran a PGI build of 1.4.4 with the
> '-showme' option (Intel fails immediately, even with '-showme' ... ). I then
> substituted all the PGI-related strings with Intel-related strings to compile
> directly and explicitly outside the 'opal' wrapper using code and libraries
> in the Intel build tree of 1.4.4, as follows:
>
>   pgcc -o ./hw2.exe hw2.c -I/share/apps/openmpi-pgi/1.4.4/include
>     -L/share/apps/openmpi-pgi/1.4.4/lib -lmpi -lopen-rte -lopen-pal -ldl
>     -Wl,--export-dynamic -lnsl -lutil -ldl
>
> becomes ...
>
>   icc -o ./hw2.exe hw2.c -I/share/apps/openmpi-intel/1.4.4/include
>     -L/share/apps/openmpi-intel/1.4.4/lib -lmpi -lopen-rte -lopen-pal -ldl
>     -Wl,--export-dynamic -lnsl -lutil -ldl
>
> Interestingly, this direct-explicit Intel compile >>WORKS FINE<< (no segment
> fault like with the wrapped version) and the executable produced also
> >>RUNS FINE<<. So ... it looks to me like there is something wrong with using
> the 'opal' wrapper generated-used in the Intel build.
>
> Can someone make a suggestion ... ?? I would like to use the wrappers of
> course.
>
> Thanks,
>
> rbw
>
> Richard Walsh
> Parallel Applications and Systems Manager
> CUNY HPC Center, Staten Island, NY
> W: 718-982-3319
> M: 612-382-4620
>
> Right, as the world goes, is only in question between equals in power, while
> the strong do what they can and the weak suffer what they must.
>    -- Thucydides, 400 BC
>
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of
>       Jonathan Dursi [ljdu...@scinet.utoronto.ca]
> Sent: Tuesday, December 20, 2011 4:48 PM
> To: Open Users
> Subject: Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233
>         Build 20110811) issues ...
>
> For what it's worth, 1.4.4 built with the intel 12.1.0.233 compilers has been
> the default mpi at our centre for over a month and we haven't had any
> problems...
>
>    - jonathan
> --
> Jonathan Dursi; SciNet, Compute/Calcul Canada
>
> -----Original Message-----
> From: Richard Walsh
> Sender: users-boun...@open-mpi.org
> Date: Tue, 20 Dec 2011 21:14:44
> To: Open MPI Users
> Reply-To: Open MPI Users
> Subject: Re: [OMPI users] Latest Intel Compilers (ICS,
>         version 12.1.0.233 Build 20110811) issues ...
>
> All,
>
> I have not heard anything back on the inquiry below, so I take it
> that no one has had any issues with Intel's latest compiler release,
> or perhaps has not tried it yet.
>
> Thanks,
>
> rbw
>
> Richard Walsh
> Parallel Applications and Systems Manager
> CUNY HPC Center, Staten Island, NY
> W: 718-982-3319
> M: 612-382-4620
>
> Right, as the world goes, is only in question between equals in power, while
> the strong do what they can and the weak suffer what they must.
Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
On Jul 10, 2012, at 7:31 AM, Dugenoux Albert wrote:
>
> Hi.
>
> I have recently built a cluster upon a Dell PowerEdge Server with a Debian
> 6.0 OS. This server is composed of 4 system boards of 2 processors of
> hexacores. So it gives 12 cores per system board.
> The boards are linked with a local Gbit switch.
>
> In order to parallelize the software Code Saturne, which is a CFD solver, I
> have configured the cluster such that there are a pbs server/mom on 1 system
> board and 3 mom on the 3 other cards. So this leads to 48 cores dispatched
> on 4 nodes of 12 CPUs. Code Saturne is compiled with the openmpi 1.6 version.
>
> When I launch a simulation using 2 nodes with 12 cores, elapsed time is good
> and network traffic is not full. But when I launch the same simulation using
> 3 nodes with 8 cores, elapsed time is 5 times the previous one. In both
> cases, I use 24 cores and the network seems not to be saturated.
>
> I have tested several configurations: binaries in a local file system or on
> an NFS. But results are the same. I have visited several forums (in particular
> http://www.open-mpi.org/community/lists/users/2009/08/10394.php)
> and read lots of threads, but as I am not an expert at clusters, I presently
> do not see where it is wrong!
>
> Is it a problem in the configuration of PBS (I have installed it from the
> deb packages), a subtle compilation option of openMPI, or a bad network
> configuration?
>
> Regards.
>
> B. S.

Hello,

I am a Code_Saturne developer, so I can confirm a few comments from others on this list:

- Most of the communication of the code is latency-bound: we use iterative linear solvers, which make heavy use of MPI_Allreduce, with only 1 to 3 double precision values per reduction. I do not know if modern "fast" Ethernet variants on a small number of switches make a big difference, but tests made a few years ago on a cluster using a SCALI network (fast/low latency at the time) led to the conclusion that the code performance was divided by 2 on an Ethernet network. These tests need to be updated, but your results seem consistent. (A sketch of this communication pattern is given after this message.)

- Actually, on an Infiniband cluster using Open MPI 1.4.3 (such as the one described here: http://i.top500.org/system/177030), performance tends to be better in some cases when spreading a constant number of cores on more nodes, as the code is quite memory-bandwidth intensive. Depending on the data size on each node, this may be significant or lead to only minor performance differences. The network topology may also affect performance (tests using SLURM's --switches option confirm this), as well as binding processes to cores.

- In recent years, the code has been used and tested mainly on workstations (shared memory), Infiniband clusters, or IBM Blue Gene (L, P, and Q) or a Cray XT (5 and 6) then XE-6 machine. I am interested in trying (or at least trying) to improve performance on Ethernet clusters, and I may have a few suggestions for options you can test, but this conversation should probably move to the Code_Saturne forum (http://code-saturne.org), as we will go into some options of our linear solvers which are specific to that code, not to Open MPI.

Best regards,

  Yvan Fournier
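To make the first point concrete, the pattern in question is essentially the following (an illustrative sketch, not actual Code_Saturne source): each solver iteration reduces only one or a few scalars across all ranks, so on Ethernet the cost is dominated by per-message latency rather than bandwidth.

#include <mpi.h>

/* Global dot product of two distributed vectors (local size n),
   as used at each iteration of an iterative linear solver. */
static double
dot_product_global(const double *x, const double *y, int n)
{
  int i;
  double s_local = 0.0, s_global = 0.0;

  for (i = 0; i < n; i++)
    s_local += x[i] * y[i];

  /* Only one double per reduction: the message is tiny, so the cost is
     almost entirely network latency, repeated at every solver iteration. */
  MPI_Allreduce(&s_local, &s_global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  return s_global;
}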
[OMPI users] Bug in Open MPI 1.2.3 using MPI_Recv with an indexed datatype
Hello,

I seem to have encountered a new bug in Open MPI 1.2.3 using indexed datatypes with MPI_Recv (which seems to be of the "off by one" sort), different from the bug I submitted in 2006, which was corrected since. This bug leads to a segfault, and I have only encountered it on one data set (a relatively large set for a 2-processor run). I have reproduced the segfault on 2 different Linux systems (Debian Sarge on a dual-processor Intel Xeon, Kubuntu 7.04 on a single-processor Centrino system).

A means to reproduce it on 2 ranks can be found at:

  http://yvan.fournier.free.fr/OpenMPI/ompi_datatype_bug_2.tar.gz

(the program is very simple, but the displacements array required to reproduce it is too large for the mailing list).

The program does not print any output, but does not segfault when functioning properly, or when USE_INDEXED_DATATYPE is unset (lines 57-58). It works with LAM 7.1.1 and MPICH2, but fails under Open MPI.

This is a (much) simplified extract from a part of Code_Saturne's FVM library (http://rd.edf.com/code_saturne/), which otherwise works fine on most data using Open MPI.

Best regards,

  Yvan Fournier
[OMPI users] bug in MPI_File_get_position_shared ?
Hello,

I seem to have encountered a bug in MPI-IO, in which MPI_File_get_position_shared hangs when called by multiple processes in a communicator. It can be illustrated by the following simple test case, in which a file is simply created with C I/O, and opened with MPI-IO (defining or undefining MY_MPI_IO_BUG on line 5 enables/disables the bug). From the MPI-2 documentation, it seems that all processes should be able to call MPI_File_get_position_shared, but if more than one process uses it, it fails. Setting the shared pointer helps, but this should not be necessary, and the code still hangs (in more complete code, after writing data).

I encounter the same problem with Open MPI 1.2.6 and MPICH2 1.0.7, so I may have misread the documentation, but I suspect a ROMIO bug.

Best regards,

  Yvan Fournier

/*
 * Parallel file I/O shared pointer bug test
 */

#define MY_MPI_IO_BUG 1

/*
 * Standard C library headers
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <mpi.h>

/*---------------------------------------------------------------------------*/

#ifdef __cplusplus
extern "C" {
#if 0
} /* Fake brace to force Emacs auto-indentation back to column 0 */
#endif
#endif /* __cplusplus */

/*
 * Private function definitions
 */

/*
 * Output MPI error message.
 *
 * This supposes that the default MPI errorhandler is not used
 *
 * parameters:
 *   error_code <-- associated MPI error code
 *
 * returns:
 *   0 in case of success, system error code in case of failure
 */

static void
_mpi_io_error_message(int error_code)
{
  char buffer[MPI_MAX_ERROR_STRING];
  int  buffer_len;

  MPI_Error_string(error_code, buffer, &buffer_len);

  printf("MPI IO error %d: %s", error_code, buffer);
}

/*
 * Return the position of the file pointer.
 *
 * When using MPI-IO with individual file pointers, we consider the file
 * pointer to be equal to the highest value of the individual file pointers.
 *
 * parameters:
 *   fh <-- MPI IO file descriptor
 *
 * returns:
 *   current position of the file pointer
 */

MPI_Offset
_mpi_file_tell(MPI_File fh)
{
  int errcode = MPI_SUCCESS;
  MPI_Offset offset = 0, disp = 0, retval = 0;
  int rank;

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#if defined(MY_MPI_IO_BUG)

  printf("rank %d: will call MPI_File_get_position_shared\n", rank);

  errcode = MPI_File_get_position_shared(fh, &offset);

  if (errcode == MPI_SUCCESS) {
    MPI_File_get_byte_offset(fh, offset, &disp);
    retval = disp;
  }

  printf("rank %d: offsets: %ld %ld\n", rank, (long)offset, (long)disp);

#else

  long aux[2];

  if (rank == 0) {
    printf("root rank will call MPI_File_get_position_shared\n");
    errcode = MPI_File_get_position_shared(fh, &offset);
    if (errcode == MPI_SUCCESS) {
      MPI_File_get_byte_offset(fh, offset, &disp);
      retval = disp;
    }
    aux[0] = disp;
    aux[1] = retval;
  }

  MPI_Bcast(aux, 2, MPI_LONG, 0, MPI_COMM_WORLD);

  disp = aux[0];
  retval = aux[1];

  printf("rank %d: offsets: %ld %ld\n", rank, (long)offset, (long)disp);

#endif

  if (errcode != MPI_SUCCESS)
    _mpi_io_error_message(errcode);

  return retval;
}

/*
 * Unit test
 */

static void
_create_test_data(void)
{
  int i;
  FILE *f;

  char header[80];
  char footer[80];

  sprintf(header, "fvm test file");
  for (i = strlen(header); i < 80; i++)
    header[i] = '\0';

  sprintf(footer, "fvm test file end");
  for (i = strlen(footer); i < 80; i++)
    footer[i] = '\0';

  f = fopen("file_test_data", "w+");

  fwrite(header, 1, 80, f);
  fwrite(footer, 1, 80, f);

  fclose(f);
}

/*---------------------------------------------------------------------------*/

int
main(int argc, char *argv[])
{
  int rank = 0;
  int retval = MPI_SUCCESS;

  MPI_Offset offset;
  MPI_File fh = MPI_FILE_NULL;

  /* Initialization */

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)
    _create_test_data();

  /* Open file */
Re: [OMPI users] bug in MPI_File_get_position_shared ?
Thanks. I had also posted the bug on the MPICH2 list, and received an answer from the ROMIO maintainers: the issue seems to be related to NFS file locking bugs.

I had been testing on an NFS file system, and when I re-tested under a local (ext3) file system, I did not reproduce the bug. I had been experimenting with MPI-IO using explicit offsets, individual pointers, and shared pointers, and have workarounds, so I'll just avoid shared pointers on NFS.

Best regards,

  Yvan Fournier
  EDF R&D

On Sat, 2008-08-16 at 08:19 -0400, users-requ...@open-mpi.org wrote:
> Date: Sat, 16 Aug 2008 08:05:14 -0400
> From: Jeff Squyres
> Subject: Re: [OMPI users] bug in MPI_File_get_position_shared ?
> To: Open MPI Users
> Cc: mpich2-ma...@mcs.anl.gov
> Message-ID: <023f1db0-8e8d-4c8c-8156-80ae52ff0...@cisco.com>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> On Aug 13, 2008, at 7:06 PM, Yvan Fournier wrote:
>
> > I seem to have encountered a bug in MPI-IO, in which
> > MPI_File_get_position_shared hangs when called by multiple processes in
> > a communicator. It can be illustrated by the following simple test case,
> > in which a file is simply created with C IO, and opened with MPI-IO
> > (defining or undefining MY_MPI_IO_BUG on line 5 enables/disables the
> > bug). From the MPI2 documentation, it seems that all processes should be
> > able to call MPI_File_get_position_shared, but if more than one process
> > uses it, it fails. Setting the shared pointer helps, but this should not
> > be necessary, and the code still hangs (in more complete code, after
> > writing data).
> >
> > I encounter the same problem with Open MPI 1.2.6 and MPICH2 1.0.7, so
> > I may have misread the documentation, but I suspect a ROMIO bug.
>
> Bummer. :-(
>
> It would be best to report this directly to the ROMIO maintainers via
> romio-ma...@mcs.anl.gov. They lurk on this list, but they may not be
> paying attention to every mail.
>
> If you wouldn't mind, please CC me on the mail to romio-maint. Thanks!
>
> --
> Jeff Squyres
> Cisco Systems
[OMPI users] MPI IO bug test case for OpenMPI 1.3
Hello,

Some weeks ago, I reported a problem using MPI IO in OpenMPI 1.3, which did not occur with OpenMPI 1.2 or MPICH2. The bug was encountered with the Code_Saturne CFD tool (http://www.code-saturne.org), and seemed to be an issue with individual file pointers, as another mode using explicit offsets worked fine.

I have finally extracted the read pattern from the complete case, so as to generate the simple test case attached. Further testing showed that the bug could be reproduced easily using only part of the read pattern, so I commented most of the patterns from the original case using #if 0 / #endif.

The test should be run with an MPI_COMM_WORLD size of 2. Initially, rank 0 generates a simple binary file using POSIX I/O, containing the values 0, 1, 2, ... (300 blocks of 1024 ints). The file is then opened for reading using MPI IO, and as the values expected at a given offset are easily determined, read values are compared to expected values, and MPI_Abort is called in case of an error.

I also added a USE_FILE_TYPE macro definition, which can be undefined to "turn off" the bug. Basically, I have:

  #ifdef USE_FILE_TYPE
    MPI_Type_hindexed(1, lengths, disps, MPI_BYTE, &file_type);
    MPI_Type_commit(&file_type);
    MPI_File_set_view(fh, offset, MPI_BYTE, file_type, datarep, MPI_INFO_NULL);
  #else
    MPI_File_set_view(fh, offset+disps[0], MPI_BYTE, MPI_BYTE, datarep, MPI_INFO_NULL);
  #endif

    retval = MPI_File_read_all(fh, buf, (int)(lengths[0]), MPI_BYTE, &status);

  #if USE_FILE_TYPE
    MPI_Type_free(&file_type);
  #endif

Using the file type indexed datatype, I exhibit the bug with both versions 1.3.0 and 1.3.2 of OpenMPI.

Best regards,

  Yvan Fournier

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#include <mpi.h>

#define USE_FILE_TYPE 1
/* #undef USE_FILE_TYPE */

static void
_create_test_data(void)
{
  int i, j;
  FILE *f;
  int buf[1024];

  f = fopen("test_data", "w");

  for (i = 0; i < 300; i++) {
    for (j = 0; j < 1024; j++)
      buf[j] = i*1024 + j;
    fwrite(buf, sizeof(int), 1024, f);
  }

  fclose(f);
}

static void
_mpi_io_error_message(int error_code)
{
  char buffer[MPI_MAX_ERROR_STRING];
  int buffer_len;

  MPI_Error_string(error_code, buffer, &buffer_len);

  fprintf(stderr, "MPI IO error: %s\n", buffer);
}

static void
_test_for_corruption(int buf[], int base_offset, int rank_offset, int ni)
{
  int i;
  int n_ints = ni / sizeof(int);
  int int_shift = (base_offset + rank_offset) / sizeof(int);

  for (i = 0; i < n_ints; i++) {
    if (buf[i] != int_shift + i) {
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("i = %d, buf = %d, ref = %d\n", i, buf[i], int_shift + i);
      fprintf(stderr,
              "rank %d, base offset %d, rank offset %d, size %d: corruption\n",
              rank, base_offset, rank_offset, ni);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
  }
}

static void
_read_global_block(MPI_File  fh,
                   int       offset,
                   int       ni)
{
  MPI_Datatype file_type;
  MPI_Aint disps[1];
  MPI_Status status;
  int *buf;
  int lengths[1];
  char datarep[] = "native";
  int retval = 0;

  lengths[0] = ni;
  disps[0] = 0;

  buf = malloc(ni);
  assert(buf != NULL);

  MPI_Type_hindexed(1, lengths, disps, MPI_BYTE, &file_type);
  MPI_Type_commit(&file_type);

  MPI_File_set_view(fh, offset, MPI_BYTE, file_type, datarep, MPI_INFO_NULL);

  retval = MPI_File_read_all(fh, buf, ni, MPI_BYTE, &status);

  MPI_Type_free(&file_type);

  if (retval != MPI_SUCCESS)
    _mpi_io_error_message(retval);

  _test_for_corruption(buf, offset, 0, ni);

  free(buf);
}

static void
_read_block_ip(MPI_File  fh,
               int       offset,
               int       displ,
               int       ni)
{
  int errcode;
  int *buf;
  int lengths[1];
  MPI_Aint disps[1];
  MPI_Status status;
  MPI_Datatype file_type;
  char datarep[] = "native";
  int retval = 0;

  buf = malloc(ni);
  assert(buf != NULL);

  lengths[0] = ni;
  disps[0] = displ;

#ifdef USE_FILE_TYPE
  MPI_Type_hindexed(1, lengths, disps, MPI_BYTE, &file_type);
  MPI_Type_commit(&file_type);
  MPI_File_set_view(fh, offset, MPI_BYTE, file_type, datarep, MPI_INFO_NULL);
#else
  MPI_File_set_view(fh, offset+displ, MPI_BYTE, MPI_BYTE, datarep, MPI_INFO_NULL);
#endif

  retval = MPI_File_read_all(fh, buf, (int)(lengths[0]), MPI_BYTE, &status);

  if (retval != MPI_SUCCESS)
    _mpi_io_error_message(retval);

#if USE_FILE_TYPE
  MPI_Type_free(&file_type);
#endif

  _test_for_corruption(buf, offset, displ, ni);

  free(buf);
}

int
main(int argc, char **argv)
{
  int rank;
  int retval;
  MPI_File fh;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    _create_test_data();
  }
[OMPI users] False positives with OpenMPI and memchecker
Hello,

I obtain false positives with OpenMPI when memchecker is enabled, using OpenMPI 3.0.0. This is similar to an issue I had reported and which had been fixed in Nov. 2016, but it affects MPI_Isend/MPI_Irecv instead of MPI_Send/MPI_Recv. I had not done much additional testing of my application using memchecker since, so I probably missed remaining issues at the time.

In the attached test (which has 2 optional variants relating to whether the send and receive buffers are allocated on the stack or heap, but which exhibit the same basic issue), running

  mpicc -g vg_ompi_isend_irecv.c && mpiexec -n 2 valgrind ./a.out

I have:

  ==19651== Memcheck, a memory error detector
  ==19651== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==19651== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
  ==19651== Command: ./a.out
  ==19651==
  ==19650== Thread 3:
  ==19650== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19650==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19650==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19650==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19650==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19650==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19650==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19650==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19650==
  ==19651== Thread 3:
  ==19651== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19651==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19651==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19651==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19651==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19651==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19651==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19651==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19651==
  ==19650== Thread 1:
  ==19650== Invalid read of size 2
  ==19650==    at 0x4C33BA0: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==19650==    by 0x5A27C85: opal_convertor_pack (in /home/yvan/opt/openmpi-3.0/lib/libopen-pal.so.40.0.0)
  ==19650==    by 0xD177EF1: mca_btl_vader_sendi (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_btl_vader.so)
  ==19650==    by 0xE1A7F31: mca_pml_ob1_send_inline.constprop.4 (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0xE1A8711: mca_pml_ob1_isend (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0x4EB4C83: PMPI_Isend (in /home/yvan/opt/openmpi-3.0/lib/libmpi.so.40.0.0)
  ==19650==    by 0x108B24: main (vg_ompi_isend_irecv.c:63)
  ==19650==  Address 0x1ffefffcc4 is on thread 1's stack
  ==19650==  in frame #6, created by main (vg_ompi_isend_irecv.c:7)

The first 2 warnings seem to relate to initialization, so they are not a big issue, but the last one occurs whenever I use MPI_Isend, so it is a more important issue.

Using a version built without --enable-memchecker, I also have the two initialization warnings, but not the warning from MPI_Isend...

Best regards,

  Yvan Fournier
Re: [OMPI users] False positives with OpenMPI and memchecker (with attachment)
Hello,

Sorry, I forgot the attached test case in my previous message... :(

Best regards,

  Yvan Fournier

----- Mail transferred -----
From: "yvan fournier"
To: users@lists.open-mpi.org
Sent: Sunday January 7 2018 01:43:16
Object: False positives with OpenMPI and memchecker

Hello,

I obtain false positives with OpenMPI when memchecker is enabled, using OpenMPI 3.0.0. This is similar to an issue I had reported and which had been fixed in Nov. 2016, but it affects MPI_Isend/MPI_Irecv instead of MPI_Send/MPI_Recv. I had not done much additional testing of my application using memchecker since, so I probably missed remaining issues at the time.

In the attached test (which has 2 optional variants relating to whether the send and receive buffers are allocated on the stack or heap, but which exhibit the same basic issue), running

  mpicc -g vg_ompi_isend_irecv.c && mpiexec -n 2 valgrind ./a.out

I have:

  ==19651== Memcheck, a memory error detector
  ==19651== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==19651== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
  ==19651== Command: ./a.out
  ==19651==
  ==19650== Thread 3:
  ==19650== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19650==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19650==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19650==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19650==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19650==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19650==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19650==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19650==
  ==19651== Thread 3:
  ==19651== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19651==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19651==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19651==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19651==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19651==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19651==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19651==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19651==
  ==19650== Thread 1:
  ==19650== Invalid read of size 2
  ==19650==    at 0x4C33BA0: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==19650==    by 0x5A27C85: opal_convertor_pack (in /home/yvan/opt/openmpi-3.0/lib/libopen-pal.so.40.0.0)
  ==19650==    by 0xD177EF1: mca_btl_vader_sendi (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_btl_vader.so)
  ==19650==    by 0xE1A7F31: mca_pml_ob1_send_inline.constprop.4 (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0xE1A8711: mca_pml_ob1_isend (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0x4EB4C83: PMPI_Isend (in /home/yvan/opt/openmpi-3.0/lib/libmpi.so.40.0.0)
  ==19650==    by 0x108B24: main (vg_ompi_isend_irecv.c:63)
  ==19650==  Address 0x1ffefffcc4 is on thread 1's stack
  ==19650==  in frame #6, created by main (vg_ompi_isend_irecv.c:7)

The first 2 warnings seem to relate to initialization, so they are not a big issue, but the last one occurs whenever I use MPI_Isend, so it is a more important issue.

Using a version built without --enable-memchecker, I also have the two initialization warnings, but not the warning from MPI_Isend...

Best regards,

  Yvan Fournier

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
  MPI_Request request[2];
  MPI_Status status[2];

  int l = 5, l_prev = 0;
  int rank_next = MPI_PROC_NULL, rank_prev = MPI_PROC_NULL;
  int rank_id = 0, n_ranks = 1, tag = 1;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

  if (rank_id > 0)
    rank_prev = rank_id - 1;
  if (rank_id + 1 < n_ranks)
    rank_next = rank_id + 1;

#if defined(VARIANT_1)

  int sendbuf[1] = {l};
  int recvbuf[1] = {0};

  if (rank_id % 2 == 0) {
    MPI_Isend(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD, &(request[0]));
    MPI_Irecv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &(request[1]));
  }
  else {
    MPI_Irecv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &(request[0]));
    MPI_Isend(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD, &(request[1]));
  }

  MPI_Waitall(2, request, status);

  l_prev = recvbuf[0];

#elif defined(VARIANT_2)

  int *sendbuf = malloc(sizeof(int));
  int *recvbuf = malloc(sizeof(int));

  sendbuf[0] = l;

  if (rank_id % 2 == 0) {
    MPI_Isend(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD, &(request[0]));
    MPI_Irecv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &(request[1]));
  }
  else {
    MPI_Irecv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &(request[0]));
Re: [OMPI users] False positives with OpenMPI and memchecker (seems fixed between 3.0.0 and 3.0.1-rc1)
Hello,

Answering myself here: checking the revision history, commits 3b8b8c52c519f64cb3ff147db49fcac7cbd0e7d7 or 66c9485e77f7da9a212ae67c88a21f95f13e6652 (in master) seem to relate to this, so I checked using the latest downloadable 3.0.x nightly release, and do not reproduce the issue anymore...

Sorry for the (too-late) report...

  Yvan

----- Mail original -----
From: "yvan fournier"
To: users@lists.open-mpi.org
Sent: Sunday January 7 2018 01:52:04
Object: Re: False positives with OpenMPI and memchecker (with attachment)

Hello,

Sorry, I forgot the attached test case in my previous message... :(

Best regards,

  Yvan Fournier

----- Mail transferred -----
From: "yvan fournier"
To: users@lists.open-mpi.org
Sent: Sunday January 7 2018 01:43:16
Object: False positives with OpenMPI and memchecker

Hello,

I obtain false positives with OpenMPI when memchecker is enabled, using OpenMPI 3.0.0. This is similar to an issue I had reported and which had been fixed in Nov. 2016, but it affects MPI_Isend/MPI_Irecv instead of MPI_Send/MPI_Recv. I had not done much additional testing of my application using memchecker since, so I probably missed remaining issues at the time.

In the attached test (which has 2 optional variants relating to whether the send and receive buffers are allocated on the stack or heap, but which exhibit the same basic issue), running

  mpicc -g vg_ompi_isend_irecv.c && mpiexec -n 2 valgrind ./a.out

I have:

  ==19651== Memcheck, a memory error detector
  ==19651== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==19651== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
  ==19651== Command: ./a.out
  ==19651==
  ==19650== Thread 3:
  ==19650== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19650==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19650==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19650==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19650==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19650==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19650==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19650==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19650==
  ==19651== Thread 3:
  ==19651== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
  ==19651==    at 0x5470596: epoll_pwait (in /usr/lib/libc-2.26.so)
  ==19651==    by 0x5A5A9FA: epoll_dispatch (epoll.c:407)
  ==19651==    by 0x5A5EA9A: opal_libevent2022_event_base_loop (event.c:1630)
  ==19651==    by 0x94C96ED: progress_engine (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pmix_pmix2x.so)
  ==19651==    by 0x5163089: start_thread (in /usr/lib/libpthread-2.26.so)
  ==19651==    by 0x547042E: clone (in /usr/lib/libc-2.26.so)
  ==19651==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
  ==19651==
  ==19650== Thread 1:
  ==19650== Invalid read of size 2
  ==19650==    at 0x4C33BA0: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==19650==    by 0x5A27C85: opal_convertor_pack (in /home/yvan/opt/openmpi-3.0/lib/libopen-pal.so.40.0.0)
  ==19650==    by 0xD177EF1: mca_btl_vader_sendi (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_btl_vader.so)
  ==19650==    by 0xE1A7F31: mca_pml_ob1_send_inline.constprop.4 (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0xE1A8711: mca_pml_ob1_isend (in /home/yvan/opt/openmpi-3.0/lib/openmpi/mca_pml_ob1.so)
  ==19650==    by 0x4EB4C83: PMPI_Isend (in /home/yvan/opt/openmpi-3.0/lib/libmpi.so.40.0.0)
  ==19650==    by 0x108B24: main (vg_ompi_isend_irecv.c:63)
  ==19650==  Address 0x1ffefffcc4 is on thread 1's stack
  ==19650==  in frame #6, created by main (vg_ompi_isend_irecv.c:7)

The first 2 warnings seem to relate to initialization, so they are not a big issue, but the last one occurs whenever I use MPI_Isend, so it is a more important issue.

Using a version built without --enable-memchecker, I also have the two initialization warnings, but not the warning from MPI_Isend...

Best regards,

  Yvan Fournier
Re: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1
Hello to all,

I have also encountered a similar bug with MPI-IO with Open MPI 1.3.1, reading a Code_Saturne preprocessed mesh file (www.code-saturne.org). Reading the file can be done using 2 MPI-IO modes, or one non-MPI-IO mode.

The first MPI-IO mode uses individual file pointers, and involves a series of MPI_File_read_all calls with all ranks using the same view (for record headers), interlaced with MPI_File_read_all calls with ranks using different views (for record data, successive blocks being read by each rank). The second MPI-IO mode uses explicit file offsets, with MPI_File_read_at_all instead of MPI_File_read_all.

Both MPI-IO modes seem to work fine with OpenMPI 1.2, MPICH2, and variants on IBM Blue Gene/L and P, as well as Bull Novascale, but with OpenMPI 1.3.1, data read seems to be corrupt on at least one file using the individual file pointers approach (though it works well using explicit offsets).

The bug does not appear in unit tests, and it only appears after several records are read on the case that does fail (on 2 ranks), so to reproduce it with a simple program, I would have to extract the exact file access patterns from the exact case which fails, which would require a few extra hours of work. If the bug is not reproduced in a simpler manner first, I will try to build a simple program reproducing the bug within a week or 2, but in the meantime, I just want to confirm Scott's observation (hoping it is the same bug).

Best regards,

  Yvan Fournier

On Mon, 2009-04-06 at 16:03 -0400, users-requ...@open-mpi.org wrote:
> Date: Mon, 06 Apr 2009 12:16:18 -0600
> From: Scott Collis
> Subject: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1
> To: us...@open-mpi.org
> Message-ID:
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> I have been a user of MPI-IO for 4+ years and have a code that has run
> correctly with MPICH, MPICH2, and OpenMPI 1.2.*
>
> I recently upgraded to OpenMPI 1.3.1 and immediately noticed that my
> MPI-IO generated output files are corrupted. I have not yet had a
> chance to debug this in detail, but it appears that
> MPI_File_write_all() commands are not placing information correctly on
> their file_view when running with more than 1 processor (everything is
> okay with -np 1).
>
> Note that I have observed the same incorrect behavior on both Linux
> and OS-X. I have also gone back and made sure that the same code
> works with MPICH, MPICH2, and OpenMPI 1.2.* so I'm fairly confident
> that something has been changed or broken as of OpenMPI 1.3.*. Just
> today, I checked out the SVN repository version of OpenMPI and built
> and tested my code with that and the results are incorrect just as for
> the 1.3.1 tarball.
>
> While I plan to continue to debug this and will try to put together a
> small test that demonstrates the issue, I thought that I would first
> send out this message to see if this might trigger a thought within
> the OpenMPI development team as to where this issue might be.
>
> Please let me know if you have any ideas as I would very much
> appreciate it!
>
> Thanks in advance,
>
> Scott
> --
> Scott Collis
> sscol...@me.com
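For reference, the individual-file-pointer mode described above follows roughly the pattern sketched here. The record header layout, sizes, and block partitioning are simplified assumptions for illustration; this is not the actual FVM reader code.

#include <stdlib.h>
#include <mpi.h>

/* Read one record: all ranks collectively read the header through a
   common view, then each rank reads its own block of the record data
   through a per-rank view (individual file pointers). */
static MPI_Offset
read_record(MPI_File fh, MPI_Offset offset, int rank, int n_ranks)
{
  MPI_Status status;
  long header[2];            /* hypothetical: {record id, data size in bytes} */
  long block_size, block_start;
  char *buf;

  /* Phase 1: same view on all ranks, collective read of the header. */
  MPI_File_set_view(fh, offset, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
  MPI_File_read_all(fh, header, 2, MPI_LONG, &status);
  offset += 2 * sizeof(long);

  /* Phase 2: record data split into successive per-rank blocks;
     each rank shifts its view to the start of its own block. */
  block_size  = (header[1] + n_ranks - 1) / n_ranks;
  block_start = (long)rank * block_size;
  if (block_start + block_size > header[1])
    block_size = header[1] - block_start;
  if (block_size < 0)
    block_size = 0;

  buf = malloc(block_size > 0 ? block_size : 1);

  MPI_File_set_view(fh, offset + block_start, MPI_BYTE, MPI_BYTE,
                    "native", MPI_INFO_NULL);
  MPI_File_read_all(fh, buf, (int)block_size, MPI_BYTE, &status);

  free(buf);

  return offset + header[1];   /* offset of the next record header */
}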
[OMPI users] Datatype bug regression from Open MPI 1.0.2 to Open MPI 1.1
Hello,

I had encountered a bug in Open MPI 1.0.1 using indexed datatypes with MPI_Recv (which seems to be of the "off by one" sort), which was corrected in Open MPI 1.0.2.

It seems to have resurfaced in Open MPI 1.1 (I encountered it using different data and did not recognize it immediately, but it seems it can be reproduced using the same simplified test I had sent the first time, which I re-attach here just in case).

Here is a summary of the case:

--

Each processor reads a file ("data_p0" or "data_p1") giving a list of global element ids. Some elements (vertices from a partitioned mesh) may belong to both processors, so their ids may appear on both processors: we have 7178 global vertices, 3654 and 3688 of them being known by ranks 0 and 1 respectively.

In this simplified version, we assign coordinates {x, y, z} to each vertex equal to its global id number for rank 1, and the negative of that for rank 0 (assigning the same values to x, y, and z). After finishing the "ordered gather", rank 0 prints the global id and coordinates of each vertex.

Lines should print (for example) as:

  6456 ; 6455.0 6455.0 6456.0
  6457 ; -6457.0 -6457.0 -6457.0

depending on whether a vertex belongs only to rank 0 (negative coordinates) or belongs to rank 1 (positive coordinates).

With the OMPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on Debian Sarge with gcc 3.4), we have for example for the last vertices:

  7176 ; 7175.0 7175.0 7176.0
  7177 ; 7176.0 7176.0 7177.0

seeming to indicate an "off by one" type bug in datatype handling.

Not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE in the gather_test.c file), the bug disappears.

--

Best regards,

  Yvan Fournier

Attachment: ompi_datatype_bug.tar.gz (application/compressed-tar)
Re: [OMPI users] users Digest, Vol 328, Issue 1
Hello,

I just retried replicating the datatype bug on a SUSE Linux 10.1 system (on a 32-bit Pentium-M system). Actually, I even get a segmentation fault at some point. I attach the logfile for the test case compiled in debug mode, run once directly, then again with valgrind, as well as my ompi_info output.

I have also encountered the bug on the "parent" case (similar, but more complex) on my work machine (dual Xeon under Debian Sarge), but I'll check this simpler test on it just in case.

Best regards,

  Yvan Fournier

On Sun, 2006-07-09 at 12:00 -0400, users-requ...@open-mpi.org wrote:
> Send users mailing list submissions to
>         us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>         users-requ...@open-mpi.org
>
> You can reach the person managing the list at
>         users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
> Today's Topics:
>
>    1. Re: Datatype bug regression from Open MPI 1.0.2 to Open MPI
>       1.1 (George Bosilca)
>
> --
>
> Message: 1
> Date: Sat, 8 Jul 2006 13:47:05 -0400 (Eastern Daylight Time)
> From: George Bosilca
> Subject: Re: [OMPI users] Datatype bug regression from Open MPI 1.0.2
>         to Open MPI 1.1
> To: Open MPI Users
> Message-ID:
> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>
> Yvan,
>
> I'm unable to replicate this one with the latest Open MPI trunk version.
> As there is no difference between the trunk and the latest 1.1 version on
> the datatype, I think the bug cannot be reproduced using the 1.1 either. I
> compiled the test twice, once using the indexed datatype and once without,
> and the output is exactly the same. I ran it on my Apple G5 desktop as
> well as on a cluster of AMD 64, over shared memory and TCP. Can you please
> recheck that your error is coming from the indexed type.
>
>    Thanks,
>      george.
>
> On Sat, 1 Jul 2006, Yvan Fournier wrote:
>
> > Hello,
> >
> > I had encountered a bug in Open MPI 1.0.1 using indexed datatypes
> > with MPI_Recv (which seems to be of the "off by one" sort), which
> > was corrected in Open MPI 1.0.2.
> >
> > It seems to have resurfaced in Open MPI 1.1 (I encountered it using
> > different data and did not recognize it immediately, but it seems
> > it can be reproduced using the same simplified test I had sent
> > the first time, which I re-attach here just in case).
> >
> > Here is a summary of the case:
> >
> > --
> >
> > Each processor reads a file ("data_p0" or "data_p1") giving a list of
> > global element ids. Some elements (vertices from a partitioned mesh)
> > may belong to both processors, so their ids may appear on both
> > processors: we have 7178 global vertices, 3654 and 3688 of them being
> > known by ranks 0 and 1 respectively.
> >
> > In this simplified version, we assign coordinates {x, y, z} to each
> > vertex equal to its global id number for rank 1, and the negative of
> > that for rank 0 (assigning the same values to x, y, and z). After
> > finishing the "ordered gather", rank 0 prints the global id and
> > coordinates of each vertex.
> >
> > Lines should print (for example) as:
> >   6456 ; 6455.0 6455.0 6456.0
> >   6457 ; -6457.0 -6457.0 -6457.0
> > depending on whether a vertex belongs only to rank 0 (negative
> > coordinates) or belongs to rank 1 (positive coordinates).
> >
> > With the OMPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on
> > Debian Sarge with gcc 3.4), we have for example for the last vertices:
> >   7176 ; 7175.0 7175.0 7176.0
> >   7177 ; 7176.0 7176.0 7177.0
> > seeming to indicate an "off by one" type bug in datatype handling.
> >
> > Not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE
> > in the gather_test.c file), the bug disappears.
> >
> > --
> >
> > Best regards,
> >
> >   Yvan Fournier
>
> "We must accept finite disappointment, but we must never lose infinite
> hope."
>   Martin Luther King