[OMPI users] False positives and even failure with Open MPI and memchecker
Hello,

I have observed what seem to be false positives when running under Valgrind with Open MPI built with --enable-memchecker (at least with versions 1.10.4 and 2.0.1).

Attached is a simple test case (extracted from a larger code) that sends one int to rank r+1 and receives one from rank r-1 (using MPI_PROC_NULL to handle ranks below 0 or above the communicator size).

Using:

~/opt/openmpi-2.0/bin/mpicc -DVARIANT_1 vg_mpi.c
~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out

I get the following Valgrind error for rank 1:

==8382== Invalid read of size 4
==8382==    at 0x400A00: main (in /home/yvan/test/a.out)
==8382==  Address 0xffefffe70 is on thread 1's stack
==8382==  in frame #0, created by main (???:)

Using:

~/opt/openmpi-2.0/bin/mpicc -DVARIANT_2 vg_mpi.c
~/opt/openmpi-2.0/bin/mpiexec -output-filename vg_log -n 2 valgrind ./a.out

I get the following Valgrind error for rank 1:

==8322== Invalid read of size 4
==8322==    at 0x400A6C: main (in /home/yvan/test/a.out)
==8322==  Address 0xcb6f9a0 is 0 bytes inside a block of size 4 alloc'd
==8322==    at 0x4C29BBE: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8322==    by 0x400998: main (in /home/yvan/test/a.out)

I get no error for the default variant (no -DVARIANT_...) with either Open MPI 2.0.1 or 1.10.4, but I do get an error similar to variant 1 in the parent code from which the example given below was extracted. Running under Valgrind's gdb server on the parent code of variant 1, it even seems the value received on rank 1 is uninitialized, and Valgrind then complains with the given message.

The code fails to work as intended when run under Valgrind with an Open MPI built with --enable-memchecker, while it works fine when run with the same build but not under Valgrind, or when run under Valgrind with an Open MPI built without memchecker.

I'm running under Arch Linux (whose packaged Open MPI 1.10.4 is built with memchecker enabled, rendering it unusable under Valgrind).

Did anybody else encounter this type of issue, or does my code contain an obvious mistake that I am missing? I initially thought of possible alignment issues, but saw nothing in the standard that requires alignment, and the "malloc"-based variant exhibits the same behavior, while I assume 64-bit alignment for allocated arrays is the default.
Best regards,

Yvan Fournier

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Status status;

  int l = 5, l_prev = 0;
  int rank_next = MPI_PROC_NULL, rank_prev = MPI_PROC_NULL;
  int rank_id = 0, n_ranks = 1, tag = 1;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

  if (rank_id > 0)
    rank_prev = rank_id - 1;
  if (rank_id + 1 < n_ranks)
    rank_next = rank_id + 1;

#if defined(VARIANT_1)

  int sendbuf[1] = {l};
  int recvbuf[1] = {0};

  if (rank_id % 2 == 0) {
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
  }
  else {
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
  }

  l_prev = recvbuf[0];

#elif defined(VARIANT_2)

  int *sendbuf = malloc(sizeof(int));
  int *recvbuf = malloc(sizeof(int));

  sendbuf[0] = l;

  if (rank_id % 2 == 0) {
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
  }
  else {
    MPI_Recv(recvbuf, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
    MPI_Send(sendbuf, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
  }

  l_prev = recvbuf[0];

#else

  if (rank_id % 2 == 0) {
    MPI_Send(&l, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(&l_prev, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
  }
  else {
    MPI_Recv(&l_prev, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&l, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
  }

#endif

  printf("r%d, l=%d\n");

  MPI_Finalize();

  exit(0);
}
Re: [OMPI users] False positives and even failure with Open MPI and memchecker
Hi,

Note that your printf line is missing its arguments. If you do print l_prev, then the Valgrind error occurs in all variants.

At first glance it looks like a false positive, and I will investigate it.

Cheers,

Gilles

On Sat, Nov 5, 2016 at 7:59 PM, Yvan Fournier wrote:
> Hello,
>
> I have observed what seem to be false positives when running under Valgrind
> with Open MPI built with --enable-memchecker
> (at least with versions 1.10.4 and 2.0.1).
>
> Attached is a simple test case (extracted from a larger code) that sends one
> int to rank r+1 and receives one from rank r-1
> (using MPI_PROC_NULL to handle ranks below 0 or above the communicator size).
>
> [...]
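For reference, the corrected print statement (assuming the intent was to print the current rank and the value received from the previous rank, matching the two %d conversions in the format string) would simply supply the two missing arguments:

/* Hypothetical fix for the test case's printf: pass the rank and the
   received value as the arguments for the two %d conversions. */
printf("r%d, l=%d\n", rank_id, l_prev);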
Re: [OMPI users] False positives and even failure with Open MPI and memchecker
That really looks like a bug.

If you rewrite your program with

MPI_Sendrecv(&l, 1, MPI_INT, rank_next, tag,
             &l_prev, 1, MPI_INT, rank_prev, tag,
             MPI_COMM_WORLD, &status);

or even

MPI_Irecv(&l_prev, 1, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &req);
MPI_Send(&l, 1, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
MPI_Wait(&req, &status);

then there is no more Valgrind warning.

IIRC, Open MPI marks the receive buffer as invalid memory so it can check that only MPI subroutines update it. It looks like a step is missing in the case of MPI_Recv().

Cheers,

Gilles

On Sat, Nov 5, 2016 at 9:48 PM, Gilles Gouaillardet wrote:
> Hi,
>
> Note that your printf line is missing its arguments. If you do print l_prev, then the Valgrind error occurs in all variants.
>
> At first glance it looks like a false positive, and I will investigate it.
>
> Cheers,
>
> Gilles
>
> [...]
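For completeness, here is a minimal, self-contained sketch of the MPI_Sendrecv-based workaround described above, reusing the variable names of the attached test case. It is an illustration of the suggested rewrite under those assumptions, not the original program:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Status status;
  int l = 5, l_prev = 0;
  int rank_next = MPI_PROC_NULL, rank_prev = MPI_PROC_NULL;
  int rank_id = 0, n_ranks = 1, tag = 1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

  if (rank_id > 0)
    rank_prev = rank_id - 1;
  if (rank_id + 1 < n_ranks)
    rank_next = rank_id + 1;

  /* Combined send/receive: send l to rank_next and receive l_prev from
     rank_prev in a single call; MPI_PROC_NULL neighbors turn the
     corresponding half of the exchange into a no-op, so no even/odd
     ordering of Send and Recv is needed. */
  MPI_Sendrecv(&l, 1, MPI_INT, rank_next, tag,
               &l_prev, 1, MPI_INT, rank_prev, tag,
               MPI_COMM_WORLD, &status);

  printf("r%d, l=%d\n", rank_id, l_prev);

  MPI_Finalize();
  return 0;
}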
Re: [OMPI users] False positives and even failure with Open MPI and memchecker
So it seems we took some shortcuts in pml/ob1.

The attached patch (for the v1.10 branch) should fix this issue.

Cheers,

Gilles

On Sat, Nov 5, 2016 at 10:08 PM, Gilles Gouaillardet wrote:
> That really looks like a bug.
>
> [...]
>
> IIRC, Open MPI marks the receive buffer as invalid memory so it can check that only MPI subroutines update it. It looks like a step is missing in the case of MPI_Recv().
>
> [...]

diff --git a/ompi/mca/pml/ob1/pml_ob1_irecv.c b/ompi/mca/pml/ob1/pml_ob1_irecv.c
index 56826a2..97a6a38 100644
--- a/ompi/mca/pml/ob1/pml_ob1_irecv.c
+++ b/ompi/mca/pml/ob1/pml_ob1_irecv.c
@@ -30,6 +30,7 @@
 #include "pml_ob1_recvfrag.h"
 #include "ompi/peruse/peruse-internal.h"
 #include "ompi/message/message.h"
+#include "ompi/memchecker.h"
 
 mca_pml_ob1_recv_request_t *mca_pml_ob1_recvreq = NULL;
 
@@ -128,6 +129,17 @@ int mca_pml_ob1_recv(void *addr,
 
     rc = recvreq->req_recv.req_base.req_ompi.req_status.MPI_ERROR;
 
+    if (recvreq->req_recv.req_base.req_pml_complete) {
+        /* make buffer defined when the request is completed,
+           and before releasing the objects. */
+        MEMCHECKER(
+            memchecker_call(&opal_memchecker_base_mem_defined,
+                            recvreq->req_recv.req_base.req_addr,
+                            recvreq->req_recv.req_base.req_count,
+                            recvreq->req_recv.req_base.req_datatype);
+        );
+    }
+
 #if OMPI_ENABLE_THREAD_MULTI
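For readers unfamiliar with the mechanism: Open MPI's memchecker support is built on Valgrind's client requests from <valgrind/memcheck.h>. The sketch below is a generic illustration of how such a layer can mark a receive buffer; the function names are purely illustrative and this is not Open MPI's actual code, which goes through memchecker_call()/opal_memchecker_base_mem_defined as in the patch above. The patch adds the equivalent of the second step to the blocking MPI_Recv() path.

#include <stddef.h>
#include <valgrind/memcheck.h>

/* Illustrative sketch only: mark a posted receive buffer off-limits, then
   mark it defined again once the receive has completed. */

static void recv_buffer_posted(void *buf, size_t len)
{
  /* While the receive is pending, the application must not touch the
     buffer, so make any access an error that Valgrind will report. */
  VALGRIND_MAKE_MEM_NOACCESS(buf, len);
}

static void recv_buffer_completed(void *buf, size_t len)
{
  /* The message has arrived: mark the buffer addressable and defined so
     later application reads are legal.  Skipping this step leaves valid
     data flagged as inaccessible, producing "Invalid read" reports like
     the ones in this thread. */
  VALGRIND_MAKE_MEM_DEFINED(buf, len);
}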
Re: [OMPI users] False positives and even failure with Open MPI and memchecker
Hello,

Yes, as I had hinted in my message, I observed the bug in an irregular manner. Glad to see it could be fixed so quickly (it affects 2.0 too).

I had observed it for some time, but only recently took the time to make a proper simplified case and investigate. Guess I should have submitted the issue sooner...

Best regards,

Yvan Fournier

> Message: 5
> Date: Sat, 5 Nov 2016 22:08:32 +0900
> From: Gilles Gouaillardet
> To: Open MPI Users
> Subject: Re: [OMPI users] False positives and even failure with Open MPI and memchecker
>
> That really looks like a bug.
>
> [...]
>
> IIRC, Open MPI marks the receive buffer as invalid memory so it can check that only MPI subroutines update it. It looks like a step is missing in the case of MPI_Recv().
>
> [...]

> Message: 6
> Date: Sat, 5 Nov 2016 23:12:54 +0900
> From: Gilles Gouaillardet
> To: Open MPI Users
> Subject: Re: [OMPI users] False positives and even failure with Open MPI and memchecker
>
> So it seems we took some shortcuts in pml/ob1.
>
> The attached patch (for the v1.10 branch) should fix this issue.