[OMPI users] [version 2.1.5] invalid memory reference

2018-09-18 Thread Patrick Begou

Hi

I'm moving a large CFD code from GCC 4.8.5 / OpenMPI 1.7.3 to GCC 7.3.0 / OpenMPI
2.1.5, and with this new configuration I get random segfaults.
Same binary, same server, same number of processes (16), same parameters for the
run. Sometimes it runs to the end, sometimes I get an 'invalid memory reference'.


Building the application and OpenMPI in debug mode, I saw that this random
segfault always occurs in collective communications inside OpenMPI. I have no idea
how to track this down. Here are two call stack traces (just the OpenMPI part):


Calling MPI_ALLREDUCE(...)
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f01937022ef in ???
#1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
    at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
#2  0x7f0192dd0331 in mca_btl_vader_component_progress
    at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
#3  0x7f0192d6b92b in opal_progress
    at ../../opal/runtime/opal_progress.c:226
#4  0x7f0194a8a9a4 in sync_wait_st
    at ../../opal/threads/wait_sync.h:80
#5  0x7f0194a8a9a4 in ompi_request_default_wait_all
    at ../../ompi/request/req_wait.c:221
#6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
    at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
#7  0x7f0194aa0a0a in PMPI_Allreduce
    at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
#8  0x7f0194f2e2ba in ompi_allreduce_f
    at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
#9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
    at linear_solver_deflation_m.f90:341


Calling MPI_WAITALL()

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7fda5a8d72ef in ???
#1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
    at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
#2  0x7fda59fa5331 in mca_btl_vader_component_progress
    at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
#3  0x7fda59f4092b in opal_progress
    at ../../opal/runtime/opal_progress.c:226
#4  0x7fda5bc5f9a4 in sync_wait_st
    at ../../opal/threads/wait_sync.h:80
#5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
    at ../../ompi/request/req_wait.c:221
#6  0x7fda5bca329e in PMPI_Waitall
    at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
#7  0x7fda5c10bc00 in ompi_waitall_f
    at /kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
#8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
    at data_comm_m.f90:5849


The segfault is always located in opal/mca/btl/vader/btl_vader_fbox.h at:

207        /* call the registered callback function */
208        reg->cbfunc(&mca_btl_vader.super, hdr.data.tag, &desc, reg->cbdata);
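
As a quick check rather than a fix, the same run can be repeated with the vader
shared-memory BTL excluded so that intra-node traffic falls back to another BTL;
if the crash disappears, that points at the fast-box path shown above. The launch
line below is only a sketch: the binary name and its arguments are placeholders.

  # Hypothetical launch command -- substitute the real application and arguments.
  # Excluding the vader component bypasses the mca_btl_vader_check_fboxes code
  # path seen in both backtraces (at the cost of slower intra-node traffic).
  mpirun -np 16 --mca btl ^vader ./my_cfd_app <arguments>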



OpenMPI 2.1.5 is built with:

CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native -mtune=native" \
FCFLAGS="-O3 -march=native -mtune=native" \
../configure --prefix=$DESTMPI --enable-mpirun-prefix-by-default --disable-dlopen \
    --enable-mca-no-build=openib --without-verbs --enable-mpi-cxx --without-slurm \
    --enable-mpi-thread-multiple --enable-debug --enable-mem-debug
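
For completeness, ompi_info (installed alongside mpirun) can confirm what ended up
in that build; the grep patterns below are just examples:

  # Check which BTL components were built and whether debug support is enabled.
  ompi_info | grep -i btl
  ompi_info | grep -i debug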


Any help appreciated

Patrick

--
===
|  Equipe M.O.S.T. |  |
|  Patrick BEGOU   | mailto:patrick.be...@grenoble-inp.fr |
|  LEGI|  |
|  BP 53 X | Tel 04 76 82 51 35   |
|  38041 GRENOBLE CEDEX| Fax 04 76 82 52 71   |
===

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [version 2.1.5] invalid memory reference

2018-09-18 Thread George Bosilca
A few days ago we pushed a fix to master for a strikingly similar issue.
The patch will eventually make it into the 4.0 and 3.1 releases, but not into the
2.x series. The best path forward is to migrate to a more recent OMPI version.
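
For reference, a rebuild against a newer series can reuse the configure options from
the 2.1.5 build. In the sketch below the 3.1.2 tarball is only an example of the
download layout and the install prefix is a placeholder; check the release notes for
a version that actually contains the patch.

  # Placeholder version and prefix -- pick a release that includes the fix.
  wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz
  tar xzf openmpi-3.1.2.tar.gz && cd openmpi-3.1.2
  mkdir build && cd build
  ../configure --prefix=$HOME/opt/openmpi-3.1.2 --enable-mpirun-prefix-by-default \
      --disable-dlopen --enable-mca-no-build=openib --without-verbs --enable-mpi-cxx \
      --without-slurm --enable-mpi-thread-multiple
  make -j 16 all && make install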

George.



[OMPI users] Fwd: OpenMPI build fails on Windows Subsystem for Linux (WSL).

2018-09-18 Thread Oleg Kmechak
Hello,

I am a physics student at the University of Warsaw and new to OpenMPI.
I am currently trying to compile it from source code (I tried both the GitHub
repository and the 3.1.2 tar release).
I am using the Windows Subsystem for Linux (WSL) with Ubuntu.

uname -a:
Linux Canopus 4.4.0-17134-Microsoft #285-Microsoft Thu Aug 30 17:31:00 PST 2018 x86_64 x86_64 x86_64 GNU/Linux

I have followed all the steps suggested in the INSTALL and HACKING files and
installed the required tools in the proper order: M4 (1.4.18), autoconf (2.69),
automake (1.15.1), libtool (2.4.6), flex (2.6.4).

Next I set AUTOMAKE_JOBS=4 and ran:

./autogen.pl    # only needed for the source code from GitHub

then

./configure --disable-picky --enable-mpi-cxx --without-cma --enable-static

(I added --without-cma because I got a lot of warnings about the asprintf
function while compiling.)

and finally:

make -j 4 all    # because I have 4 logical processors

The build fails in both versions (GitHub and the 3.1.2 tar release).

GitHub version error:

../../../../opal/mca/hwloc/hwloc201/hwloc/include/hwloc.h:71:10: fatal error:
hwloc/bitmap.h: No such file or directory
 #include <hwloc/bitmap.h>

And the tar (3.1.2) version:

libtool: error: cannot find the library '../../ompi/libmpi.la' or unhandled
argument '../../ompi/libmpi.la'
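
In case it is useful, here is a minimal clean-build sketch for the 3.1.2 tarball
(the install prefix is an assumption; the configure options are the ones used above).
The release tarball already ships a generated configure script, so autogen.pl is not
needed, and building in a separate directory rules out leftovers from earlier attempts.

  # Assumed prefix; configure flags copied from the attempt described above.
  tar xzf openmpi-3.1.2.tar.gz
  cd openmpi-3.1.2
  mkdir build && cd build
  ../configure --prefix=$HOME/opt/openmpi-3.1.2 --disable-picky --enable-mpi-cxx \
      --without-cma --enable-static
  make -j 4 all
  make install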

Please also see the full log in the attachment.
Thanks, I hope you can help (I have already spent a lot of time on this :) ).


PS: if this is a bug or an unimplemented feature (WSL is probably quite a
specific platform), should I raise an issue on the GitHub project?


Regards, Oleg Kmechak


OpenMPI_log.7z
Description: application/7z