Dear all,

I am struggling with an invalid memory reference when calling SUB EXC_MPI (in MOD01), precisely at the MPI_StartAll call (see the comment in the code below).
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
! ********** file mod01.f90 ************ !
MODULE MOD01

  implicit none
  include 'mpif.h'
  ! alternatively
  ! use mpi
  ! implicit none
  PRIVATE
  ! ... module-level declarations (ni, ns, m, l, PN0, PN1, PN2, irank,
  !     messlengthup, messlengthdo, nend, ...) omitted for brevity

  INTERFACE exc_mpi
    MODULE PROCEDURE exc_mpi
  END INTERFACE
  PUBLIC exc_mpi

CONTAINS

  subroutine exc_mpi (X)
    !! send and receive from procs PN0 <-> PN1 and PN0 <-> PN2
    real, dimension (ni:ns, m, l), intent(inout) :: X

    logical, save :: frstime = .true.
    integer, save :: mpitype_sn, mpitype_sp, mpitype_rn, mpitype_rp
    integer, save :: requests(4), reqcount
    integer       :: istatus(MPI_STATUS_SIZE,4), ierr

    if (frstime) then
      call exc_init()
      frstime = .false.
    end if

    call MPI_StartAll(reqcount, requests, ierr)        !! <-- segfault here
    call MPI_WaitAll(reqcount, requests, istatus, ierr)

    return

  contains

    subroutine exc_init

      integer :: i0, ierrs(12), ktag

      reqcount = 0
      ierrs = 0
      ktag = 1
      ! find i0

      if ( condition1 ) then
        ! send to PN2
        call MPI_Type_Vector(m*l, messlengthup(PN2), ns-ni+1, MPI_REAL, mpitype_sn, ierrs(1))
        call MPI_Type_Commit(mpitype_sn, ierrs(3))
        call MPI_Send_Init(X(i0, 1, 1), 1, mpitype_sn, PN2-1, ktag, MPI_COMM_WORLD, requests(reqcount+1), ierrs(5))
        ! receive from PN2
        call MPI_Type_Vector(m*l, messlengthdo(PN0), ns-ni+1, MPI_REAL, mpitype_rn, ierrs(2))
        call MPI_Type_Commit(mpitype_rn, ierrs(4))
        call MPI_Recv_Init(X(nend(irank)+1, 1, 1), 1, mpitype_rn, PN2-1, ktag+1, MPI_COMM_WORLD, requests(reqcount+2), ierrs(6))
        reqcount = reqcount + 2
      end if

      if ( condition2 ) then
        ! send and receive PN0 <-> PN1
        reqcount = reqcount + 2
      end if

      return
    end subroutine exc_init

  end subroutine exc_mpi
! ...
END MODULE MOD01
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

The calls are coming from this other module in a separate file:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
! ********** file mod02.f90 ************ !
MODULE MOD02

  use MOD01, only: exc_mpi
  IMPLICIT NONE
  include 'mpif.h'
  ! alternatively
  ! use mpi
  ! implicit none
  PRIVATE
  ! ... module-level declarations (nl, nr, ni, ns, m, l, icount, F3, ierr, ...) omitted

  INTERFACE MYSUB
    MODULE PROCEDURE MYSUB
  END INTERFACE
  PUBLIC MYSUB

CONTAINS

  SUBROUTINE MYSUB (Y)

    IMPLICIT NONE
    REAL, INTENT(INOUT) :: Y(nl:nr, m, l)   ! ni<=nl, nr>=ns
    REAL, ALLOCATABLE, DIMENSION(:,:,:) :: Y0
    !...
    allocate ( Y0(ni-1:ns, 1:m, 1:l) )

    DO i = 1, icount
      Y0(nl:nr,:,:) = F3(:,:,:)
      call exc_mpi ( Y0(ni:ns, :, :) )      ! <-- segfault here
      call mpi_barrier (mpi_comm_world, ierr)
      Y0(ni-1,:,:) = 0.
      CALL SUB01
    END DO

    deallocate (Y0)
    RETURN

  CONTAINS

    SUBROUTINE SUB01
      !...
      FRE: DO iterm = 1, m
        DIR: DO iterl = 1, l
          DO itern = nl, nr
            ! Y(itern, iterm, iterl) = some_lin_combination(Y0)
          END DO
        END DO DIR
      END DO FRE
    END SUBROUTINE SUB01
    ! ...

  END SUBROUTINE MYSUB

END MODULE MOD02
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

The segmentation fault is raised at runtime when MAIN (actually a sub in another module) calls MYSUB (in MOD02) for the second time, i.e. when MPI_StartAll runs without re-initialisation. It is an invalid memory reference, yet I am sure the array bounds do not change between calls.

The error is not systematic: the program works as long as the job is split over no more than a certain number of processes, say NPMAX, which depends on the size of the decomposed array (the bigger the array, the higher NPMAX). With more processes than NPMAX, the program segfaults. The same issue arises with [gfortran+ompi] and [gfortran+mpich], while [ifort+mpich] does not always segfault but one process may hang indefinitely. So I bet it is not strictly an Open MPI issue; apologies for posting here anyway. It is not tied to a single version either: the behaviour is the same on deb-jessie, ubuntu 14 and a personal 2.0.1 build (I can share config.log if necessary). The only thing the systems have in common is glibc (2.19, distro stable).
Actually, the backtrace from ifort+mpich lists libpthread.so. I have not tried alternative C libraries, nor the newest glibc. Intel virtual threading is enabled on all three machines (one mini HPC and two PCs). The error is not reported on "serious" architectures like NEC, Sun (ifort+ompi) and IBM. I am trying to find a possible MPI workaround for deb-based systems while keeping as much efficiency as possible.

As can be seen, MOD02 passes the exchange procedure (MOD01) a sliced array Y0(ni:ns,:,:), which is non-contiguous. But I should not have to worry about that, since MPI_Type_Vector is expected to do the remapping for me. One way I can almost overcome the fault (NPMAX grows by one order of magnitude) is to exchange the dimensions back and forth at each call, but this slows execution down by roughly a factor of 2. Initialising at each call (moving call exc_init() out of the if statement in sub exc_mpi) does solve it, but it is totally inefficient since MAIN (not listed) loops a lot. I bet that permanently exchanging the X/Y/Y0 dimensions would also solve it, but I do not want to lose the efficiency of nested loops like SUB01 (the first dimension is much bigger than the others). Indeed, MOD02 creates a temporary array for the call; doing that copy explicitly does not solve the issue, and forcing allocation on the heap or on the stack does not help either.

Any hint out there? Thanks for reading,
P
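PS: for completeness, the per-call re-initialisation workaround I mention above (which avoids the fault but is too slow for my loop counts) looks roughly like this. It is only a sketch, not my actual code: exc_free is a hypothetical helper that releases the previous persistent requests and committed datatypes with MPI_Request_free / MPI_Type_free before exc_init rebuilds them.

  subroutine exc_mpi (X)
    !! same interface as in mod01.f90 above
    real, dimension (ni:ns, m, l), intent(inout) :: X
    logical, save :: frstime = .true.
    integer, save :: mpitype_sn, mpitype_sp, mpitype_rn, mpitype_rp
    integer, save :: requests(4), reqcount
    integer       :: istatus(MPI_STATUS_SIZE,4), ierr

    ! rebuild the persistent requests at every call instead of only once
    if (.not. frstime) call exc_free()   ! drop requests/types from the previous call
    call exc_init()                      ! recreate datatypes and MPI_Send/Recv_Init
    frstime = .false.

    call MPI_StartAll(reqcount, requests, ierr)
    call MPI_WaitAll(reqcount, requests, istatus, ierr)

  contains

    subroutine exc_free
      integer :: k, ierrf
      do k = 1, reqcount
        call MPI_Request_free(requests(k), ierrf)   ! persistent requests are inactive after WaitAll
      end do
      call MPI_Type_free(mpitype_sn, ierrf)
      call MPI_Type_free(mpitype_rn, ierrf)
      ! (plus mpitype_sp / mpitype_rp when the PN0 <-> PN1 exchange is active)
    end subroutine exc_free

    ! exc_init exactly as in mod01.f90 above

  end subroutine exc_mpi

This is what I mean by "initialising at each call": it avoids the fault, but creating and committing the datatypes at every outer iteration is exactly the overhead I would like to avoid.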