Hi Brian,

I'm tracking two different problems here. The first is that mtl-portals4 is segfaulting in PtlPut(). The second is the btl-portals4 problem you described here:

> Not specifying CM gets an earlier segfault (defaults to ob1) and looks to
> be a progress thread initialization problem.

I have reproduced the mtl-portals4 problem, but I don't have a fix yet. I'm working on a fix for the btl-portals4 problem now.
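For reference while we dig into this, here is a minimal, spec-level sketch of the call that's crashing. To be clear, this is not the actual mtl-portals4 send path, just the Portals 4 PtlPut() API as documented, with placeholder arguments (the NI setup, target nid/pid, portal table index, and match bits below are all illustrative, and error checking is omitted for brevity):

    /* Minimal Portals 4 put, per the spec; placeholders, not ompi code. */
    #include <portals4.h>
    #include <string.h>

    int main(void)
    {
        ptl_handle_ni_t ni;
        ptl_handle_eq_t eq;
        ptl_handle_md_t md;
        ptl_md_t        md_desc;
        ptl_process_t   target;
        char            buf[64] = "hello";

        /* Initialize the library and a matching NI with physical addressing. */
        PtlInit();
        PtlNIInit(PTL_IFACE_DEFAULT, PTL_NI_MATCHING | PTL_NI_PHYSICAL,
                  PTL_PID_ANY, NULL, NULL, &ni);
        PtlEQAlloc(ni, 64, &eq);

        /* Describe and bind the send buffer. A stale or corrupted MD handle
           is one way a later PtlPut can fault inside the library. */
        memset(&md_desc, 0, sizeof(md_desc));
        md_desc.start     = buf;
        md_desc.length    = sizeof(buf);
        md_desc.eq_handle = eq;
        md_desc.ct_handle = PTL_CT_NONE;
        PtlMDBind(ni, &md_desc, &md);

        /* Placeholder peer; real nid/pid come from the runtime. */
        target.phys.nid = 0;
        target.phys.pid = 0;

        /* This is the call at the top of the user-level frames in the
           backtrace below (PtlPut+0x143). */
        PtlPut(md,           /* MD covering the send buffer      */
               0,            /* local offset into the MD         */
               sizeof(buf),  /* length                           */
               PTL_ACK_REQ,  /* request an ack                   */
               target,       /* destination process              */
               0,            /* portal table index (placeholder) */
               0,            /* match bits (placeholder)         */
               0,            /* remote offset                    */
               NULL,         /* user_ptr for event correlation   */
               0);           /* hdr_data                         */

        PtlMDRelease(md);
        PtlEQFree(eq);
        PtlNIFini(ni);
        PtlFini();
        return 0;
    }

One general observation: a failing address as small as 0x141 usually means the library dereferenced a pointer computed from a near-NULL or garbage value (e.g., a field offset off a bad struct pointer), which would be consistent with a bad handle or uninitialized internal state, though I can't yet say which it is here.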
todd

On Wed, Feb 7, 2018 at 12:03 PM, brian larkins <brianlark...@gmail.com> wrote:
> Hello,
>
> I'm doing some work with Portals4 and am trying to run some MPI programs
> using Portals 4 as the transport layer. I'm running into problems and am
> hoping that someone can help me figure out how to get things working. I'm
> using OpenMPI 3.0.0 with the following configuration:
>
> ./configure CFLAGS=-pipe --prefix=path/to/install --enable-picky
> --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4
> --disable-oshmem --disable-vt --disable-java --disable-mpi-io
> --disable-io-romio --disable-libompitrace
> --disable-btl-portals4-flow-control --disable-mtl-portals4-flow-control
>
> I have also tried the head from the git repo and 2.1.2 with the same
> results. A simpler configure line (w/ just --prefix and --with-portals4=)
> also gets the same results.
>
> Portals4 is from github master and configured thus:
>
> ./configure --prefix=path/to/portals4 --with-ev=path/to/libev
> --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered
>
> If I specify the cm pml on the command line, I can get examples/hello_c to
> run correctly. Trying to get some latency numbers using the OSU benchmarks
> is where my trouble begins:
>
> $ mpirun -n 2 --mca mtl portals4 --mca pml cm env PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency
> NOTE: Ummunotify and IB registered mem cache disabled, set PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
> NOTE: Ummunotify and IB registered mem cache disabled, set PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
> # OSU MPI Latency Test
> # Size          Latency (us)
> 0                       25.96
> [node41:19740] *** An error occurred in MPI_Barrier
> [node41:19740] *** reported by process [139815819542529,4294967297]
> [node41:19740] *** on communicator MPI_COMM_WORLD
> [node41:19740] *** MPI_ERR_OTHER: known error not in list
> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [node41:19740] ***    and potentially your MPI job)
>
> Not specifying CM gets an earlier segfault (defaults to ob1) and looks to
> be a progress thread initialization problem.
> Using PTL_IGNORE_UMMUNOTIFY=1 gets here:
>
> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency
> # OSU MPI Latency Test
> # Size          Latency (us)
> 0                       24.14
> 1                       26.24
> [node41:19993] *** Process received signal ***
> [node41:19993] Signal: Segmentation fault (11)
> [node41:19993] Signal code: Address not mapped (1)
> [node41:19993] Failing at address: 0x141
> [node41:19993] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fa6ac73b710]
> [node41:19993] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(+0xcd65)[0x7fa69b770d65]
> [node41:19993] [ 2] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(PtlPut+0x143)[0x7fa69b773fb3]
> [node41:19993] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xa961)[0x7fa698cf5961]
> [node41:19993] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xb0e5)[0x7fa698cf60e5]
> [node41:19993] [ 5] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_send+0x90)[0x7fa698cf61d1]
> [node41:19993] [ 6] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_pml_cm.so(+0x5430)[0x7fa69a794430]
> [node41:19993] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(PMPI_Send+0x2b4)[0x7fa6ac9ff018]
> [node41:19993] [ 8] ./osu_latency[0x40106f]
> [node41:19993] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa6ac3b6d5d]
> [node41:19993] [10] ./osu_latency[0x400c59]
>
> This cluster is running RHEL 6.5 without ummunotify modules, but I get the
> same results on a local (small) cluster running ubuntu 16.04 with
> ummunotify loaded.
>
> Any help would be much appreciated.
>
> thanks,
>
> brian.
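P.S. A small aside on the reproducer command lines, in case it's useful: the "env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency" trick works because env becomes the launched program and sets the variable before exec'ing the benchmark, but mpirun's -x option exports the variable to the launched processes directly, which reads a bit more plainly. Assuming the same benchmark binary and MCA settings as above, this should be equivalent:

    $ mpirun --mca pml cm -n 2 -x PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency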