Howard,

Looks like ob1 is working fine. When I looked into the problems with ob1, it looked like the progress thread was polling the Portals event queue before it had been initialized.
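For context, that kind of failure is an init-ordering race: the progress thread starts polling before the event queue it polls has been created. Below is only a minimal standalone sketch of that pattern, with hypothetical names and pthreads standing in for the real code; it is not the actual Open MPI or Portals4 source.

/* Sketch of a progress thread gated on an "initialized" flag.
 * Hypothetical names; not the Open MPI ob1 or Portals4 code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool eq_ready = false;       /* set only after the event queue exists */
static atomic_bool shutting_down = false;

static void poll_event_queue(void)
{
    /* stand-in for PtlEQGet()/PtlEQPoll() on a real Portals event queue */
    printf("polling event queue\n");
}

static void *progress_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&shutting_down)) {
        /* Without this check the thread can poll a queue that has not
         * been created yet -- the ordering bug described above. */
        if (atomic_load(&eq_ready))
            poll_event_queue();
        usleep(1000);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, progress_thread, NULL);

    sleep(1);                          /* pretend PtlEQAlloc() happens here */
    atomic_store(&eq_ready, true);     /* only now is polling safe */

    sleep(1);
    atomic_store(&shutting_down, true);
    pthread_join(tid, NULL);
    return 0;
}

Whatever the real synchronization primitive is, the point is just that the poll must not run until after the equivalent of PtlEQAlloc() has completed.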
b.

$ mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib osu_latency
WARNING: Ummunotify not found: Not using ummunotify can result in incorrect results
download and install ummunotify from:
http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
WARNING: Ummunotify not found: Not using ummunotify can result in incorrect results
download and install ummunotify from:
http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
# OSU MPI Latency Test
# Size       Latency (us)
0                    1.87
1                    1.93
2                    1.90
4                    1.94
8                    1.94
16                   1.96
32                   1.97
64                   1.99
128                  2.43
256                  2.50
512                  2.71
1024                 3.01
2048                 3.45
4096                 4.56
8192                 6.39
16384                8.79
32768               11.50
65536               16.59
131072              27.10
262144              46.97
524288              87.55
1048576            168.89
2097152            331.40
4194304            654.08

> On Feb 7, 2018, at 9:04 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Brian,
>
> As a sanity check, can you see if the ob1 pml works okay, i.e.
>
> mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency
>
> Howard
>
>
> 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>:
>
> Hello,
>
> I'm doing some work with Portals4 and am trying to run some MPI programs
> using Portals 4 as the transport layer. I'm running into problems and am
> hoping that someone can help me figure out how to get things working. I'm
> using Open MPI 3.0.0 with the following configuration:
>
> ./configure CFLAGS=-pipe --prefix=path/to/install --enable-picky
> --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4
> --disable-oshmem --disable-vt --disable-java --disable-mpi-io
> --disable-io-romio --disable-libompitrace --disable-btl-portals4-flow-control
> --disable-mtl-portals4-flow-control
>
> I have also tried the head of the git repo and 2.1.2 with the same results.
> A simpler configure line (with only --prefix and --with-portals4=) gets the
> same results.
>
> Portals4 is built from the GitHub master and configured thus:
>
> ./configure --prefix=path/to/portals4 --with-ev=path/to/libev
> --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered
>
> If I specify the cm pml on the command line, I can get examples/hello_c to
> run correctly. Trying to get some latency numbers using the OSU benchmarks
> is where my trouble begins:
>
> $ mpirun -n 2 --mca mtl portals4 --mca pml cm env PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency
> NOTE: Ummunotify and IB registered mem cache disabled, set PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
> NOTE: Ummunotify and IB registered mem cache disabled, set PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
> # OSU MPI Latency Test
> # Size       Latency (us)
> 0                   25.96
> [node41:19740] *** An error occurred in MPI_Barrier
> [node41:19740] *** reported by process [139815819542529,4294967297]
> [node41:19740] *** on communicator MPI_COMM_WORLD
> [node41:19740] *** MPI_ERR_OTHER: known error not in list
> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [node41:19740] ***    and potentially your MPI job)
>
> Not specifying cm gets an earlier segfault (it defaults to ob1) and looks to
> be a progress-thread initialization problem.
> Using PTL_IGNORE_UMMUNOTIFY=1 gets here:
>
> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency
> # OSU MPI Latency Test
> # Size       Latency (us)
> 0                   24.14
> 1                   26.24
> [node41:19993] *** Process received signal ***
> [node41:19993] Signal: Segmentation fault (11)
> [node41:19993] Signal code: Address not mapped (1)
> [node41:19993] Failing at address: 0x141
> [node41:19993] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fa6ac73b710]
> [node41:19993] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(+0xcd65)[0x7fa69b770d65]
> [node41:19993] [ 2] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(PtlPut+0x143)[0x7fa69b773fb3]
> [node41:19993] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xa961)[0x7fa698cf5961]
> [node41:19993] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xb0e5)[0x7fa698cf60e5]
> [node41:19993] [ 5] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_send+0x90)[0x7fa698cf61d1]
> [node41:19993] [ 6] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_pml_cm.so(+0x5430)[0x7fa69a794430]
> [node41:19993] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(PMPI_Send+0x2b4)[0x7fa6ac9ff018]
> [node41:19993] [ 8] ./osu_latency[0x40106f]
> [node41:19993] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa6ac3b6d5d]
> [node41:19993] [10] ./osu_latency[0x400c59]
>
> This cluster is running RHEL 6.5 without ummunotify modules, but I get the
> same results on a local (small) cluster running Ubuntu 16.04 with
> ummunotify loaded.
>
> Any help would be much appreciated.
>
> thanks,
>
> brian.

--
D. Brian Larkins
Assistant Professor of Computer Science
Rhodes College
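In case it helps to narrow this down outside the OSU suite: below is a small generic ping-pong that exercises the same MPI_Send/MPI_Recv path and switches MPI_COMM_WORLD to MPI_ERRORS_RETURN, so a failing call prints its error string instead of aborting the way the MPI_Barrier run above did. This is only a sketch of a reproducer, not part of the OSU benchmarks or the thread.

/* Minimal ping-pong reproducer; generic MPI, not the OSU benchmark. */
#include <mpi.h>
#include <stdio.h>

static void check(int rc, const char *where)
{
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "%s failed: %s\n", where, msg);
        MPI_Abort(MPI_COMM_WORLD, rc);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Return error codes instead of aborting, so the failing call is visible. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char buf[8] = {0};
    for (int i = 0; i < 1000; i++) {
        if (rank == 0) {
            check(MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD), "MPI_Send");
            check(MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE), "MPI_Recv");
        } else {
            check(MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE), "MPI_Recv");
            check(MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD), "MPI_Send");
        }
    }

    check(MPI_Barrier(MPI_COMM_WORLD), "MPI_Barrier");
    if (rank == 0) printf("ping-pong completed\n");

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with the same mpirun options as the osu_latency commands above, it isolates the small-message send path that the backtrace shows dying inside PtlPut.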
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users