The error is from btl/vader. CMA is not functioning as expected. It might work if you set btl_vader_single_copy_mechanism=none.
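For example (untested; this just adds the MCA parameter to your original command line, and you could equivalently set OMPI_MCA_btl_vader_single_copy_mechanism=none in the environment):

  /opt/openmpi-4.0.2/bin/mpirun --mca btl_vader_single_copy_mechanism none -np 18 -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 6 -ngo 1 -ngi 1 -v T,U -s mpi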
Performance will suffer though. It would be worth understanding why process_vm_readv is failing. Can you send a simple reproducer (something along the lines of the sketch appended at the end of this message)?

-Nathan

> On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users <users@lists.open-mpi.org> wrote:
>
> I am not an expert for the one-sided code in Open MPI, but I wanted to comment briefly on the potential MPI-IO related item. As far as I can see, the error message
>
> “Read -1, expected 48, errno = 1”
>
> does not stem from MPI I/O, at least not from the ompio library. What file system did you use for these tests?
>
> Thanks
> Edgar
>
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Matt Thompson via users
> Sent: Monday, February 24, 2020 1:20 PM
> To: users@lists.open-mpi.org
> Cc: Matt Thompson <fort...@gmail.com>
> Subject: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI
>
> All,
>
> My guess is this is an "I built Open MPI incorrectly" sort of issue, but I'm not sure how to fix it. Namely, I'm currently trying to get an MPI project's CI working on CircleCI, using Open MPI to run some unit tests (on a single node, so I need to oversubscribe). I can build everything just fine, but when I try to run, things just... blow up:
>
> [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 6 -ngo 1 -ngi 1 -v T,U -s mpi
>  start app rank: 0
>  start app rank: 1
>  start app rank: 2
>  start app rank: 3
>  start app rank: 4
>  start app rank: 5
> [3796b115c961:03629] Read -1, expected 48, errno = 1
> [3796b115c961:03629] *** An error occurred in MPI_Get
> [3796b115c961:03629] *** reported by process [2144600065,12]
> [3796b115c961:03629] *** on win rdma window 5
> [3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list
> [3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
> [3796b115c961:03629] *** and potentially your MPI job)
>
> I'm currently more concerned about the MPI_Get error, though I'm not sure what that "Read -1, expected 48, errno = 1" bit is about (an MPI-IO error?). Now, this code is fairly fancy MPI code, so I decided to try a simpler one. I searched the internet and found an example program here:
>
> https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication
>
> and when I build and run it with Intel MPI, it works:
>
> (1027)(master) $ mpirun -V
> Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823 (id: 18555)
> Copyright 2003-2018 Intel Corporation.
> (1028)(master) $ mpiicc rma_test.c
> (1029)(master) $ mpirun -np 2 ./a.out
> srun.slurm: cluster configuration lacks support for cpu binding
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> Rank 0 gets data from the shared memory: 10 11 12 13
> Rank 1 gets data from the shared memory: 00 01 02 03
> Rank 0 has new data in the shared memory:Rank 1 has new data in the shared memory: 10 11 12 13
> 00 01 02 03
>
> So, I have some confidence it was written correctly.
> Now, on the same system, I try with Open MPI (building with gcc, not Intel C):
>
> (1032)(master) $ mpirun -V
> mpirun (Open MPI) 4.0.1
>
> Report bugs to http://www.open-mpi.org/community/help/
> (1033)(master) $ mpicc rma_test.c
> (1034)(master) $ mpirun -np 2 ./a.out
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> [borgj001:22668] *** An error occurred in MPI_Get
> [borgj001:22668] *** reported by process [2514223105,1]
> [borgj001:22668] *** on win rdma window 3
> [borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range
> [borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
> [borgj001:22668] *** and potentially your MPI job)
> [borgj001:22642] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> This is a similar failure to the one above. Any ideas what I might be doing wrong here? I don't doubt I'm missing something, but I'm not sure what. Open MPI was built pretty boringly:
>
> Configure command line: '--with-slurm' '--enable-shared' '--disable-wrapper-rpath' '--disable-wrapper-runpath' '--enable-mca-no-build=btl-usnic' '--prefix=...'
>
> And I'm not sure if we still need those disable-wrapper bits, but long ago we needed them, and so they've lived on in "how to build" READMEs until something breaks. This btl-usnic option is a bit unknown to me (this was built by sysadmins on a cluster), but this is pretty close to how I build on my desktop, and that build has the same issue.
>
> Any ideas from the experts?
>
> --
> Matt Thompson
> “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton
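For reference, here is a minimal sketch of the kind of stand-alone reproducer requested above: each rank exposes a small buffer through MPI_Win_allocate and reads its neighbour's buffer with MPI_Get between fences. This is only an illustration (the buffer size and layout are made up; it is neither the MAPL test nor the Intel blog example), but run with two ranks on one node it should go through the same on-node one-sided path, assuming the rdma one-sided component and the vader btl are selected.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    int *buf;                     /* memory exposed through the window */
    int remote[4] = {0, 0, 0, 0};
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank exposes four ints through an MPI-allocated window. */
    MPI_Win_allocate(4 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    for (int i = 0; i < 4; i++)
        buf[i] = rank * 10 + i;

    /* Close the local-store epoch, then read the neighbour's buffer
     * with MPI_Get (the call that fails in the original report). */
    MPI_Win_fence(0, win);
    MPI_Get(remote, 4, MPI_INT, (rank + 1) % size, 0, 4, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("Rank %d got: %d %d %d %d\n",
           rank, remote[0], remote[1], remote[2], remote[3]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Built and run with something like (file name is hypothetical): mpicc win_get_test.c -o win_get_test && mpirun -np 2 ./win_get_test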