The error is from btl/vader. CMA is not functioning as expected. It might work 
if you set btl_vader_single_copy_mechanism=none
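
For example, on the mpirun command line:

  mpirun --mca btl_vader_single_copy_mechanism none ...

or equivalently via the environment:

  export OMPI_MCA_btl_vader_single_copy_mechanism=none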

Performance will suffer though. It would be worth understanding why 
process_vm_readv is failing.

Can you send a simple reproducer?
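
A minimal sketch of the kind of thing that would help (just an illustration of
MPI_Win_allocate + MPI_Get on a single node, not code from your run):

/* Minimal one-sided reproducer sketch.
 * Each rank allocates a small window, fills it, and reads its
 * neighbour's buffer with MPI_Get inside a fence epoch. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 4;
    int *win_buf;
    MPI_Win win;

    /* Let MPI allocate the window memory itself. */
    MPI_Win_allocate(n * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);

    for (int i = 0; i < n; i++)
        win_buf[i] = rank * 10 + i;

    int rbuf[4] = {0};
    int peer = (rank + 1) % size;

    MPI_Win_fence(0, win);
    /* The MPI_Get is where the reported runs fail. */
    MPI_Get(rbuf, n, MPI_INT, peer, 0, n, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("Rank %d got from rank %d: %d %d %d %d\n",
           rank, peer, rbuf[0], rbuf[1], rbuf[2], rbuf[3]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Run on one node with something like "mpirun -np 2 ./a.out"; if CMA is the
problem, it should reproduce without the rest of the application.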

-Nathan

> On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users 
> <users@lists.open-mpi.org> wrote:
> 
> 
> I am not an expert on the one-sided code in Open MPI, but I wanted to comment 
> briefly on the potential MPI-IO related item. As far as I can see, the error 
> message
>  
> “Read -1, expected 48, errno = 1” 
> 
> does not stem from MPI I/O, at least not from the ompio library. What file 
> system did you use for these tests?
>  
> Thanks
> Edgar
>  
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Matt Thompson via 
> users
> Sent: Monday, February 24, 2020 1:20 PM
> To: users@lists.open-mpi.org
> Cc: Matt Thompson <fort...@gmail.com>
> Subject: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, 
> Fails in Open MPI
>  
> All,
>  
> My guess is this is an "I built Open MPI incorrectly" sort of issue, but I'm 
> not sure how to fix it. Namely, I'm currently trying to get an MPI project's 
> CI working on CircleCI using Open MPI to run some unit tests (on a single 
> node, so I need to oversubscribe). I can build everything just fine, but when 
> I try to run, things just...blow up:
>  
> [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 
> -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 
> 6 -ngo 1 -ngi 1 -v T,U -s mpi
>  start app rank:           0
>  start app rank:           1
>  start app rank:           2
>  start app rank:           3
>  start app rank:           4
>  start app rank:           5
> [3796b115c961:03629] Read -1, expected 48, errno = 1
> [3796b115c961:03629] *** An error occurred in MPI_Get
> [3796b115c961:03629] *** reported by process [2144600065,12]
> [3796b115c961:03629] *** on win rdma window 5
> [3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list
> [3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
> abort,
> [3796b115c961:03629] ***    and potentially your MPI job)
>  
> I'm currently more concerned about the MPI_Get error, though I'm not sure 
> what that "Read -1, expected 48, errno = 1" bit is about (MPI-IO error?). Now 
> this code is fairly fancy MPI code, so I decided to try a simpler one. 
> I searched the internet and found an example program here:
>  
> https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication
>  
> and when I build and run with Intel MPI it works:
>  
> (1027)(master) $ mpirun -V
> Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823 (id: 
> 18555)
> Copyright 2003-2018 Intel Corporation.
> (1028)(master) $ mpiicc rma_test.c
> (1029)(master) $ mpirun -np 2 ./a.out
> srun.slurm: cluster configuration lacks support for cpu binding
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> Rank 0 gets data from the shared memory: 10 11 12 13
> Rank 1 gets data from the shared memory: 00 01 02 03
> Rank 0 has new data in the shared memory:Rank 1 has new data in the shared 
> memory: 10 11 12 13
>  00 01 02 03
>  
> So, I have some confidence it was written correctly. Now on the same system I 
> try with Open MPI (building with gcc, not Intel C):
>  
> (1032)(master) $ mpirun -V
> mpirun (Open MPI) 4.0.1
> 
> Report bugs to http://www.open-mpi.org/community/help/
> (1033)(master) $ mpicc rma_test.c
> (1034)(master) $ mpirun -np 2 ./a.out
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> [borgj001:22668] *** An error occurred in MPI_Get
> [borgj001:22668] *** reported by process [2514223105,1]
> [borgj001:22668] *** on win rdma window 3
> [borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range
> [borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
> abort,
> [borgj001:22668] ***    and potentially your MPI job)
> [borgj001:22642] 1 more process has sent help message help-mpi-errors.txt / 
> mpi_errors_are_fatal
> [borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
>  
> This is a similar failure to above. Any ideas what I might be doing wrong 
> here? I don't doubt I'm missing something, but I'm not sure what. Open MPI 
> was built pretty boringly:
>  
> Configure command line: '--with-slurm' '--enable-shared' 
> '--disable-wrapper-rpath' '--disable-wrapper-runpath' 
> '--enable-mca-no-build=btl-usnic' '--prefix=...'
>  
> And I'm not sure if we still need those disable-wrapper bits, but long ago 
> we needed them, and so they've lived on in "how to build" READMEs until 
> something breaks. The btl-usnic exclusion is a bit unknown to me (this was 
> built by sysadmins on a cluster), but it is pretty close to how I build on 
> my desktop, and that build has the same issue.
>  
> Any ideas from the experts?
>  
> --
> Matt Thompson
>    “The fact is, this is about us identifying what we do best and 
>    finding more ways of doing less of it better” -- Director of Better Anna 
> Rampton
