Thank you. Explicitly setting the interface as shown below has resolved this.

Thanks, 

Dean 

> On 28 Nov 2020, at 10:27, Gilles Gouaillardet via users 
> <users@lists.open-mpi.org> wrote:
> 
> Dean,
> 
> That typically occurs when some nodes have multiple interfaces, and
> several nodes have a similar IP on a private/unused interface.
> 
> I suggest you explicitly restrict the interface Open MPI should be using.
> For example, you can
> 
> mpirun --mca btl_tcp_if_include eth0 ...
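> 
> If you want that to apply to every run, the same parameter can also be set
> in the environment or in a per-user MCA parameter file, for example:
> 
> export OMPI_MCA_btl_tcp_if_include=eth0
> 
> or, in $HOME/.openmpi/mca-params.conf:
> 
> btl_tcp_if_include = eth0
> 
> (eth0 is only an example here; use whichever interface your nodes actually
> share.)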
> 
> Cheers,
> 
> Gilles
> 
> On Fri, Nov 27, 2020 at 7:36 PM CHESTER, DEAN (PGR) via users
> <users@lists.open-mpi.org> wrote:
>> 
>> Hi,
>> 
>> I am trying to set up some machines with Open MPI, connected over Ethernet,
>> to expand a batch system we already have in use.
>> 
>> This is already controlled with Slurm and we are able to get a basic MPI
>> program running across 2 of the machines, but when I compile and run
>> something that actually performs communication, it fails.
>> 
>> Slurm was not configured with PMI/PMI2, so we have to launch programs with
>> mpirun.
>> 
>> OpenMPI is installed on my home space which is accessible on all of the 
>> nodes we are trying to run on.
>> 
>> My hello world application gets the world size, rank, and hostname and
>> prints them. This launches and runs successfully.
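>> 
>> The program is essentially the textbook example, roughly this (not my exact
>> source, but equivalent):
>> 
>> #include <mpi.h>
>> #include <stdio.h>
>> 
>> int main(int argc, char *argv[]) {
>>     int world_size, world_rank, name_len;
>>     char name[MPI_MAX_PROCESSOR_NAME];
>> 
>>     MPI_Init(&argc, &argv);                      /* start the MPI runtime */
>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* total number of ranks */
>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* rank of this process */
>>     MPI_Get_processor_name(name, &name_len);     /* hostname of this node */
>> 
>>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>>            name, world_rank, world_size);
>> 
>>     MPI_Finalize();
>>     return 0;
>> }
>> 
>> Across the two nodes it gives: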
>> 
>> Hello world from processor viper-03, rank 0 out of 8 processors
>> Hello world from processor viper-03, rank 1 out of 8 processors
>> Hello world from processor viper-03, rank 2 out of 8 processors
>> Hello world from processor viper-03, rank 3 out of 8 processors
>> Hello world from processor viper-04, rank 4 out of 8 processors
>> Hello world from processor viper-04, rank 5 out of 8 processors
>> Hello world from processor viper-04, rank 6 out of 8 processors
>> Hello world from processor viper-04, rank 7 out of 8 processors
>> 
>> I then tried to run the OSU micro-benchmarks, but these fail. I get the
>> following output:
>> 
>> # OSU MPI Latency Test v5.6.3
>> # Size          Latency (us)
>> [viper-01:25885] [[21336,0],0] ORTE_ERROR_LOG: Data unpack would read past 
>> end of buffer in file util/show_help.c at line 507
>> --------------------------------------------------------------------------
>> WARNING: Open MPI accepted a TCP connection from what appears to be a
>> another Open MPI process but cannot find a corresponding process
>> entry for that peer.
>> 
>> This attempted connection will be ignored; your MPI job may or may not
>> continue properly.
>> 
>>  Local host: viper-02
>>  PID:        20406
>> --------------------------------------------------------------------------
>> 
>> The machines are firewalled, but ports 9000-9060 are open. I have set the
>> following MCA parameters to match the open ports:
>> 
>> btl_tcp_port_min_v4=9000
>> btl_tcp_port_range_v4=60
>> oob_tcp_dynamic_ipv4_ports=9020
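>> 
>> If I understand the MCA system correctly, that is equivalent to passing
>> them on the command line, e.g.:
>> 
>>     mpirun --mca btl_tcp_port_min_v4 9000 --mca btl_tcp_port_range_v4 60 \
>>            --mca oob_tcp_dynamic_ipv4_ports 9020 ...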
>> 
>> Open MPI 4.0.5 was built with GCC 4.8.5, with only the installation prefix
>> set to $HOME/local/ompi.
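>> 
>> (That is, roughly:
>> 
>>     ./configure --prefix=$HOME/local/ompi && make && make install
>> 
>> with no options other than the prefix passed to configure.)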
>> 
>> What else could be going wrong?
>> 
>> Kind Regards,
>> 
>> Dean
