Re: [OMPI users] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)

2018-02-12 Thread William Mitchell
Thanks, George.  My sysadmin now says he is pretty sure it is the firewall,
but that "isn't going to change" so we need to find a solution.

On 9 February 2018 at 16:58, George Bosilca  wrote:

> What are the settings of the firewall on your 2 nodes ?
>
>   George.
>
>
>
> On Fri, Feb 9, 2018 at 3:08 PM, William Mitchell 
> wrote:
>
>> When I try to run an MPI program on a network with a shared file system
>> and connected by ethernet, I get the error message "tcp_peer_send_blocking:
>> send() to socket 9 failed: Broken pipe (32)" followed by some suggestions
>> of what could cause it, none of which are my problem.  I have searched the
>> FAQ, mailing list archives, and googled the error message, with only a few
>> hits touching on it, none of which solved the problem.
>>
>> This is on a Linux CentOS 7 system with Open MPI 1.10.6 and Intel Fortran
>> (more detailed system information below).
>>
>> Here are details on how I encounter the problem:
>>
>> me@host1> cat hellompi.f90
>>program hello
>>include 'mpif.h'
>>integer rank, size, ierror, nl
>>character(len=MPI_MAX_PROCESSOR_NAME) :: hostname
>>
>>call MPI_INIT(ierror)
>>call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>>call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>>call MPI_GET_PROCESSOR_NAME(hostname, nl, ierror)
>>print*, 'node', rank, ' of', size, ' on ', hostname(1:nl), ': Hello world'
>>call MPI_FINALIZE(ierror)
>>end
>>
>> me@host1> mpifort --showme
>> ifort -I/usr/include/openmpi-x86_64 -pthread -m64
>> -I/usr/lib64/openmpi/lib -Wl,-rpath -Wl,/usr/lib64/openmpi/lib
>> -Wl,--enable-new-dtags -L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh
>> -lmpi
>>
>> me@host1> ifort --version
>> ifort (IFORT) 18.0.0 20170811
>> Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.
>>
>> me@host1> mpifort -o hellompi hellompi.f90
>>
>> [Note: it runs on 1 machine, but not on two]
>>
>> me@host1> mpirun -np 2 hellompi
>>  node   0  of   2  on host1.domain: Hello world
>>  node   1  of   2  on host1.domain: Hello world
>>
>> me@host1> cat hosts
>> host2.domain
>> host1.domain
>>
>> me@host1> mpirun -np 2 --hostfile hosts hellompi
>> [host2.domain:250313] [[46562,0],1] tcp_peer_send_blocking: send() to
>> socket 9 failed: Broken pipe (32)
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>> [suggested causes deleted]
>>
>> Here is system information:
>>
>> me@host2> cat /etc/redhat-release
>> CentOS Linux release 7.4.1708 (Core)
>>
>> me@host1> uname -a
>> Linux host1.domain 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58
>> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> me@host1> rpm -qa | grep openmpi
>> mpitests-openmpi-4.1-1.el7.x86_64
>> openmpi-1.10.6-2.el7.x86_64
>> openmpi-devel-1.10.6-2.el7.x86_64
>>
>> me@host1> ompi_info --all
>> [Results of this command for each host are in the attached files.]
>>
>> me@host1> ompi_info -v ompi full --parsable
>> ompi_info: Error: unknown option "-v"
>> [Is the request to run that command given on the Open MPI "Getting Help"
>> web page an error?]
>>
>> me@host1> printenv | grep OMPI
>> MPI_COMPILER=openmpi-x86_64
>> OMPI_F77=ifort
>> OMPI_FC=ifort
>> OMPI_MCA_mpi_yield_when_idle=1
>> OMPI_MCA_btl=tcp,self
>>
>> I am using ssh-agent, and I can ssh between the two hosts.  In fact, from
>> host1 I can use ssh to request that host2 ssh back to host1:
>>
>> me@host1> ssh -A host2 "ssh host1 hostname"
>> host1.domain
>>
>> Any suggestions on how to solve this problem are appreciated.
>>
>> Bill
>>

Re: [OMPI users] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)

2018-02-12 Thread Gilles Gouaillardet
William,

On a typical HPC cluster, the internal interface is not protected by
the firewall.
If this is eth0, then you can

mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 ...

If only a small range of ports is available, then you will also need the
oob_tcp_dynamic_ipv4_ports, btl_tcp_port_min_v4 and
btl_tcp_port_range_v4 MCA params in order to tell Open MPI which range of
ports is open.
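
For example, if the firewall leaves TCP ports 10000-10100 open (the interface
name and port numbers here are only illustrative, not taken from this thread),
the full command would look roughly like:

mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 \
       --mca oob_tcp_dynamic_ipv4_ports 10000-10100 \
       --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 \
       -np 2 --hostfile hosts hellompi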

Cheers,

Gilles



[OMPI users] mpirun issue using more than 64 hosts

2018-02-12 Thread Adam Sylvester
I'm running OpenMPI 2.1.0, built from source, on RHEL 7.  I'm using the
default ssh-based launcher, where I have my private ssh key on rank 0 and
the associated public key on all ranks.  I create a hosts file with a list
of unique IPs, with the host that I'm running mpirun from on the first
line, and run this command:

mpirun -N 1 --bind-to none --hostfile hosts.txt hostname

This works fine up to 64 machines.  At 65 or greater, I get ssh errors.
Frequently

Permission denied (publickey,gssapi-keyex,gssapi-with-mic)

though today another user got

Host key verification failed.

I have confirmed that I can manually ssh into these instances.  I've also
written a bash loop that backgrounds an ssh sleep command to more than 64
instances, and this succeeds.
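
A loop along these lines reproduces that test (a minimal sketch; the hosts.txt
file name and the sleep duration are assumptions, not from the original
report):

while read h; do
  ssh "$h" sleep 30 &      # background one outbound ssh per host
done < hosts.txt
wait                       # wait for all of the backgrounded connections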

From what I can tell, the /etc/ssh/ssh*config settings that limit ssh
connections have to do with inbound, not outbound limits, and I can prove
by running straight ssh commands that I'm not hitting a limit.

Is there something wrong with my mpirun syntax (I've run this way thousands
of times without issues with fewer than 64 hosts, and I know MPI is
frequently used on orders of magnitude more hosts than this)?  Or is this
a known bug that's addressed in a later MPI release?

Thanks for the help.
-Adam

Re: [OMPI users] mpirun issue using more than 64 hosts

2018-02-12 Thread Gilles Gouaillardet
Adam,

by default, when more than 64 hosts are involved, mpirun uses a tree
spawn in order to remotely launch the orted daemons.

That means you have two options here:
 - allow all compute nodes to ssh into each other (e.g. with a single shared
key pair, the private key must be available on *all* the nodes and the public
key must be in *all* the authorized_keys files); see the sketch below
 - do not use a tree spawn (e.g. mpirun --mca plm_rsh_no_tree_spawn true ...)

I recommend the first option; otherwise mpirun will fork&exec a large
number of ssh processes and hence use quite a lot of resources on the
node running mpirun.
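
For the first option, with a single shared key pair, something along these
lines would push the private key and a pre-populated known_hosts to every
node (a rough sketch run from the mpirun node; the hosts.txt and key paths
are assumptions, and it overwrites any existing known_hosts on the nodes):

ssh-keyscan -H -f hosts.txt > known_hosts.all   # collect every node's host key once
for h in $(cat hosts.txt); do
  scp -p ~/.ssh/id_rsa   "$h":~/.ssh/id_rsa     # same private key on every node
  scp -p known_hosts.all "$h":~/.ssh/known_hosts
  ssh "$h" 'chmod 600 ~/.ssh/id_rsa'
done

That covers both the "Permission denied" and the "Host key verification
failed" errors, since every node can then authenticate to, and already
trusts, every other node.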

Cheers,

Gilles



Re: [OMPI users] mpirun issue using more than 64 hosts

2018-02-12 Thread Adam Sylvester
Ah... thanks, Gilles.  That makes sense.  I was stuck thinking there was
an ssh problem on rank 0; it never occurred to me mpirun was doing
something clever there and that those ssh errors were from a different
instance altogether.

It's no problem to put my private key on all instances - I'll go that route.

-Adam
