Re: [OMPI users] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
Thanks, George. My sysadmin now says he is pretty sure it is the firewall, but that "isn't going to change", so we need to find a solution.

On 9 February 2018 at 16:58, George Bosilca wrote:
> What are the settings of the firewall on your 2 nodes?
>
> George.
>
> On Fri, Feb 9, 2018 at 3:08 PM, William Mitchell wrote:
>> When I try to run an MPI program on a network with a shared file system
>> and connected by ethernet, I get the error message "tcp_peer_send_blocking:
>> send() to socket 9 failed: Broken pipe (32)" followed by some suggestions
>> of what could cause it, none of which are my problem. I have searched the
>> FAQ, mailing list archives, and googled the error message, with only a few
>> hits touching on it, none of which solved the problem.
>>
>> This is on a Linux CentOS 7 system with Open MPI 1.10.6 and Intel Fortran
>> (more detailed system information below).
>>
>> Here are details on how I encounter the problem:
>>
>> me@host1> cat hellompi.f90
>>    program hello
>>    include 'mpif.h'
>>    integer rank, size, ierror, nl
>>    character(len=MPI_MAX_PROCESSOR_NAME) :: hostname
>>
>>    call MPI_INIT(ierror)
>>    call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>>    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>>    call MPI_GET_PROCESSOR_NAME(hostname, nl, ierror)
>>    print*, 'node', rank, ' of', size, ' on ', hostname(1:nl), ': Hello world'
>>    call MPI_FINALIZE(ierror)
>>    end
>>
>> me@host1> mpifort --showme
>> ifort -I/usr/include/openmpi-x86_64 -pthread -m64 -I/usr/lib64/openmpi/lib
>> -Wl,-rpath -Wl,/usr/lib64/openmpi/lib -Wl,--enable-new-dtags
>> -L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh -lmpi
>>
>> me@host1> ifort --version
>> ifort (IFORT) 18.0.0 20170811
>> Copyright (C) 1985-2017 Intel Corporation. All rights reserved.
>>
>> me@host1> mpifort -o hellompi hellompi.f90
>>
>> [Note: it runs on 1 machine, but not on two]
>>
>> me@host1> mpirun -np 2 hellompi
>> node 0 of 2 on host1.domain: Hello world
>> node 1 of 2 on host1.domain: Hello world
>>
>> me@host1> cat hosts
>> host2.domain
>> host1.domain
>>
>> me@host1> mpirun -np 2 --hostfile hosts hellompi
>> [host2.domain:250313] [[46562,0],1] tcp_peer_send_blocking: send() to
>> socket 9 failed: Broken pipe (32)
>> --
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>> [suggested causes deleted]
>>
>> Here is system information:
>>
>> me@host2> cat /etc/redhat-release
>> CentOS Linux release 7.4.1708 (Core)
>>
>> me@host1> uname -a
>> Linux host1.domain 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58
>> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> me@host1> rpm -qa | grep openmpi
>> mpitests-openmpi-4.1-1.el7.x86_64
>> openmpi-1.10.6-2.el7.x86_64
>> openmpi-devel-1.10.6-2.el7.x86_64
>>
>> me@host1> ompi_info --all
>> [Results of this command for each host are in the attached files.]
>>
>> me@host1> ompi_info -v ompi full --parsable
>> ompi_info: Error: unknown option "-v"
>> [Is the request to run that command given on the Open MPI "Getting Help"
>> web page an error?]
>>
>> me@host1> printenv | grep OMPI
>> MPI_COMPILER=openmpi-x86_64
>> OMPI_F77=ifort
>> OMPI_FC=ifort
>> OMPI_MCA_mpi_yield_when_idle=1
>> OMPI_MCA_btl=tcp,self
>>
>> I am using ssh-agent, and I can ssh between the two hosts. In fact, from
>> host1 I can use ssh to request that host2 ssh back to host1:
>>
>> me@host1> ssh -A host2 "ssh host1 hostname"
>> host1.domain
>>
>> Any suggestions on how to solve this problem are appreciated.
>>
>> Bill
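One quick way to check the firewall hypothesis before changing any Open MPI settings is to test whether host2 can reach an arbitrary unprivileged TCP port on host1, which is the direction that fails above. A minimal sketch, not part of the original exchange, assuming nc (netcat) is available on both hosts and using 12345 as a made-up test port:

me@host1> nc -l 12345                          # listen on a test port (some netcat variants need "nc -l -p 12345")
me@host2> echo hello | nc host1.domain 12345   # if a firewall blocks the port, "hello" never arrives on host1

Repeating the test in the other direction, and on a few different ports, shows whether only certain port ranges are open between the hosts.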
Re: [OMPI users] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
William,

On a typical HPC cluster, the internal interface is not protected by the firewall. If this is eth0, then you can

mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 ...

If only a small range of ports is available, then you will also need to use the oob_tcp_dynamic_ipv4_ports, btl_tcp_port_min_v4 and btl_tcp_port_range_v4 MCA params in order to tell Open MPI which range of ports is open.

Cheers,

Gilles
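To make that concrete, here is a sketch of a full command line for the hello-world case above, assuming eth0 is the interface connecting the two hosts and that the firewall can be asked to open TCP ports 10000-10099 between them (both the interface name and the port numbers are placeholders, not values from this thread):

me@host1> mpirun -np 2 --hostfile hosts \
              --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 \
              --mca oob_tcp_dynamic_ipv4_ports 10000-10049 \
              --mca btl_tcp_port_min_v4 10050 --mca btl_tcp_port_range_v4 50 \
              hellompi

With these settings the runtime's out-of-band connections stay within 10000-10049 and the TCP BTL uses 10050-10099 (a range of 50 ports starting at 10050), so only those 100 ports need to be open between the hosts.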
[OMPI users] mpirun issue using more than 64 hosts
I'm running OpenMPI 2.1.0, built from source, on RHEL 7. I'm using the default ssh-based launcher, where I have my private ssh key on rank 0 and the associated public key on all ranks. I create a hosts file with a list of unique IPs, with the host that I'm running mpirun from on the first line, and run this command:

mpirun -N 1 --bind-to none --hostfile hosts.txt hostname

This works fine up to 64 machines. At 65 or greater, I get ssh errors. Frequently it is

Permission denied (publickey,gssapi-keyex,gssapi-with-mic)

though today another user got

Host key verification failed.

I have confirmed that I can successfully ssh into these instances manually. I've also written a loop in bash which backgrounds an ssh sleep command to more than 64 instances, and this succeeds.

From what I can tell, the /etc/ssh/ssh*config settings that limit ssh connections have to do with inbound, not outbound, limits, and I can prove by running straight ssh commands that I'm not hitting a limit.

Is there something wrong with my mpirun syntax (I've run this way thousands of times without issue with fewer than 64 hosts, and I know MPI is frequently used on orders of magnitude more hosts than this)? Or is this a known bug that's addressed in a later MPI release?

Thanks for the help.
-Adam
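For reference, the bash test described in the message might look something like the sketch below; the host count of 70, the 30-second sleep, and the reuse of hosts.txt are illustrative assumptions, not details from the thread:

# Start one backgrounded ssh per host, then wait for all of them.
# If every connection succeeds, outbound ssh concurrency from this
# machine is not the limiting factor.
for host in $(head -n 70 hosts.txt); do
    ssh -o BatchMode=yes "$host" sleep 30 &
done
wait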
Re: [OMPI users] mpirun issue using more than 64 hosts
Adam,

By default, when more than 64 hosts are involved, mpirun uses a tree spawn in order to remote-launch the orted daemons.

That means you have two options here:
- allow all compute nodes to ssh each other (e.g. the ssh public key of *all* the nodes should be in *all* the authorized_keys files)
- do not use a tree spawn (e.g. mpirun --mca plm_rsh_no_tree_spawn true ...)

I recommend the first option; otherwise mpirun would fork&exec a large number of ssh processes and hence use quite a lot of resources on the node running mpirun.

Cheers,

Gilles
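A rough sketch of the first option, assuming a single shared key pair is acceptable in this environment (the key file name and the reuse of hosts.txt are assumptions, not details from the thread):

# Generate one passphrase-less key pair that every compute node will use.
ssh-keygen -t rsa -N "" -f ~/.ssh/cluster_key

for host in $(cat hosts.txt); do
    # Install the private key on each node (careful: this overwrites any existing id_rsa).
    scp ~/.ssh/cluster_key "$host":.ssh/id_rsa
    ssh "$host" chmod 600 .ssh/id_rsa
    # Authorize the matching public key on each node.
    ssh-copy-id -i ~/.ssh/cluster_key.pub "$host"
done

The second option needs no key changes at all; it is just a matter of adding --mca plm_rsh_no_tree_spawn true to the existing mpirun command line, at the resource cost Gilles describes.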
Re: [OMPI users] mpirun issue using more than 64 hosts
A... thanks Gilles. That makes sense. I was stuck thinking there was an ssh problem on rank 0; it never occurred to me that mpirun was doing something clever there and that those ssh errors were from a different instance altogether.

It's no problem to put my private key on all instances - I'll go that route.

-Adam