Re: [OMPI users] OpenIB problems
Hi Guys, The alternative to THREAD_MULTIPLE problem is to use --mca mpi_leave_pinned 1 to mpirun option. This will ensure 1 RDMA operation contrary to splitting data in MAX RDMA size (default to 1MB). If your data size is small say below 1 MB, program will run well with THREAD_MULTIPLE. Problem comes when data size increases and OpenMPI starts splitting it. I think even with Bigger sizes, Program works if interconnect is TCP, but fails to work on IB. So on IB, you can run your program if you set mca paramter mpi_leave_pinned to 1. Cheers Neeraj On Thu, 29 Nov 2007 Brock Palen wrote : >Jeff thanks for all the reply's, > >Hate to admit but at the moment we can't log onto the switch. > >But the ibcheckerrors command returns nothing out of bounds, and i >think that command also checks the switch ports. > >Thanks, we will do some tests > >Brock Palen >Center for Advanced Computing >bro...@umich.edu >(734)936-1985 > > >On Nov 27, 2007, at 4:50 PM, Jeff Squyres wrote: > > > Sorry for jumping in late; the holiday and other travel prevented me > > from getting to all my mail recently... :-\ > > > > Have you checked the counters on the subnet manager to see if any > > other errors are occurring? It might be good to clear all the > > counters, run the job, and see if the counters are increasing faster > > than they should (i.e., any particular counter should advance very > > very slowly -- perhaps 1 per day or so). > > > > I'll ask around the kernel-level guys (i.e., Roland) to see what else > > could cause this kind of error. > > > > > > > > On Nov 27, 2007, at 3:35 PM, Brock Palen wrote: > > > >> Ok i will open a case with cisco, > >> > >> > >> Brock Palen > >> Center for Advanced Computing > >> bro...@umich.edu > >> (734)936-1985 > >> > >> > >> On Nov 27, 2007, at 4:19 PM, Andrew Friedley wrote: > >> > >>> > >>> > >>> Brock Palen wrote: > >> What would be a place to look? Should this just be default then > >> for > >> OMPI? ompi_info shows the default as 10 seconds? Is that right > >> 'seconds' ? > > The other IB guys can probably answer better than I can -- I'm > > not an > > expert in this part of IB (or really any part I guess :). Not > > sure > > why > > a larger value isn't the default. No, its not seconds -- check > > the > > description of the MCA parameter: > > > > 4.096 microseconds * (2^btl_openib_ib_timeout) > > You sure? > ompi_info --param btl openib > > MCA btl: parameter "btl_openib_ib_timeout" (current value: "10") > InfiniBand transmit timeout, in seconds > (must be >= 1) > >>> > >>> Yeah: > >>> > >>> MCA btl: parameter "btl_openib_ib_timeout" (current value: "10") > >>> InfiniBand transmit timeout, plugged into formula: > >>> 4.096 microseconds * (2^btl_openib_ib_timeout)(must be > = 0 and <= 31) > >>> > >>> Reading earlier in the thread you said OMPI v1.2.0, I got this > >>> from a > >>> trunk checkout thats around 3 weeks old. A quick check shows this > >>> description was changed between 1.2.0 and 1.2.1. However the use of > >>> this parameter hasn't changed -- it's simply passed along to IB > >>> verbs > >>> when creating a queue pair (aka a connection). 
> >>> > >>> Andrew > >>> ___ > >>> users mailing list > >>> us...@open-mpi.org > >>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>> > >>> > >> > >> ___ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > -- > > Jeff Squyres > > Cisco Systems > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > >___ >users mailing list >us...@open-mpi.org >http://www.open-mpi.org/mailman/listinfo.cgi/users
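As a concrete illustration of the two knobs discussed in this thread (the timeout value 20, the process count and the executable name below are only placeholders, not recommendations from the posters), both are ordinary MCA parameters passed on the mpirun command line:

mpirun --mca mpi_leave_pinned 1 --mca btl_openib_ib_timeout 20 -np 4 ./a.out

Using the formula quoted above, a btl_openib_ib_timeout of 20 would mean 4.096 microseconds * 2^20, i.e. roughly 4.3 seconds per transmit attempt.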
Re: [OMPI users] [openMPI-infiniband] openMPI in IB network when openSM with LASH is running
> There is work starting literally right about now to allow Open MPI to > use the RDMA CM and/or the IBCM for creating OpenFabrics connections > (IB or iWARP). When is this expected to be completed? -Mahesh
Re: [OMPI users] ./configure error on windows while installing openmpi-1.2.4(latest)
Hi George, thanks for your reply. I passed the --disable-mpi-f77 option to the configure script, but now configure fails with the following reason: configure: error: Cannot support Fortran MPI_ADDRESS_KIND! Can you please let me know how to get rid of this problem (i.e. what option to pass)? Thanks, Geetha On 11/28/07, George Bosilca wrote: > > If your F77 compiler do not support array of LOGICAL variables (which > seems to be the case if you look in the config.log file), then you're > left with only one option. Remove the F77 support from the > compilation. This means adding the --disable-mpi-f77 option to the ./ > configure. > > Thanks, > george. > > On Nov 28, 2007, at 9:24 AM, geetha r wrote: > > > Hi, > > Subject: "Need exact command line for ./configure {options list}" to build openmpi-1.2.4 on Windows. > > While the configure script is checking the Fortran 77 compiler I am getting the following error, so the Open MPI build is unsuccessful on Windows (with the configure script): > > checking for correct handling of FORTRAN logical arrays... no > > configure: error: Error determining if arrays of logical values work properly. > > I want to build openmpi-1.2.4 (which is downloaded from MinGW) on a Windows 2000 machine. Can somebody give the proper build command I can use to build Open MPI on a Windows 2000 machine, i.e. > > ./configure ...(options list) > > Can somebody please tell me the exact options to pass in the options list? I am using Cygwin to build Open MPI on Windows. > > PS: I am attaching the output files: > > config.log -> actual log file. > > config.out -> output of the ./configure run. > > make.out -> fails because configure did not succeed on Windows. > > make.install -> fails because configure did not succeed on Windows. > > PS: I am using g77, g++ and gcc from the MinGW package. I have downloaded and added g95 as well, but that does not solve my problem. > > Thanks, > > Geetha > > <make.install.zip> <make.out.zip> <config.out.zip> ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > >
[OMPI users] configure: error: Cannot support Fortran MPI_ADDRESS_KIND!
Hi Terry, thanks for your reply. The array-of-LOGICAL problem went away when I used the --disable-mpi-f77 option, but now I am getting the following error: configure: error: Cannot support Fortran MPI_ADDRESS_KIND! The option string I am using is as follows: ./configure --disable-mpi-f77 --with-devel-headers. Thanks, Geetha. On 11/29/07, Terry Frankcombe wrote: > > On Wed, 2007-11-28 at 13:20 -0500, George Bosilca wrote: > > If your F77 compiler do not support array of LOGICAL variables (which > > seems to be the case if you look in the config.log file), then you're > > left with only one option. Remove the F77 support from the > > compilation. This means adding the --disable-mpi-f77 option to the ./ > > configure. > > It's a lot weirder than that. > > configure: WARNING: *** Fortran 77 REAL*8 does not have expected size! > configure: WARNING: *** Expected 8, got 8 > configure: WARNING: *** Disabling MPI support for Fortran 77 REAL*8 > > Somehow, 8/=8 > > :-\ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
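One guess, not confirmed anywhere in this thread: the remaining MPI_ADDRESS_KIND check may come from the Fortran 90 bindings rather than the F77 ones, so a configure line that disables both sets of Fortran bindings (assuming you do not need Fortran support at all) might get past it:

./configure --disable-mpi-f77 --disable-mpi-f90 --with-devel-headers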
Re: [OMPI users] Newbie: Using hostfile
A non MPI application does run without any issues. Could eloberate on what you mean by doing mpirun "hostname". You mean i just do an 'mpirun lynx' in my case??? On Nov 28, 2007 9:57 PM, Jeff Squyres wrote: > Well, that's odd. > > What happens if you try to mpirun "hostname" (i.e., a non-MPI > application)? Does it run, or does it hang? > > > > On Nov 23, 2007, at 6:00 AM, Madireddy Samuel Vijaykumar wrote: > > > I have been using using clusters for some tests. My localhost "lynx" > > and i have "puma" and "tiger" which make up the cluster. All have > > passwordless ssh enabled. Now if i have the following in my > > hostfile(perline in the same order) > > > > lynx > > puma > > tiger > > > > My tests(from lynx) run over the cluster without any issues. > > > > But if move/remove the lynx from there either (perline in the same > > order) > > > > puma > > lynx > > tiger > > > > or > > > > puma > > tiger > > > > My test(from lynx) just does not get any where. It just hangs. And > > does not proceed at all. Is this an issue with way my script handles > > the cluster node. Or is there an method for the hostfile. Thanks. > > > > -- > > Sam aka Vijju > > :)~ > > Linux: Open, True and Cool > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > Cisco Systems > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- Sam aka Vijju :)~ Linux: Open, True and Cool
Re: [OMPI users] Run a process double
Hi, on 29.11.2007 at 00:02, Henry Adolfo Lambis Miranda wrote: This is my first post to the mailing list. I have installed Open MPI 1.2.4 on a dual-processor x86_64 AMD machine running SuSE Linux. In principle the installation was successful, with ifort 10.x. But when I run any code (mpirun -np 2 a.out), instead of sharing the calculations between the two processors, the system duplicates the executable and sends one copy to each processor. This seems to be fine. What were you expecting? With OpenMP you will see threads, and with Open MPI processes. -- Reuti I don't know what the h$%& is going on... Regards, Henry -- Henry Adolfo Lambis Miranda, Chem.Eng. Molecular Simulation Group I & II, Rovira i Virgili University. http://www.etseq.urv.es/ms Av. Països Catalans, 26, C.P. 43007, Tarragona, Catalunya, Espanya. "You will not be able to stay home, brother. You will not be able to plug in, turn on and cop out (...) because the revolution will not be televised." Gil Scott-Heron (The Revolution Will Not Be Televised, 1974). "Success is a rather repugnant thing. Its false resemblance to merit deceives men." -- Victor Hugo (1802-1885), French novelist. "The military man is a plant that must be tended with care so that it does not bear fruit." -- Jacques Tati. "Freedom comes in small packets, usually TCP/IP." Colombian reality bite: http://www.youtube.com/watch?v=jn3vM_5kIgM http://en.wikipedia.org/wiki/Cartagena,_Colombia http://www.youtube.com/watch?v=cvxMWSsrwg0 http://www.youtube.com/watch?v=eVmYf5U6x3k ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
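To make Reuti's distinction concrete: each of the two processes started by mpirun runs the whole executable, and any division of the calculation has to be programmed explicitly from the process rank. A minimal sketch (not the original poster's code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many were started? */

    /* each process takes its own slice of the work */
    int n = 1000, chunk = n / size;
    int start = rank * chunk;
    printf("process %d of %d handles items %d..%d\n",
           rank, size, start, start + chunk - 1);

    MPI_Finalize();
    return 0;
}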
Re: [OMPI users] Newbie: Using hostfile
On Nov 29, 2007, at 2:09 AM, Madireddy Samuel Vijaykumar wrote: A non MPI application does run without any issues. Could eloberate on what you mean by doing mpirun "hostname". You mean i just do an 'mpirun lynx' in my case??? No, I mean mpirun --hostfile hostname This should run the "hostname" command on each of your nodes. If running "hostname" doesn't work after changing the order, then something is very wrong. If it *does* work, it implies something that there is faulty in the MPI startup (which is more complicated than starting up non-MPI applications). On Nov 28, 2007 9:57 PM, Jeff Squyres wrote: Well, that's odd. What happens if you try to mpirun "hostname" (i.e., a non-MPI application)? Does it run, or does it hang? On Nov 23, 2007, at 6:00 AM, Madireddy Samuel Vijaykumar wrote: I have been using using clusters for some tests. My localhost "lynx" and i have "puma" and "tiger" which make up the cluster. All have passwordless ssh enabled. Now if i have the following in my hostfile(perline in the same order) lynx puma tiger My tests(from lynx) run over the cluster without any issues. But if move/remove the lynx from there either (perline in the same order) puma lynx tiger or puma tiger My test(from lynx) just does not get any where. It just hangs. And does not proceed at all. Is this an issue with way my script handles the cluster node. Or is there an method for the hostfile. Thanks. -- Sam aka Vijju :)~ Linux: Open, True and Cool ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Sam aka Vijju :)~ Linux: Open, True and Cool ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
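For example, with a hostfile name made up for illustration, the test Jeff describes would be:

mpirun --hostfile myhosts hostname

Each launched process should simply print the name of the node it landed on (puma, lynx, tiger, possibly repeated depending on slot counts). If that hangs after reordering the hostfile, the problem is in launching processes at all rather than in MPI communication.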
Re: [OMPI users] [openMPI-infiniband] openMPI in IB network when openSM with LASH is running
On Nov 29, 2007, at 12:08 AM, Keshetti Mahesh wrote: There is work starting literally right about now to allow Open MPI to use the RDMA CM and/or the IBCM for creating OpenFabrics connections (IB or iWARP). when this is expected to be completed? It will not planned to be released until the v1.3 series is released. See http://www.open-mpi.org/community/lists/users/2007/11/4535.php https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3 -- Jeff Squyres Cisco Systems
Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
Hi Bob I'm afraid the person most familiar with the oob subsystem recently left the project, so we are somewhat hampered at the moment. I don't recognize the "Software caused connection abort" error message - it doesn't appear to be one of ours (at least, I couldn't find it anywhere in our code base, though I can't swear it isn't there in some dark corner), and I don't find it in my own sys/errno.h file. With those caveats, all I can say is that something appears to be blocking the connection from your remote node back to the head node. Are you sure both nodes are available on IPv4 (since you disabled IPv6)? Can you try ssh'ing to the remote node and doing a ping to the head node using the IPv4 interface? Do you have another method you could use to check and see if max14 will accept connections from max15? If I interpret the error message correctly, it looks like something in the connect handshake is being aborted. We try a couple of times, but then give up and try other interfaces - since no other interface is available, you get that other error message and we abort. Sorry I can't be more help - like I said, this is now a weak spot in our coverage that needs to be rebuilt. Ralph On 11/28/07 2:41 PM, "Bob Soliday" wrote: > I am new to openmpi and have a problem that I cannot seem to solve. > I am trying to run the hello_c example and I can't get it to work. > I compiled openmpi with: > > ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 > --with-openib > > The hostname file contains the local host and one other node. When I > run it I get: > > > [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun -- > debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 > hello_c > [max14:31465] [0,0,0] accepting connections via event library > [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe > [max14:31466] [0,0,1] accepting connections via event library > [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting > port 55152 to: 192.168.2.14:38852 > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: > sending ack, 0 > [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255 > [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 > nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 > [max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 > nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 > Daemon [0,0,1] checking in as pid 31466 on host max14 > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 > [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to > 192.168.1.14:38852 failed: Software caused connection abort (103) > [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to > 192.168.1.14:38852 failed: Software caused connection abort (103) > [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to > 192.168.1.14:38852 failed, connecting over all interfaces failed! 
> [max15:28222] OOB: Connection to HNP lost > [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0] > [max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15 > [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/ > pls_base_orted_cmds.c at line 275 > [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c > at line 1166 > [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at > line 90 > [max14:31465] ERROR: A daemon on node max15 failed to start as expected. > [max14:31465] ERROR: There may be more information available from > [max14:31465] ERROR: the remote shell (see above). > [max14:31465] ERROR: The daemon exited unexpectedly with status 1. > [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0] > [max14:31466] [0,0,1] orted_recv_pls: received exit > [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15 > [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: peer closed > connection > [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_peer_close(0x523100) sd 6 > state 4 > [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/ > pls_base_orted_cmds.c at line 188 > [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c > at line 1198 > -- > mpirun was unable to cleanly terminate the daemons for this job. > Returned value Timeout instead of ORTE_SUCCESS. > -- > > > > I can see that the orted deamon program is starting on both computers > but i
Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
I solved the problem by making a change to orte/mca/oob/tcp/oob_tcp_peer.c On Linux 2.6 I have read that after a failed connect system call the next call to connect can immediately return ECONNABORTED and not try to actually connect, the next call to connect will then work. So I changed mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call connect again. The hello_c example script is now working. I don't think this has solved the underlying cause as to way connect is failing in the first place but at least now I move on to the next step. My best guess at the moment is that it is using eth0 initially when I want it to use eth1. This fails and then when it moves on to eth1 I run into the "can't call connect after it just failed bug". --Bob Ralph H Castain wrote: Hi Bob I'm afraid the person most familiar with the oob subsystem recently left the project, so we are somewhat hampered at the moment. I don't recognize the "Software caused connection abort" error message - it doesn't appear to be one of ours (at least, I couldn't find it anywhere in our code base, though I can't swear it isn't there in some dark corner), and I don't find it in my own sys/errno.h file. With those caveats, all I can say is that something appears to be blocking the connection from your remote node back to the head node. Are you sure both nodes are available on IPv4 (since you disabled IPv6)? Can you try ssh'ing to the remote node and doing a ping to the head node using the IPv4 interface? Do you have another method you could use to check and see if max14 will accept connections from max15? If I interpret the error message correctly, it looks like something in the connect handshake is being aborted. We try a couple of times, but then give up and try other interfaces - since no other interface is available, you get that other error message and we abort. Sorry I can't be more help - like I said, this is now a weak spot in our coverage that needs to be rebuilt. Ralph On 11/28/07 2:41 PM, "Bob Soliday" wrote: I am new to openmpi and have a problem that I cannot seem to solve. I am trying to run the hello_c example and I can't get it to work. I compiled openmpi with: ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 --with-openib The hostname file contains the local host and one other node. 
When I run it I get: [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun -- debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c [max14:31465] [0,0,0] accepting connections via event library [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe [max14:31466] [0,0,1] accepting connections via event library [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0 [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255 [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 Daemon [0,0,1] checking in as pid 31466 on host max14 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103) [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103) [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed! [max15:28222] OOB: Connection to HNP lost [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0] [max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15 [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/ pls_base_orted_cmds.c at line 275 [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166 [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90 [max14:31465] ERROR: A daemon on node max15 failed to start as expected. [max14:31465] ERROR: There may be more information available from [max14:31465] ERROR: the remote shell (see above). [max14:31465] ERROR: The daemon exited unexpectedly with status 1. [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0] [max14:31466] [0,0,1] orted_recv_pls: received exit [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15 [max14:31465] [0,0,0]-[
Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
Interesting. Would you mind sharing your patch? -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Bob Soliday Sent: Thursday, November 29, 2007 11:35 AM To: Ralph H Castain Cc: Open MPI Users Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem I solved the problem by making a change to orte/mca/oob/tcp/oob_tcp_peer.c On Linux 2.6 I have read that after a failed connect system call the next call to connect can immediately return ECONNABORTED and not try to actually connect, the next call to connect will then work. So I changed mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call connect again. The hello_c example script is now working. I don't think this has solved the underlying cause as to way connect is failing in the first place but at least now I move on to the next step. My best guess at the moment is that it is using eth0 initially when I want it to use eth1. This fails and then when it moves on to eth1 I run into the "can't call connect after it just failed bug". --Bob Ralph H Castain wrote: > Hi Bob > > I'm afraid the person most familiar with the oob subsystem recently > left the project, so we are somewhat hampered at the moment. I don't > recognize the "Software caused connection abort" error message - it > doesn't appear to be one of ours (at least, I couldn't find it > anywhere in our code base, though I can't swear it isn't there in some > dark corner), and I don't find it in my own sys/errno.h file. > > With those caveats, all I can say is that something appears to be > blocking the connection from your remote node back to the head node. > Are you sure both nodes are available on IPv4 (since you disabled > IPv6)? Can you try ssh'ing to the remote node and doing a ping to the > head node using the IPv4 interface? > > Do you have another method you could use to check and see if max14 > will accept connections from max15? If I interpret the error message > correctly, it looks like something in the connect handshake is being > aborted. We try a couple of times, but then give up and try other > interfaces - since no other interface is available, you get that other error message and we abort. > > Sorry I can't be more help - like I said, this is now a weak spot in > our coverage that needs to be rebuilt. > > Ralph > > > > On 11/28/07 2:41 PM, "Bob Soliday" wrote: > >> I am new to openmpi and have a problem that I cannot seem to solve. >> I am trying to run the hello_c example and I can't get it to work. >> I compiled openmpi with: >> >> ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 >> --with-openib >> >> The hostname file contains the local host and one other node. 
When I >> run it I get: >> >> >> [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun >> -- debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 >> hello_c [max14:31465] [0,0,0] accepting connections via event library >> [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe >> [max14:31466] [0,0,1] accepting connections via event library >> [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe >> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466] >> [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 >> to: 192.168.2.14:38852 [max14:31466] [0,0,1]-[0,0,0] >> mca_oob_tcp_peer_complete_connect: >> sending ack, 0 >> [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255 >> [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 >> nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] >> [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 >> sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] >> [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max14:31466] [0,0,1]-[0,0,0] >> mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0] >> mca_oob_tcp_recv: tag 2 Daemon [0,0,1] checking in as pid 31466 on >> host max14 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 >> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max15:28222] >> [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to >> 192.168.1.14:38852 failed: Software caused connection abort (103) >> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect >> to >> 192.168.1.14:38852 failed: Software caused connection abort (103) >> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect >> to >> 192.168.1.14:38852 failed, connecting over all interfaces failed! >> [max15:28222] OOB: Connection to HNP lost [max14:31466] [0,0,1] >> orted_recv_pls: received message from [0,0,0] [max14:31466] [0,0,1] >> orted_recv_pls: received kill_local_procs [max14:31466] >> [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15 [max14:31465] [0,0,0] >> ORTE_ERROR_LOG: Timeout in file base/ pls_base_orted_cmds.c at line >> 275 [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file >> pls_rs
Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
Jeff Squyres (jsquyres) wrote: Interesting. Would you mind sharing your patch? -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Bob Soliday Sent: Thursday, November 29, 2007 11:35 AM To: Ralph H Castain Cc: Open MPI Users Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem I solved the problem by making a change to orte/mca/oob/tcp/oob_tcp_peer.c On Linux 2.6 I have read that after a failed connect system call the next call to connect can immediately return ECONNABORTED and not try to actually connect, the next call to connect will then work. So I changed mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call connect again. The hello_c example script is now working. I don't think this has solved the underlying cause as to way connect is failing in the first place but at least now I move on to the next step. My best guess at the moment is that it is using eth0 initially when I want it to use eth1. This fails and then when it moves on to eth1 I run into the "can't call connect after it just failed bug". --Bob

I changed oob_tcp_peer.c at line 289 from:

    /* start the connect - will likely fail with EINPROGRESS */
    if(connect(peer->peer_sd, (struct sockaddr*)&inaddr,
               sizeof(struct sockaddr_in)) < 0) {
        /* non-blocking so wait for completion */
        if(opal_socket_errno == EINPROGRESS ||
           opal_socket_errno == EWOULDBLOCK) {
            opal_event_add(&peer->peer_send_event, 0);
            return ORTE_SUCCESS;
        }
        opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                    "connect to %s:%d failed: %s (%d)",
                    ORTE_NAME_ARGS(orte_process_info.my_name),
                    ORTE_NAME_ARGS(&(peer->peer_name)),
                    inet_ntoa(inaddr.sin_addr),
                    ntohs(inaddr.sin_port),
                    strerror(opal_socket_errno),
                    opal_socket_errno);
        continue;
    }

to:

    /* start the connect - will likely fail with EINPROGRESS */
    if(connect(peer->peer_sd, (struct sockaddr*)&inaddr,
               sizeof(struct sockaddr_in)) < 0) {
        /* non-blocking so wait for completion */
        if (opal_socket_errno == ECONNABORTED) {
            if(connect(peer->peer_sd, (struct sockaddr*)&inaddr,
                       sizeof(struct sockaddr_in)) < 0) {
                if(opal_socket_errno == EINPROGRESS ||
                   opal_socket_errno == EWOULDBLOCK) {
                    opal_event_add(&peer->peer_send_event, 0);
                    return ORTE_SUCCESS;
                }
                opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                            "connect to %s:%d failed: %s (%d)",
                            ORTE_NAME_ARGS(orte_process_info.my_name),
                            ORTE_NAME_ARGS(&(peer->peer_name)),
                            inet_ntoa(inaddr.sin_addr),
                            ntohs(inaddr.sin_port),
                            strerror(opal_socket_errno),
                            opal_socket_errno);
                continue;
            }
        } else {
            if(opal_socket_errno == EINPROGRESS ||
               opal_socket_errno == EWOULDBLOCK) {
                opal_event_add(&peer->peer_send_event, 0);
                return ORTE_SUCCESS;
            }
            opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                        "connect to %s:%d failed: %s (%d)",
                        ORTE_NAME_ARGS(orte_process_info.my_name),
                        ORTE_NAME_ARGS(&(peer->peer_name)),
                        inet_ntoa(inaddr.sin_addr),
                        ntohs(inaddr.sin_port),
                        strerror(opal_socket_errno),
                        opal_socket_errno);
            continue;
        }
    }
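As an aside, the same one-shot retry could be expressed without duplicating the whole error-handling block. This is only an untested sketch of an alternative shape for the patch, not what was actually committed:

/* sketch: retry connect() once if Linux 2.6 reports ECONNABORTED */
int rc, tries = 0;
do {
    rc = connect(peer->peer_sd, (struct sockaddr*)&inaddr,
                 sizeof(struct sockaddr_in));
} while (rc < 0 && opal_socket_errno == ECONNABORTED && ++tries < 2);
if (rc < 0) {
    /* non-blocking so wait for completion */
    if (opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
        opal_event_add(&peer->peer_send_event, 0);
        return ORTE_SUCCESS;
    }
    /* ...same opal_output() error report and continue as in the original... */
}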
[OMPI users] Reinitialize MPI_COMM_WORLD
Hi, I have a simple MPI program that uses MPI_Comm_spawn to create additional child processes. Using MPI_Intercomm_merge, I merge the child and parent communicators, resulting in a single expanded user-defined intracommunicator. I know MPI_COMM_WORLD is a constant which is statically initialized during the MPI_Init call, but is there a way to update the value of MPI_COMM_WORLD at runtime to reflect this expanded set of processes? Is it possible to somehow reinitialize MPI_COMM_WORLD using the ompi_comm_init() function? Regards, Rajesh
Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
If you wanted it to use eth1, your other option would be to simply tell it to do so using the mca param. I believe it is something like -mca oob_tcp_if_include eth1 -mca oob_tcp_if_exclude eth0 You may only need the latter since you only have the two interfaces. Ralph On 11/29/07 9:47 AM, "Jeff Squyres (jsquyres)" wrote: > Interesting. Would you mind sharing your patch? > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Bob Soliday > Sent: Thursday, November 29, 2007 11:35 AM > To: Ralph H Castain > Cc: Open MPI Users > Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem > > I solved the problem by making a change to > orte/mca/oob/tcp/oob_tcp_peer.c > > On Linux 2.6 I have read that after a failed connect system call the > next call to connect can immediately return ECONNABORTED and not try to > actually connect, the next call to connect will then work. So I changed > mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call > connect again. The hello_c example script is now working. > > I don't think this has solved the underlying cause as to way connect is > failing in the first place but at least now I move on to the next step. > My best guess at the moment is that it is using eth0 initially when I > want it to use eth1. This fails and then when it moves on to eth1 I run > into the "can't call connect after it just failed bug". > > --Bob > > > Ralph H Castain wrote: >> Hi Bob >> >> I'm afraid the person most familiar with the oob subsystem recently >> left the project, so we are somewhat hampered at the moment. I don't >> recognize the "Software caused connection abort" error message - it >> doesn't appear to be one of ours (at least, I couldn't find it >> anywhere in our code base, though I can't swear it isn't there in some > >> dark corner), and I don't find it in my own sys/errno.h file. >> >> With those caveats, all I can say is that something appears to be >> blocking the connection from your remote node back to the head node. >> Are you sure both nodes are available on IPv4 (since you disabled >> IPv6)? Can you try ssh'ing to the remote node and doing a ping to the >> head node using the IPv4 interface? >> >> Do you have another method you could use to check and see if max14 >> will accept connections from max15? If I interpret the error message >> correctly, it looks like something in the connect handshake is being >> aborted. We try a couple of times, but then give up and try other >> interfaces - since no other interface is available, you get that other > error message and we abort. >> >> Sorry I can't be more help - like I said, this is now a weak spot in >> our coverage that needs to be rebuilt. >> >> Ralph >> >> >> >> On 11/28/07 2:41 PM, "Bob Soliday" wrote: >> >>> I am new to openmpi and have a problem that I cannot seem to solve. >>> I am trying to run the hello_c example and I can't get it to work. >>> I compiled openmpi with: >>> >>> ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 > >>> --with-openib >>> >>> The hostname file contains the local host and one other node. 
When I >>> run it I get: >>> >>> >>> [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun >>> -- debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 >>> hello_c [max14:31465] [0,0,0] accepting connections via event library > >>> [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe >>> [max14:31466] [0,0,1] accepting connections via event library >>> [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe >>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466] >>> [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 >>> to: 192.168.2.14:38852 [max14:31466] [0,0,1]-[0,0,0] >>> mca_oob_tcp_peer_complete_connect: >>> sending ack, 0 >>> [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255 >>> [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 >>> nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] >>> [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 >>> sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] >>> [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max14:31466] [0,0,1]-[0,0,0] > >>> mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0] >>> mca_oob_tcp_recv: tag 2 Daemon [0,0,1] checking in as pid 31466 on >>> host max14 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 >>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max15:28222] >>> [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to >>> 192.168.1.14:38852 failed: Software caused connection abort (103) >>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect >>> to >>> 192.168.1.14:38852 failed: Software caused connection abort (103) >>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect >>> to >>> 192.168.1.14:38852 failed, connecting ove
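Combined with the mpirun invocation from earlier in this thread, Ralph's suggestion would look something like the following (untested):

/usr/local/software/openmpi-1.2.4/bin/mpirun -mca oob_tcp_if_exclude eth0 --debug-daemons -machinefile hostfile -np 2 hello_c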
Re: [OMPI users] Reinitialize MPI_COMM_WORLD
no, unfortunately there is no way to do that. In fact, each set of child processes which you spawn has its own MPI_COMM_WORLD. MPI_COMM_WORLD is static and there is no way to change it at runtime... Edgar Rajesh Sudarsan wrote: Hi, I have simple MPI program that uses MPI_comm_spawn to create additional child processes. Using MPI_Intercomm_merge, I merge the child and the parent communicator resulting in a single expanded user defined intracommunicator. I know MPI_COMM_WORLD is a constant which is statically initialized during MPI_Init call. But is there a way to update the value of MPI_COMM_WORLD at runtime to reflect this expanded set of processes? Is it possible to some how reinitialize MPI_COMM_WORLD using ompi_comm_init() function? Regards, Rajesh ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
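For completeness, the usual workaround for what Rajesh describes is to keep using the merged intracommunicator everywhere MPI_COMM_WORLD would otherwise appear. A sketch built only from the standard calls he already mentions (the executable name and process count are placeholders):

MPI_Comm children, everyone;
MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
               MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
/* merge the parent group and the spawned children into one intracommunicator */
MPI_Intercomm_merge(children, 0, &everyone);
/* pass "everyone" to all later MPI calls instead of MPI_COMM_WORLD;
   MPI_COMM_WORLD itself cannot be grown or reinitialized */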
Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
Thanks, this works. I have now removed my change to oob_tcp_peer.c. --Bob Soliday Ralph Castain wrote: If you wanted it to use eth1, your other option would be to simply tell it to do so using the mca param. I believe it is something like -mca oob_tcp_if_include eth1 -mca oob_tcp_if_exclude eth0 You may only need the latter since you only have the two interfaces. Ralph On 11/29/07 9:47 AM, "Jeff Squyres (jsquyres)" wrote: Interesting. Would you mind sharing your patch? -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Bob Soliday Sent: Thursday, November 29, 2007 11:35 AM To: Ralph H Castain Cc: Open MPI Users Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem I solved the problem by making a change to orte/mca/oob/tcp/oob_tcp_peer.c On Linux 2.6 I have read that after a failed connect system call the next call to connect can immediately return ECONNABORTED and not try to actually connect, the next call to connect will then work. So I changed mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call connect again. The hello_c example script is now working. I don't think this has solved the underlying cause as to way connect is failing in the first place but at least now I move on to the next step. My best guess at the moment is that it is using eth0 initially when I want it to use eth1. This fails and then when it moves on to eth1 I run into the "can't call connect after it just failed bug". --Bob Ralph H Castain wrote: Hi Bob I'm afraid the person most familiar with the oob subsystem recently left the project, so we are somewhat hampered at the moment. I don't recognize the "Software caused connection abort" error message - it doesn't appear to be one of ours (at least, I couldn't find it anywhere in our code base, though I can't swear it isn't there in some dark corner), and I don't find it in my own sys/errno.h file. With those caveats, all I can say is that something appears to be blocking the connection from your remote node back to the head node. Are you sure both nodes are available on IPv4 (since you disabled IPv6)? Can you try ssh'ing to the remote node and doing a ping to the head node using the IPv4 interface? Do you have another method you could use to check and see if max14 will accept connections from max15? If I interpret the error message correctly, it looks like something in the connect handshake is being aborted. We try a couple of times, but then give up and try other interfaces - since no other interface is available, you get that other error message and we abort. Sorry I can't be more help - like I said, this is now a weak spot in our coverage that needs to be rebuilt. Ralph On 11/28/07 2:41 PM, "Bob Soliday" wrote: I am new to openmpi and have a problem that I cannot seem to solve. I am trying to run the hello_c example and I can't get it to work. I compiled openmpi with: ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 --with-openib The hostname file contains the local host and one other node. 
When I run it I get: [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun -- debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c [max14:31465] [0,0,0] accepting connections via event library [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe [max14:31466] [0,0,1] accepting connections via event library [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0 [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255 [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 Daemon [0,0,1] checking in as pid 31466 on host max14 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103) [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103) [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed! [max15:28222] OOB: Connection to HNP lost [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0] [max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs [max14:3146