o I should report that there has been an important development in this
problem, before anyone spends time on my previous post.  We have
managed to get the original test program to run without hanging by
directly connecting the two test compute nodes together (thus
bypassing the switch), as shown here, where eth2 is still the 10G
interface; a sketch of the test's ping-pong pattern follows the output:

[roberpj@bro127:~/samples/openmpi/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth2 --host bro127,bro128 ./a.out
Number of processes = 2
Test repeated 3 times for reliability
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
Run 2 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
Run 3 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
P0: Done
I am process 1 on node bro128
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Done
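
o For reference, here is a minimal sketch of the kind of two-process
ping-pong test that would produce the output above.  This is a
hypothetical reconstruction, not the actual test source, which may
differ in detail:

/* Hypothetical reconstruction of the ping-pong test above. */
#include <mpi.h>
#include <stdio.h>

#define NRUNS 3   /* "Test repeated 3 times for reliability" */

int main(int argc, char **argv)
{
    int rank, size, buf = 0, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    if (rank == 0) {
        printf("Number of processes = %d\n", size);
        printf("Test repeated %d times for reliability\n", NRUNS);
    }
    printf("I am process %d on node %s\n", rank, name);

    for (int run = 1; run <= NRUNS; run++) {
        if (rank == 0) {
            printf("Run %d of %d\n", run, NRUNS);
            printf("P0: Sending to P1\n");
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P1\n");
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("P0: Received from P1\n");
        } else if (rank == 1) {
            printf("P1: Waiting to receive from P0\n");
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("P1: Sending to P0\n");
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    printf("P%d: Done\n", rank);

    MPI_Finalize();
    return 0;
}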

o This now points to the Netgear XSM7224S 10G switch.  Its firmware
version turns out to be slightly old at 9.0.1.14, so we will update
it to the latest 9.0.1.29 and then run the test again. I will report
back the result. In the meantime, if anyone knows of configuration
setting(s) in the switch that could block Open MPI message passing,
please reply to this comment.  Thanks!
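
o While waiting on the firmware update, one host-side check worth a
look (a sketch using standard Linux tools; interface name as in the
setup above) is whether Ethernet flow control (pause frames) is
active on the 10G links, since mismatched pause settings on a switch
port are one classic way sustained TCP streams can stall:

[roberpj@bro127:~] ethtool -a eth2                   # pause (flow control) parameters
[roberpj@bro127:~] ethtool -S eth2 | grep -i pause   # pause counters, if the driver exposes them
[roberpj@bro127:~] ip -s link show eth2              # error/drop counters on the interface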


---------- Forwarded message ----------
Date: Tue, 25 Feb 2014 20:07:31 -0500 (EST)
From: Doug Roberts <robe...@sharcnet.ca>
To: us...@open-mpi.org
Subject: Re: [OMPI users] Connection timed out with multiple nodes

Hello again. The "oob_stress" program runs cleanly on each of
the two test nodes bro127 and bro128, as shown below.  Would
you say this rules out a problem with the network and switch,
or are there other test programs that should be run next?

o eth0 and eth2: without plm_base_verbose

[roberpj@bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth0 ./oob_stress
[bro127:02020] Ring 1 message size 10 bytes
[bro127:02020] [[27318,1],0] Ring 1 completed
[bro127:02020] Ring 2 message size 100 bytes
[bro127:02020] [[27318,1],0] Ring 2 completed
[bro127:02020] Ring 3 message size 1000 bytes
[bro127:02020] [[27318,1],0] Ring 3 completed
[roberpj@bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth2 ./oob_stress
[bro127:02022] Ring 1 message size 10 bytes
[bro127:02022] [[27312,1],0] Ring 1 completed
[bro127:02022] Ring 2 message size 100 bytes
[bro127:02022] [[27312,1],0] Ring 2 completed
[bro127:02022] Ring 3 message size 1000 bytes
[bro127:02022] [[27312,1],0] Ring 3 completed

[roberpj@bro128:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth0 ./oob_stress
[bro128:04484] Ring 1 message size 10 bytes
[bro128:04484] [[23046,1],0] Ring 1 completed
[bro128:04484] Ring 2 message size 100 bytes
[bro128:04484] [[23046,1],0] Ring 2 completed
[bro128:04484] Ring 3 message size 1000 bytes
[bro128:04484] [[23046,1],0] Ring 3 completed
[roberpj@bro128:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth2 ./oob_stress
[bro128:04486] Ring 1 message size 10 bytes
[bro128:04486] [[23040,1],0] Ring 1 completed
[bro128:04486] Ring 2 message size 100 bytes
[bro128:04486] [[23040,1],0] Ring 2 completed
[bro128:04486] Ring 3 message size 1000 bytes
[bro128:04486] [[23040,1],0] Ring 3 completed

o eth2: with plm_base_verbose on

[roberpj@bro127:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth2 -mca plm_base_verbose 5 ./oob_stress
[bro127:01936] mca:base:select:(  plm) Querying component [rsh]
[bro127:01936] [[INVALID],INVALID] plm:base:rsh_lookup on agent ssh : rsh path NULL
[bro127:01936] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[bro127:01936] mca:base:select:(  plm) Querying component [slurm]
[bro127:01936] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[bro127:01936] mca:base:select:(  plm) Querying component [tm]
[bro127:01936] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[bro127:01936] mca:base:select:(  plm) Selected component [rsh]
[bro127:01936] plm:base:set_hnp_name: initial bias 1936 nodename hash 3261509427
[bro127:01936] plm:base:set_hnp_name: final jobfam 27333
[bro127:01936] [[27333,0],0] plm:base:rsh_setup on agent ssh : rsh path NULL
[bro127:01936] [[27333,0],0] plm:base:receive start comm
[bro127:01936] released to spawn
[bro127:01936] [[27333,0],0] plm:base:setup_job for job [INVALID]
[bro127:01936] [[27333,0],0] plm:rsh: launching job [27333,1]
[bro127:01936] [[27333,0],0] plm:rsh: no new daemons to launch
[bro127:01936] [[27333,0],0] plm:base:launch_apps for job [27333,1]
[bro127:01936] [[27333,0],0] plm:base:report_launched for job [27333,1]
[bro127:01936] [[27333,0],0] plm:base:app_report_launch from daemon [[27333,0],0]
[bro127:01936] [[27333,0],0] plm:base:app_report_launched for proc [[27333,1],0] from daemon [[27333,0],0]: pid 1937 state 4 exit 0
[bro127:01936] [[27333,0],0] plm:base:app_report_launch completed processing
[bro127:01936] [[27333,0],0] plm:base:report_launched all apps reported
[bro127:01936] [[27333,0],0] plm:base:launch wiring up iof
[bro127:01936] [[27333,0],0] plm:base:launch completed for job [27333,1]
[bro127:01936] completed spawn for job [27333,1]
[bro127:01937] Ring 1 message size 10 bytes
[bro127:01937] [[27333,1],0] Ring 1 completed
[bro127:01937] Ring 2 message size 100 bytes
[bro127:01937] [[27333,1],0] Ring 2 completed
[bro127:01937] Ring 3 message size 1000 bytes
[bro127:01937] [[27333,1],0] Ring 3 completed
[bro127:01936] [[27333,0],0] plm:base:receive processing msg
[bro127:01936] [[27333,0],0] plm:base:receive update proc state command
[bro127:01936] [[27333,0],0] plm:base:receive got update_proc_state for job [27333,1]
[bro127:01936] [[27333,0],0] plm:base:receive got update_proc_state for vpid 0 state 80 exit_code 0
[bro127:01936] [[27333,0],0] plm:base:receive updating state for proc [[27333,1],0] current state 10 new state 80
[bro127:01936] [[27333,0],0] plm:base:check_job_completed for job [27333,1] - num_terminated 1 num_procs 1
[bro127:01936] [[27333,0],0] plm:base:check_job_completed declared job [27333,1] normally terminated - checking all jobs
[bro127:01936] [[27333,0],0] releasing procs from node bro127
[bro127:01936] [[27333,0],0] releasing proc [[27333,1],0] from node bro127
[bro127:01936] [[27333,0],0] plm:base:check_job_completed all jobs terminated - waking up
[bro127:01936] [[27333,0],0] plm:base:orted_cmd sending orted_exit commands
[bro127:01936] [[27333,0],0] plm:base:receive stop comm
[bro127:01936] [[27333,0],0] plm:base:local:slave:finalize


[roberpj@bro128:~/samples/openmpi/oob_stress] mpirun -npernode 1 -mca oob_tcp_if_include eth2 -mca plm_base_verbose 5 ./oob_stress
[bro128:04462] mca:base:select:(  plm) Querying component [rsh]
[bro128:04462] [[INVALID],INVALID] plm:base:rsh_lookup on agent ssh : rsh path NULL
[bro128:04462] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[bro128:04462] mca:base:select:(  plm) Querying component [slurm]
[bro128:04462] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[bro128:04462] mca:base:select:(  plm) Querying component [tm]
[bro128:04462] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[bro128:04462] mca:base:select:(  plm) Selected component [rsh]
[bro128:04462] plm:base:set_hnp_name: initial bias 4462 nodename hash 186663077
[bro128:04462] plm:base:set_hnp_name: final jobfam 23275
[bro128:04462] [[23275,0],0] plm:base:rsh_setup on agent ssh : rsh path NULL
[bro128:04462] [[23275,0],0] plm:base:receive start comm
[bro128:04462] released to spawn
[bro128:04462] [[23275,0],0] plm:base:setup_job for job [INVALID]
[bro128:04462] [[23275,0],0] plm:rsh: launching job [23275,1]
[bro128:04462] [[23275,0],0] plm:rsh: no new daemons to launch
[bro128:04462] [[23275,0],0] plm:base:launch_apps for job [23275,1]
[bro128:04462] [[23275,0],0] plm:base:report_launched for job [23275,1]
[bro128:04462] [[23275,0],0] plm:base:app_report_launch from daemon [[23275,0],0]
[bro128:04462] [[23275,0],0] plm:base:app_report_launched for proc [[23275,1],0] from daemon [[23275,0],0]: pid 4463 state 4 exit 0
[bro128:04462] [[23275,0],0] plm:base:app_report_launch completed processing
[bro128:04462] [[23275,0],0] plm:base:report_launched all apps reported
[bro128:04462] [[23275,0],0] plm:base:launch wiring up iof
[bro128:04462] [[23275,0],0] plm:base:launch completed for job [23275,1]
[bro128:04462] completed spawn for job [23275,1]
[bro128:04463] Ring 1 message size 10 bytes
[bro128:04463] [[23275,1],0] Ring 1 completed
[bro128:04463] Ring 2 message size 100 bytes
[bro128:04463] [[23275,1],0] Ring 2 completed
[bro128:04463] Ring 3 message size 1000 bytes
[bro128:04463] [[23275,1],0] Ring 3 completed
[bro128:04462] [[23275,0],0] plm:base:receive processing msg
[bro128:04462] [[23275,0],0] plm:base:receive update proc state command
[bro128:04462] [[23275,0],0] plm:base:receive got update_proc_state for job [23275,1]
[bro128:04462] [[23275,0],0] plm:base:receive got update_proc_state for vpid 0 state 80 exit_code 0
[bro128:04462] [[23275,0],0] plm:base:receive updating state for proc [[23275,1],0] current state 10 new state 80
[bro128:04462] [[23275,0],0] plm:base:check_job_completed for job [23275,1] - num_terminated 1 num_procs 1
[bro128:04462] [[23275,0],0] plm:base:check_job_completed declared job [23275,1] normally terminated - checking all jobs
[bro128:04462] [[23275,0],0] releasing procs from node bro128
[bro128:04462] [[23275,0],0] releasing proc [[23275,1],0] from node bro128
[bro128:04462] [[23275,0],0] plm:base:check_job_completed all jobs terminated - waking up
[bro128:04462] [[23275,0],0] plm:base:orted_cmd sending orted_exit commands
[bro128:04462] [[23275,0],0] plm:base:receive stop comm
[bro128:04462] [[23275,0],0] plm:base:local:slave:finalize


---------- Forwarded message ----------
Date: Fri, 31 Jan 2014 13:55:41 -0800
From: Ralph Castain <r...@open-mpi.org>
Reply-To: Open MPI Users <us...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Connection timed out with multiple nodes

The only relevant parts are from the application procs - orterun and the orted don't participate in this exchange and never see the BTLs anyway.

It looks like there is just something blocking data transfer across eth2 for some reason. I'm afraid I have no idea why - can you run a standard (i.e., non-MPI) test across it?

For example, I have an oob_stress program in orte/test/system. Try running it

mpirun -npernode 1 -mca oob_tcp_if_include eth2 ./oob_stress

and see if anything works. If the out-of-band can't communicate, this won't even start - it'll just hang. If you configure OMPI with --enable-debug, you can add -mca plm_base_verbose 5 to watch the launch operation and see if the remote daemon is able to respond.
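
For instance, a configure invocation along these lines would produce the debug build (the install prefix here is just a placeholder):

shell$ ./configure --prefix=/opt/openmpi-debug --enable-debug
shell$ make all install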

My guess is that the answer will be "no" and that this will hang, but that would tell us the problem is in the network and not in the TCP BTL.
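
A quick way to exercise the raw TCP path outside of OMPI entirely (a sketch; the address is a placeholder for bro128's eth2 IP, and assumes iperf is available on both nodes) would be:

[roberpj@bro128:~] iperf -s                    # TCP server on bro128
[roberpj@bro127:~] iperf -c 10.x.x.128 -t 60   # 60-second stream from bro127 over eth2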
