I am running OpenMPI 1.4.2 under RHEL 5.5. After installing, I tested with "mpirun -np 4 date", and the command returned four "date" outputs as expected.
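Since "date" is not an MPI program, that test only confirms that mpirun can launch local processes; it does not exercise MPI_Init or any communication. A trivial MPI program along these lines (my own throwaway test, not part of either application, with a file name I made up) would cover that as well:

[code]
/* mpi_hello.c -- my own minimal test: check that MPI_Init completes and
 * that each rank can report itself. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = -1, size = 0;

    MPI_Init(&argc, &argv);   /* a hang here would implicate the MPI runtime itself */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d is alive\n", rank, size);
    MPI_Finalize();
    return 0;
}
[/code]

Built with "mpicc mpi_hello.c -o mpi_hello" and run with "mpirun -np 2 ./mpi_hello", it should print one line per rank if the runtime is healthy.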
Then I tried running two different MPI programs, "geminimpi" and "salinas". Both run correctly with "mpirun -np 1 $prog", but both hang indefinitely whenever I use anything other than "-np 1".

Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following, which looks good and is what I would expect (the error at the end is just geminimpi complaining that its input file is missing):

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster = 0 0 1
Checking license for Gemini
Checking license for Linux OS
Checking internal license list
License valid
GEMINI Startup
Gemini +++ Version 5.1.00 20110501 +++
+++++ ERROR MESSAGE +++++
FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6559 on
node XXX_TUX01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]

With "mpirun --debug-daemons -np 2 geminimpi", it hangs indefinitely at this point:

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]
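The "-np 2" run stalls right after the daemons exchange collective data, which as far as I can tell happens during MPI_Init. To take geminimpi and salinas out of the picture, a bare two-rank ping along the lines of the sketch below (again my own test code, not anything shipped with the applications) should show whether plain MPI point-to-point traffic works on this box:

[code]
/* mpi_ping.c -- my own test program. Rank 0 sends one integer to rank 1
 * and waits for it to come back; run with exactly two ranks. A hang in
 * MPI_Init or in the exchange would point at the MPI layer rather than
 * at the applications. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = -1, token = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("rank 0 got the token back\n");
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
[/code]

If that also hangs with "-np 2", the problem is in the MPI installation on this machine rather than in the applications.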