I am running OpenMPI 1.4.2 under RHEL 5.5. After installing, I tested with "mpirun -np 4 date", and the command returned four "date" outputs as expected.
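Since "date" is not an MPI program, that test only confirms that mpirun can launch local processes; it does not exercise MPI_Init or any communication. A trivial MPI program along these lines (my own throwaway test, not part of either application, with a file name I made up) would cover that as well:

[code]
/* mpi_hello.c -- my own minimal test: check that MPI_Init completes and
 * that each rank can report itself. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = -1, size = 0;

    MPI_Init(&argc, &argv);   /* a hang here would implicate the MPI runtime itself */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d is alive\n", rank, size);
    MPI_Finalize();
    return 0;
}
[/code]

Built with "mpicc mpi_hello.c -o mpi_hello" and run with "mpirun -np 2 ./mpi_hello", it should print one line per rank if the runtime is healthy.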
Then I tried running two different MPI programs, "geminimpi" and "salinas". Both run correctly with "mpirun -np 1 $prog", but both hang indefinitely whenever I use anything other than "-np 1".

Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following, which looks good and is what I would expect (the error at the end is just geminimpi complaining that its input file is missing):

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster = 0 0 1
Checking license for Gemini
Checking license for Linux OS
Checking internal license list
License valid
GEMINI Startup
Gemini +++ Version 5.1.00 20110501 +++
+++++ ERROR MESSAGE +++++
FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6559 on
node XXX_TUX01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]

With "mpirun --debug-daemons -np 2 geminimpi", it hangs indefinitely at this point:

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]
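The "-np 2" run stalls right after the daemons exchange collective data, which as far as I can tell happens during MPI_Init. To take geminimpi and salinas out of the picture, a bare two-rank ping along the lines of the sketch below (again my own test code, not anything shipped with the applications) should show whether plain MPI point-to-point traffic works on this box:

[code]
/* mpi_ping.c -- my own test program. Rank 0 sends one integer to rank 1
 * and waits for it to come back; run with exactly two ranks. A hang in
 * MPI_Init or in the exchange would point at the MPI layer rather than
 * at the applications. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = -1, token = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("rank 0 got the token back\n");
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
[/code]

If that also hangs with "-np 2", the problem is in the MPI installation on this machine rather than in the applications.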