fork() support in OpenFabrics has always been dicey -- it can lead to random behavior like this. Supposedly it works in a specific set of circumstances, but I don't have a recent enough kernel on my machines to test.

It's best not to use calls to system() if they can be avoided. Indeed, Open MPI v1.3.x will warn you if you create a child process after MPI_INIT when using OpenFabrics networks.
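To make the pattern concrete, here is a minimal hypothetical sketch (not taken from CP2K or the driver described below; the "hostname" command is just a placeholder) of a process that calls system() after MPI_INIT while running over OpenFabrics -- the case Open MPI v1.3.x warns about:

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical example: any fork()-based call (system(), popen(), ...)
 * made after MPI_Init is the risky pattern when the openib BTL is in use. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Open MPI 1.3.x should emit its child-process warning here
     * when running over an OpenFabrics network. */
    system("hostname");

    MPI_Finalize();
    return 0;
}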
On May 18, 2009, at 5:05 PM, Noam Bernstein wrote:
Hi all - I have a bizarre OpenMPI hanging problem. I'm running an MPI code called CP2K (related to, but not the same as cpmd). The complications of the software aside, here are the observations:

At the base is a serial code that uses system() calls to repeatedly invoke mpirun cp2k.popt.
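Schematically, the driver does something like the following (a hypothetical sketch, not the real code -- the input file names and the loop count are invented):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of the serial driver: it has no MPI of its own
 * and simply shells out to mpirun repeatedly via system().
 * Input file names and the number of invocations are made up. */
int main(void)
{
    const char *inputs[] = { "stage1.inp", "stage2.inp", "stage3.inp" };

    for (int i = 0; i < 3; i++) {
        char cmd[256];
        snprintf(cmd, sizeof(cmd), "mpirun cp2k.popt %s", inputs[i]);

        /* The hang described below shows up during the third invocation. */
        if (system(cmd) != 0) {
            fprintf(stderr, "invocation %d failed\n", i + 1);
            return 1;
        }
    }
    return 0;
}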
When I run from my NFS mounted home directory, everything appears to be fine. When I run from a scratch directory local to each node, it hangs on the _third_ invocation of CP2K (the 1st and 3rd invocations do computationally expensive stuff, the 2nd uses the code in a different mode which does a rather different and quicker computation). These behaviors are quite repeatable. Run from the NFS mounted home dir - no problem. Run from the node-local scratch directory - hang. The hang is always in the same place (as far as the output of the code shows, anyway).
The underlying system is Linux with a 2.6.18-128.1.6.el5 kernel (CentOS 5.3) on a dual single-core Opteron system with Mellanox InfiniBand SDR cards. One note of caution is that I'm running OFED 1.4.1-rc4, because I need 1.4.1 for compatibility with this kernel, as far as I can tell.
The code is complicated and the input files are big and lead to long computation times, so I don't think I'll be able to make a simple test case. Instead I attached to the hanging processes (all 8 of them) with gdb during the hang. The stack traces are below. Nodes seem to spend most of their time in btl_openib_component_progress(), and occasionally in mca_pml_ob1_progress(), i.e. not completely stuck, but not making progress.

Does anyone have any ideas what could be wrong?
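For reference, the call every rank is stuck in is MPI_Allreduce, so the obvious stripped-down test would be a bare allreduce loop along the lines of this hypothetical sketch (loop count and data are arbitrary), launched the same way via system() from the node-local scratch directory:

#include <mpi.h>
#include <stdio.h>

/* Hypothetical minimal exercise of the call in the stack traces below:
 * every rank loops over MPI_Allreduce. The iteration count is arbitrary. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 1.0, global = 0.0;
    for (int iter = 0; iter < 100000; iter++)
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("done, last sum = %f\n", global);

    MPI_Finalize();
    return 0;
}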
Noam
P.S. I get a similar hang with MVAPICH, in a nearby but different part of the code (on an MPI_Bcast, specifically), increasing my tendency to believe that it's OFED's fault. But maybe the stack trace will suggest to someone where it might be stuck, and therefore perhaps an mca flag to try?
#0 0x00002ac2d19d7733 in btl_openib_component_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_btl_openib.so
#1 0x00002ac2cdd4daea in opal_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libopen-pal.so.0
#2 0x00002ac2cd887e55 in ompi_request_default_wait_all () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#3 0x00002ac2d2eb544f in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_coll_tuned.so
#4 0x00002ac2cd89b867 in PMPI_Allreduce () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#5 0x00002ac2cd6429b5 in pmpi_allreduce__ () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi_f77.so.0
#6 0x000000000077e7db in message_passing_mp_mp_sum_r1_ ()
#7 0x0000000000be67dd in sparse_matrix_types_mp_cp_sm_sm_trace_scalar_ ()
#8 0x000000000160b68c in qs_initial_guess_mp_calculate_first_density_matrix_ ()
#9 0x0000000000a7ec05 in qs_scf_mp_scf_env_initial_rho_setup_ ()
#10 0x0000000000a79fca in qs_scf_mp_init_scf_run_ ()
#11 0x0000000000a659fd in qs_scf_mp_scf_ ()
#12 0x00000000008c5713 in qs_energy_mp_qs_energies_ ()
#13 0x00000000008d469e in qs_force_mp_qs_forces_ ()
#14 0x00000000005368bb in force_env_methods_mp_force_env_calc_energy_force_ ()
#15 0x000000000053620e in force_env_methods_mp_force_env_calc_energy_force_ ()
#16 0x0000000000742724 in md_run_mp_qs_mol_dyn_ ()
#17 0x0000000000489c42 in cp2k_runs_mp_cp2k_run_ ()
#18 0x000000000048878a in cp2k_runs_mp_run_input_ ()
#19 0x0000000000487669 in MAIN__ ()
#20 0x000000000048667c in main ()
#0 0x00002b4d0b57bf09 in mca_pml_ob1_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_pml_ob1.so
#1 0x00002b4d08538aea in opal_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libopen-pal.so.0
#2 0x00002b4d08072e55 in ompi_request_default_wait_all () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#3 0x00002b4d0d6a044f in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_coll_tuned.so
#4 0x00002b4d08086867 in PMPI_Allreduce () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#5 0x00002b4d07e2d9b5 in pmpi_allreduce__ () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi_f77.so.0
#6 0x000000000077e7db in message_passing_mp_mp_sum_r1_ ()
#7 0x0000000000be67dd in sparse_matrix_types_mp_cp_sm_sm_trace_scalar_ ()
#8 0x000000000160b68c in qs_initial_guess_mp_calculate_first_density_matrix_ ()
#9 0x0000000000a7ec05 in qs_scf_mp_scf_env_initial_rho_setup_ ()
#10 0x0000000000a79fca in qs_scf_mp_init_scf_run_ ()
#11 0x0000000000a659fd in qs_scf_mp_scf_ ()
#12 0x00000000008c5713 in qs_energy_mp_qs_energies_ ()
#13 0x00000000008d469e in qs_force_mp_qs_forces_ ()
#14 0x00000000005368bb in force_env_methods_mp_force_env_calc_energy_force_ ()
#15 0x000000000053620e in force_env_methods_mp_force_env_calc_energy_force_ ()
#16 0x0000000000742724 in md_run_mp_qs_mol_dyn_ ()
#17 0x0000000000489c42 in cp2k_runs_mp_cp2k_run_ ()
#18 0x000000000048878a in cp2k_runs_mp_run_input_ ()
#19 0x0000000000487669 in MAIN__ ()
#20 0x000000000048667c in main ()
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems