Hi all - I have a bizarre OpenMPI hanging problem. I'm running an MPI code called CP2K (related to, but not the same as cpmd). The complications of the
software aside, here are the observations:

At the base is a serial code that uses system() calls to repeatedly invoke
   mpirun cp2k.popt.
When I run from my NFS-mounted home directory, everything appears to be
fine. When I run from a scratch directory local to each node, it hangs on the
_third_ invocation of CP2K (the 1st and 3rd invocations do computationally
expensive work; the 2nd uses the code in a different mode that does a rather
different, quicker computation). These behaviors are quite repeatable:
run from the NFS-mounted home dir - no problem; run from the node-local
scratch directory - hang. The hang is always in the same place (as far as
the output of the code shows, anyway).
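
The driver itself isn't anything exotic; schematically it does something like
the sketch below (not the real code - the process count, input file names,
and error handling are just placeholders):

   #include <stdio.h>
   #include <stdlib.h>

   int main(void)
   {
       /* schematic serial driver: shell out to mpirun repeatedly,     */
       /* waiting for each CP2K run to finish before starting the next */
       const char *cmds[] = {
           "mpirun -np 8 cp2k.popt step1.inp > step1.out",  /* expensive */
           "mpirun -np 8 cp2k.popt step2.inp > step2.out",  /* quick     */
           "mpirun -np 8 cp2k.popt step3.inp > step3.out"   /* expensive */
       };
       int i;

       for (i = 0; i < 3; i++) {
           int rc = system(cmds[i]);   /* blocks until that mpirun exits */
           if (rc != 0) {
               fprintf(stderr, "invocation %d returned %d\n", i + 1, rc);
               return 1;
           }
       }
       return 0;
   }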

The underlying system is Linux with a 2.6.18-128.1.6.el5 kernel (CentOS 5.3)
on a dual single-core Opteron system with Mellanox InfiniBand SDR cards.
One note of caution: I'm running OFED 1.4.1-rc4, because as far as I can
tell I need 1.4.1 for compatibility with this kernel.

The code is complicated and the input files are big and lead to long
computation times, so I don't think I'll be able to make a simple test case.
Instead, I attached to the hanging processes (all 8 of them) with gdb during
the hang. Two representative stack traces are below. The processes seem to
spend most of their time in btl_openib_component_progress(), and occasionally
in mca_pml_ob1_progress(), i.e. they're not completely stuck, but they're not
making progress.
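
In case it's useful, the backtraces were grabbed by attaching to each
cp2k.popt process on the node and dumping its stack, roughly along these
lines (the PID is a placeholder):

   # attach to one rank, print its backtrace, then detach
   gdb -batch -ex bt -p 12345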

Does anyone have any ideas what could be wrong?

                        Noam

P.S. I get a similar hang with MVAPICH, in a nearby but different part of the
code (on an MPI_Bcast, specifically), which increases my suspicion that it's
really OFED's fault. But maybe the stack traces will suggest to someone
where it might be stuck, and therefore perhaps an MCA flag to try?


#0  0x00002ac2d19d7733 in btl_openib_component_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_btl_openib.so
#1  0x00002ac2cdd4daea in opal_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libopen-pal.so.0
#2  0x00002ac2cd887e55 in ompi_request_default_wait_all () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#3  0x00002ac2d2eb544f in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_coll_tuned.so
#4  0x00002ac2cd89b867 in PMPI_Allreduce () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#5  0x00002ac2cd6429b5 in pmpi_allreduce__ () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi_f77.so.0
#6  0x000000000077e7db in message_passing_mp_mp_sum_r1_ ()
#7  0x0000000000be67dd in sparse_matrix_types_mp_cp_sm_sm_trace_scalar_ ()
#8  0x000000000160b68c in qs_initial_guess_mp_calculate_first_density_matrix_ ()
#9  0x0000000000a7ec05 in qs_scf_mp_scf_env_initial_rho_setup_ ()
#10 0x0000000000a79fca in qs_scf_mp_init_scf_run_ ()
#11 0x0000000000a659fd in qs_scf_mp_scf_ ()
#12 0x00000000008c5713 in qs_energy_mp_qs_energies_ ()
#13 0x00000000008d469e in qs_force_mp_qs_forces_ ()
#14 0x00000000005368bb in force_env_methods_mp_force_env_calc_energy_force_ ()
#15 0x000000000053620e in force_env_methods_mp_force_env_calc_energy_force_ ()
#16 0x0000000000742724 in md_run_mp_qs_mol_dyn_ ()
#17 0x0000000000489c42 in cp2k_runs_mp_cp2k_run_ ()
#18 0x000000000048878a in cp2k_runs_mp_run_input_ ()
#19 0x0000000000487669 in MAIN__ ()
#20 0x000000000048667c in main ()

A second trace, from a process caught in mca_pml_ob1_progress():

#0  0x00002b4d0b57bf09 in mca_pml_ob1_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_pml_ob1.so
#1  0x00002b4d08538aea in opal_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libopen-pal.so.0
#2  0x00002b4d08072e55 in ompi_request_default_wait_all () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#3  0x00002b4d0d6a044f in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_coll_tuned.so
#4  0x00002b4d08086867 in PMPI_Allreduce () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#5  0x00002b4d07e2d9b5 in pmpi_allreduce__ () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi_f77.so.0
#6  0x000000000077e7db in message_passing_mp_mp_sum_r1_ ()
#7  0x0000000000be67dd in sparse_matrix_types_mp_cp_sm_sm_trace_scalar_ ()
#8  0x000000000160b68c in qs_initial_guess_mp_calculate_first_density_matrix_ ()
#9  0x0000000000a7ec05 in qs_scf_mp_scf_env_initial_rho_setup_ ()
#10 0x0000000000a79fca in qs_scf_mp_init_scf_run_ ()
#11 0x0000000000a659fd in qs_scf_mp_scf_ ()
#12 0x00000000008c5713 in qs_energy_mp_qs_energies_ ()
#13 0x00000000008d469e in qs_force_mp_qs_forces_ ()
#14 0x00000000005368bb in force_env_methods_mp_force_env_calc_energy_force_ ()
#15 0x000000000053620e in force_env_methods_mp_force_env_calc_energy_force_ ()
#16 0x0000000000742724 in md_run_mp_qs_mol_dyn_ ()
#17 0x0000000000489c42 in cp2k_runs_mp_cp2k_run_ ()
#18 0x000000000048878a in cp2k_runs_mp_run_input_ ()
#19 0x0000000000487669 in MAIN__ ()
#20 0x000000000048667c in main ()
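
For what it's worth, the kind of MCA knobs I had in mind trying (purely
illustrative - I haven't verified that any of these help here) are things
like:

   # take the openib BTL out of the picture entirely, as a sanity check
   mpirun --mca btl tcp,self -np 8 cp2k.popt ...

   # or keep openib but switch off the tuned collectives / leave-pinned
   mpirun --mca coll ^tuned -np 8 cp2k.popt ...
   mpirun --mca mpi_leave_pinned 0 -np 8 cp2k.popt ...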
