Our scenario is that we are running Python, then importing a module written in Fortran.
We run via:

mpiexec -n 8 -x PYTHONPATH -x SIDL_DLL_PATH python tokHsmNP8.py

where the script calls into the Fortran module, which calls MPI_Init.

On 8 processes (but not on one) the code hangs, on some machines but not others! It is hard to tell precisely where, because the hang occurs inside a PETSc method.

Running with valgrind

mpiexec -n 8 -x PYTHONPATH -x SIDL_DLL_PATH valgrind python tokHsmNP8.py

gives a crash, with some salient output:

==936==
==936== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==936==    at 0x39336DAA79: syscall (in /lib64/libc-2.10.1.so)
==936==    by 0x10BCBD58: opal_paffinity_linux_plpa_api_probe_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10BCE054: opal_paffinity_linux_plpa_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10BCC9F9: opal_paffinity_linux_plpa_have_topology_information (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10BCBBFF: linux_module_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10BC99C3: opal_paffinity_base_select (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10B9DB83: opal_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10920C6C: orte_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-rte.so.0.0.0)
==936==    by 0x10579D06: ompi_mpi_init (in /usr/local/openmpi-1.3.2-notorque/lib/libmpi.so.0.0.0)
==936==    by 0x10599175: PMPI_Init (in /usr/local/openmpi-1.3.2-notorque/lib/libmpi.so.0.0.0)
==936==    by 0x10E2BDF4: mpi_init (in /usr/local/openmpi-1.3.2-notorque/lib/libmpi_f77.so.0.0.0)
==936==    by 0xDF30A1F: uedge_mpiinit_ (in /home/research/cary/projects/facetsall-iter/physics/uedge/par/build/uedgeC.so)
==936==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

This makes me think that our call to mpi_init is wrong.  At

 http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Init.html

it says

Because the Fortran and C versions of MPI_Init are different, there is a restriction on who can call MPI_Init. The version (Fortran or C) must match the main program. That is, if the main program is in C, then the C version of MPI_Init must be called. If the main program is in Fortran, the Fortran version must be called. (See http://www.mpi-forum.org/docs/mpi-11-html/node151.html#node151)
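For concreteness, here is roughly how the two bindings differ as seen from C. This is only a sketch: the Fortran entry point's mangled name (mpi_init_, mpi_init__, etc.) depends on the compiler, and in Open MPI it lives in libmpi_f77:

  /* C binding: takes argc/argv pointers (MPI-2 allows both to be NULL) */
  int MPI_Init(int *argc, char ***argv);

  /* Fortran binding, viewed from C after typical underscore mangling:
     no argc/argv at all, just an output error code */
  void mpi_init_(MPI_Fint *ierr);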

Should I infer from this that, since the Python interpreter is a C program, one must call the C version of MPI_Init (with argc and argv)?

Or, since the module is written mostly in Fortran, with MPI calls of only the Fortran variety, can I initialize with the Fortran MPI_Init?
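In case the C version turns out to be required, here is a minimal sketch of what the extension could export instead of calling the Fortran mpi_init (pkg_mpi_init is a hypothetical name, not anything from uedgeC.so):

  #include <stddef.h>
  #include <mpi.h>

  /* Hypothetical C-side initializer callable from the Python module.
     MPI-2 allows passing NULL for both argc and argv, so Python's
     command line does not need to be forwarded into C. */
  void pkg_mpi_init(void)
  {
      int initialized = 0;
      MPI_Initialized(&initialized);  /* guard against double init */
      if (!initialized)
          MPI_Init(NULL, NULL);
  }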

Thanks.....John Cary





