Our scenario is that we are running python, then importing a module
written in Fortran.
We run via:
mpiexec -n 8 -x PYTHONPATH -x SIDL_DLL_PATH python tokHsmNP8.py
where the script calls into Fortran to call MPI_Init.
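For concreteness, the Python side does roughly the following. This is a schematic only: the real call goes through our SIDL-generated bindings, so the call syntax is illustrative, though the module and routine names match the stack trace below.

    # Schematic of the MPI-init step in tokHsmNP8.py; the actual call
    # goes through our bindings, so the syntax here is illustrative.
    import uedgeC             # the Fortran module, built as uedgeC.so
    uedgeC.uedge_mpiinit()    # Fortran wrapper that calls MPI_INIT(ierr)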
On 8 procs (but not on 1) we get hangs in the code (on some machines but
not others!). It is hard to tell precisely where, because the hang is
inside a PETSc method.
Running with valgrind
mpiexec -n 8 -x PYTHONPATH -x SIDL_DLL_PATH valgrind python tokHsmNP8.py
gives a crash, with some salient output:
==936==
==936== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==936==    at 0x39336DAA79: syscall (in /lib64/libc-2.10.1.so)
==936==    by 0x10BCBD58: opal_paffinity_linux_plpa_api_probe_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10BCE054: opal_paffinity_linux_plpa_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10BCC9F9: opal_paffinity_linux_plpa_have_topology_information (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10BCBBFF: linux_module_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10BC99C3: opal_paffinity_base_select (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10B9DB83: opal_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-pal.so.0.0.0)
==936==    by 0x10920C6C: orte_init (in /usr/local/openmpi-1.3.2-notorque/lib/libopen-rte.so.0.0.0)
==936==    by 0x10579D06: ompi_mpi_init (in /usr/local/openmpi-1.3.2-notorque/lib/libmpi.so.0.0.0)
==936==    by 0x10599175: PMPI_Init (in /usr/local/openmpi-1.3.2-notorque/lib/libmpi.so.0.0.0)
==936==    by 0x10E2BDF4: mpi_init (in /usr/local/openmpi-1.3.2-notorque/lib/libmpi_f77.so.0.0.0)
==936==    by 0xDF30A1F: uedge_mpiinit_ (in /home/research/cary/projects/facetsall-iter/physics/uedge/par/build/uedgeC.so)
==936== Address 0x0 is not stack'd, malloc'd or (recently) free'd
This makes me think that our call to mpi_init is wrong. The MPI_Init
page at
http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Init.html
says:
Because the Fortran and C versions of MPI_Init are different, there is
a restriction on who can call MPI_Init. The version (Fortran or C)
must match the main program. That is, if the main program is in C,
then the C version of MPI_Init must be called. If the main program is
in Fortran, the Fortran version must be called.
(MPI_Init in that passage links to
http://www.mpi-forum.org/docs/mpi-11-html/node151.html#node151.)
Should I infer from this that, since Python is a C program, one must
call the C version of MPI_Init (with argc and argv)? Or, since the
module is written mostly in Fortran, with MPI calls only of the
Fortran variety, can I initialize with the Fortran MPI_Init?
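If the C version is in fact required, I suppose we could initialize
from the Python side before importing the module. A minimal, untested
sketch, assuming libmpi.so.0 is loadable with ctypes and that the
MPI-2 allowance of MPI_Init(NULL, NULL) applies:

    # Untested sketch: call the C MPI_Init via ctypes before any
    # Fortran code runs. RTLD_GLOBAL keeps the MPI symbols visible
    # to extension modules loaded afterwards.
    import ctypes
    mpi = ctypes.CDLL("libmpi.so.0", mode=ctypes.RTLD_GLOBAL)
    mpi.MPI_Init(None, None)  # ctypes passes None as NULL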
Thanks.....John Cary