We've gotten a few reports of problems with memory debugging when using
OpenMPI under TotalView. Usually, TotalView will attach to the
processes started after an MPI_Init. However, when memory debugging
is enabled, things seemed to run away or fail. My analysis
showed that we had a number of core files left over from the attempt,
and all were mpirun (or orterun) cores. It seemed to be a regression
on our part, since testing seemed to indicate this worked okay before
TotalView 8.9.0-0, so I filed an internal bug and passed it to
engineering. After giving our engineer a brief tutorial on how to
build a debug version of OpenMPI, he found what appears to be a problem
in the code for orterun.c. He's made a slight change that fixes the
issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being the versions he's
tested with so far. He doesn't subscribe to this list that I know of,
so I offered to pass this by the group. Of course, I'm not sure if
this is exactly the right place to submit patches, but I'm sure you'd
tell me where to put it if I'm in the wrong here. It's a short patch,
so I'll cut and paste it, and attach as well, since cut and paste can do
weird things to formatting.
Credit goes to Ariel Burton for this patch. Of course he used TotalView
to find this ;-) It shows up if you do 'mpirun -tv -np 4 ./foo' or
'totalview mpirun -a -np 4 ./foo'
Cheers,
PeterT
more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c 2010-04-13 13:30:34.000000000 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c 2011-05-09 20:28:16.588183000 -0400
***************
*** 1578,1588 ****
}
if (NULL != env) {
size1 = opal_argv_count(env);
for (j = 0; j < size1; ++j) {
! putenv(env[j]);
}
}
/* All done */
--- 1578,1600 ----
}
if (NULL != env) {
size1 = opal_argv_count(env);
for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here. putenv does not copy
! the string passed to it, and instead stores only the pointer.
! env[j] may be freed later, in which case the pointer
! in environ will now be left dangling into a deallocated
! region.
! So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
!
! if (NULL == s) {
! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
}
}
/* All done */
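
For anyone who wants to see the putenv behaviour in isolation, here is a
minimal sketch, separate from the patch and not taken from the Open MPI
sources. It assumes a POSIX system where putenv() places the caller's
string directly into the environment. The function names and the DEMO_*
variables are made up for illustration. The first function shows that
editing the buffer after putenv() changes what getenv() returns, which is
why freeing that buffer leaves a dangling pointer. The second and third
show the strdup approach used in the patch.

```c
#define _XOPEN_SOURCE 700   /* expose putenv() and strdup() declarations */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if putenv() kept our pointer, i.e. editing
   the buffer afterwards changes the environment; 0 if not; -1 on error. */
int putenv_aliases_buffer(void)
{
    static char buf[] = "DEMO_ALIAS=before";
    const char *v;

    if (0 != putenv(buf)) {
        return -1;
    }
    /* Overwrite the value in place; "after!" is the same length as "before". */
    memcpy(buf + strlen("DEMO_ALIAS="), "after!", 6);

    v = getenv("DEMO_ALIAS");
    return (NULL != v && 0 == strcmp(v, "after!")) ? 1 : 0;
}

/* Hypothetical helper mirroring the patch: give putenv() a private copy so
   the caller remains free to release its own string. The copy is
   intentionally never freed, because the environment now points at it. */
int putenv_copy(const char *entry)
{
    char *s = strdup(entry);

    if (NULL == s) {
        return -1;
    }
    if (0 != putenv(s)) {
        free(s);
        return -1;
    }
    return 0;
}

/* Returns 1 if the variable is still intact after the caller's string has
   been scribbled over and freed; 0 if not; -1 on error. */
int copy_survives_free(void)
{
    char *entry = strdup("DEMO_COPY=kept");
    const char *v;

    if (NULL == entry) {
        return -1;
    }
    if (0 != putenv_copy(entry)) {
        free(entry);
        return -1;
    }
    memset(entry, 'X', strlen(entry));  /* clobber the caller's string */
    free(entry);

    v = getenv("DEMO_COPY");
    return (NULL != v && 0 == strcmp(v, "kept")) ? 1 : 0;
}
```

Calling both functions from a small main() should return 1 in each case on
a system where putenv() stores the pointer. Another option on POSIX systems
is setenv(), which copies its arguments, but that requires splitting each
entry into name and value first.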