We've gotten a few reports of problems with memory debugging when using OpenMPI under TotalView. Usually, TotalView will attach to the processes started after an MPI_Init. However, when memory debugging is enabled, things seemed to run away or fail. My analysis showed that we had a number of core files left over from the attempt, and all were mpirun (or orterun) cores. It looked like a regression on our part, since testing indicated this worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it to engineering.

After I gave our engineer a brief tutorial on how to build a debug version of OpenMPI, he found what appears to be a problem in the code for orterun.c. He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being the versions he's tested with so far.

He doesn't subscribe to this list that I know of, so I offered to pass this by the group. I'm not sure if this is exactly the right place to submit patches, but I'm sure you'll tell me where to send it if this is the wrong place. It's a short patch, so I'll cut and paste it, and attach it as well, since cut and paste can do weird things to formatting.

Credit goes to Ariel Burton for this patch. Of course he used TotalView to find this ;-) The problem shows up if you run 'mpirun -tv -np 4 ./foo' or 'totalview mpirun -a -np 4 ./foo'.

Cheers,
PeterT


more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c        2010-04-13 13:30:34.000000000 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c 2011-05-09 20:28:16.588183000 -0400
***************
*** 1578,1588 ****
     }

     if (NULL != env) {
         size1 = opal_argv_count(env);
         for (j = 0; j < size1; ++j) {
!             putenv(env[j]);
         }
     }

     /* All done */

--- 1578,1600 ----
     }

     if (NULL != env) {
         size1 = opal_argv_count(env);
         for (j = 0; j < size1; ++j) {
!             /* Use-after-Free error possible here.  putenv does not copy
!                the string passed to it, and instead stores only the pointer.
!                env[j] may be freed later, in which case the pointer
!                in environ will now be left dangling into a deallocated
!                region.
!                So we make a copy of the variable.
!             */
!             char *s = strdup(env[j]);
!
!             if (NULL == s) {
!                 return OPAL_ERR_OUT_OF_RESOURCE;
!             }
!             putenv(s);
         }
     }

     /* All done */
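
For anyone who wants to see the putenv behavior outside of orterun, here is a minimal standalone sketch. It is not Open MPI code: the function names and the TV_DEMO_VAR variable are made up for illustration. It assumes a POSIX system, where putenv() places the caller's own string into the environment rather than copying it.

```c
#define _XOPEN_SOURCE 700   /* declares putenv() and strdup() under strict -std=c99 */
#include <stdlib.h>
#include <string.h>

/* Hypothetical demo: returns 1 if a write to the caller's buffer after
 * putenv() is visible through getenv(), i.e. the environment holds the
 * caller's pointer and not a copy. Returns -1 if putenv() fails. */
static int putenv_aliases_buffer(void)
{
    /* static, so the buffer outlives this call and nothing dangles */
    static char buf[] = "TV_DEMO_VAR=before";

    if (putenv(buf) != 0) {
        return -1;
    }
    /* Overwrite the value in place; "before" and "after!" are both 6 chars. */
    memcpy(buf + strlen("TV_DEMO_VAR="), "after!", 6);

    const char *v = getenv("TV_DEMO_VAR");
    return v != NULL && strcmp(v, "after!") == 0;
}

/* Hypothetical helper mirroring the idea of the patch: export every
 * "NAME=value" string of a NULL-terminated array through a private copy,
 * so the caller may free or reuse its own strings afterwards.
 * Returns 0 on success, -1 on failure. */
static int export_env_copies(char **env)
{
    for (size_t j = 0; env != NULL && env[j] != NULL; ++j) {
        char *s = strdup(env[j]);   /* deliberately never freed on success */
        if (s == NULL) {
            return -1;              /* out of memory */
        }
        if (putenv(s) != 0) {
            free(s);                /* putenv failed, so it did not keep s */
            return -1;
        }
    }
    return 0;
}
```

The copies are intentionally leaked, since the environment keeps pointing at them for the life of the process. Where changing the call is an option, setenv() is an alternative that makes its own copy of the name and value.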

