Guess I'm having trouble reading your diff...different notation than I'm used to seeing. I'll have to parse through it when I have more time.

On May 16, 2011, at 1:02 PM, Peter Thompson wrote:

> Hmmm? We're not removing the putenv() calls. Just adding a strdup()
> beforehand, and then calling putenv() with the string duplicated from env[j].
> Of course, if the strdup fails, then we bail out.
>
> As for why it's suddenly a problem, I'm not quite as certain. The problem
> we do show is a double free, so someone has already freed that memory used by
> putenv(), and I do know that while that used to be just flagged as an event
> before, now we seem to be unable to continue past it. Not sure if that is
> our change or a library/system change.
>
> PeterT
>
>
> Ralph Castain wrote:
>> On May 16, 2011, at 12:45 PM, Peter Thompson wrote:
>>
>>> Hi Ralph,
>>>
>>> We've had a number of user complaints about this. Since it seems on the
>>> face of it that it is a debugger issue, it may not have made its way back
>>> here. Is your objection that the patch basically aborts if it gets a bad
>>> value? I could understand that being a concern. Of course, it aborts on
>>> TotalView now if we attempt to move forward without this patch.
>>>
>>
>> No - my concern is that you appear to be removing the "putenv" calls. OMPI
>> places some values into the local environment so the user can control
>> behavior. Removing those causes problems.
>>
>> What I need to know is why, after it has worked with TV for years, these
>> putenv's are suddenly a problem. Is the problem occurring during shutdown?
>> Or is this something that causes TV to break?
>>
>>> I've passed your comment back to the engineer, with a suspicion about the
>>> concerns about the abort, but if you have other objections, let me know.
>>>
>>> Cheers,
>>> PeterT
>>>
>>>
>>> Ralph Castain wrote:
>>>
>>>> That would be a problem, I fear. We need to push those envars into the
>>>> environment.
>>>>
>>>> Is there some particular problem causing what you see? We have no other
>>>> reports of this issue, and orterun has had that code forever.
>>>>
>>>> Sent from my iPad
>>>>
>>>> On May 11, 2011, at 2:05 PM, Peter Thompson <peter.thomp...@roguewave.com>
>>>> wrote:
>>>>
>>>>> We've gotten a few reports of problems with memory debugging when using
>>>>> OpenMPI under TotalView. Usually, TotalView will attach to the processes
>>>>> started after an MPI_Init. However in the case where memory debugging is
>>>>> enabled, things seemed to run away or fail. My analysis showed that we
>>>>> had a number of core files left over from the attempt, and all were
>>>>> mpirun (or orterun) cores. It seemed to be a regression on our part,
>>>>> since testing seemed to indicate this worked okay before TotalView
>>>>> 8.9.0-0, so I filed an internal bug and passed it to engineering. After
>>>>> giving our engineer a brief tutorial on how to build a debug version of
>>>>> OpenMPI, he found what appears to be a problem in the code for orterun.c.
>>>>> He's made a slight change that fixes the issue in 1.4.2, 1.4.3,
>>>>> 1.4.4rc2 and 1.5.3, those being the versions he's tested with so far.
>>>>> He doesn't subscribe to this list that I know of, so I offered to pass
>>>>> this by the group. Of course, I'm not sure if this is exactly the right
>>>>> place to submit patches, but I'm sure you'd tell me where to put it if
>>>>> I'm in the wrong here. It's a short patch, so I'll cut and paste it,
>>>>> and attach as well, since cut and paste can do weird things to formatting.
>>>>>
>>>>> Credit goes to Ariel Burton for this patch. Of course he used TotalView
>>>>> to find this ;-) It shows up if you do 'mpirun -tv -np 4 ./foo' or
>>>>> 'totalview mpirun -a -np 4 ./foo'
>>>>>
>>>>> Cheers,
>>>>> PeterT
>>>>>
>>>>>
>>>>> more ~/patches/anbs-patch
>>>>> *** orte/tools/orterun/orterun.c 2010-04-13 13:30:34.000000000 -0400
>>>>> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c 2011-05-09 20:28:16.588183000 -0400
>>>>> ***************
>>>>> *** 1578,1588 ****
>>>>>       }
>>>>>
>>>>>       if (NULL != env) {
>>>>>           size1 = opal_argv_count(env);
>>>>>           for (j = 0; j < size1; ++j) {
>>>>> !             putenv(env[j]);
>>>>>           }
>>>>>       }
>>>>>
>>>>>       /* All done */
>>>>>
>>>>> --- 1578,1600 ----
>>>>>       }
>>>>>
>>>>>       if (NULL != env) {
>>>>>           size1 = opal_argv_count(env);
>>>>>           for (j = 0; j < size1; ++j) {
>>>>> !             /* Use-after-Free error possible here. putenv does not copy
>>>>> !                the string passed to it, and instead stores only the pointer.
>>>>> !                env[j] may be freed later, in which case the pointer
>>>>> !                in environ will now be left dangling into a deallocated
>>>>> !                region.
>>>>> !                So we make a copy of the variable.
>>>>> !             */
>>>>> !             char *s = strdup(env[j]);
>>>>> !
>>>>> !             if (NULL == s) {
>>>>> !                 return OPAL_ERR_OUT_OF_RESOURCE;
>>>>> !             }
>>>>> !             putenv(s);
>>>>>           }
>>>>>       }
>>>>>
>>>>>       /* All done */
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>
>>
>
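
For anyone who finds the context-diff notation hard to follow, the idea behind the quoted patch can be sketched as a small standalone helper. This is only an illustration under stated assumptions, not the orterun.c code: the name export_env_copies is hypothetical, it walks a NULL-terminated array rather than using opal_argv_count(), and it returns -1 where the patch returns OPAL_ERR_OUT_OF_RESOURCE. It relies on the POSIX behavior that putenv() keeps the pointer it is given instead of copying the string.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* expose putenv() and strdup() under strict -std= modes */
#endif

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): export every "NAME=value"
 * entry of a NULL-terminated array through putenv(), but hand putenv()
 * a private copy of each string. The caller can then free or overwrite
 * its own strings without leaving a dangling pointer in environ. */
static int export_env_copies(char **env)
{
    size_t j;

    if (NULL == env) {
        return 0; /* nothing to export */
    }

    for (j = 0; NULL != env[j]; ++j) {
        char *s = strdup(env[j]);

        if (NULL == s) {
            return -1; /* out of memory; the patch bails out here too */
        }
        if (0 != putenv(s)) {
            free(s); /* putenv() failed, so the copy is still ours */
            return -1;
        }
        /* On success s is deliberately not freed: the environment
         * now refers to that memory. */
    }
    return 0;
}
```

The copies are never released, which is the usual trade-off with putenv(): each exported variable costs one small allocation that lives as long as the environment refers to it.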