On Feb 3, 2014, at 1:13 PM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
> On 02/03/2014 03:59 PM, Ralph Castain wrote:
>> Very strange - even if you kill the job with SIGTERM, or have processes
>> that segfault, OMPI should clean itself up and remove those session
>> directories. Granted, the 1.6 series isn't as good about doing so as the
>> 1.7 series, but to date it has at least done pretty well.
>
> Ok, one more piece of information here that may matter: all the
> sequential tests are launched *without* mpiexec... I don't know if the
> "cleanup" phase is done by mpiexec or by the binaries themselves...

Ah, yes, that would be a source of the problem! We can't guarantee cleanup
if you just kill the procs or they segfault *unless* mpiexec is used to
launch the job. What are you using to launch? Most resource managers
provide an "epilog" capability for precisely this purpose, as all MPIs
would exhibit the same issue.
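For example, a per-job epilog along these lines would sweep up whatever a
killed job leaves behind. This is just a sketch: SLURM_JOB_USER is the
Slurm spelling, so substitute the equivalent variable for your resource
manager.

    #!/bin/sh
    # Hypothetical epilog sketch: remove Open MPI session directories
    # (and any stray files squatting on those names) left under /tmp by
    # the user whose job just finished. SLURM_JOB_USER is Slurm-specific;
    # adjust for your resource manager.
    rm -rf /tmp/openmpi-sessions-${SLURM_JOB_USER}*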
>> Best I can suggest for now is to do the following in your test script:
>>
>> (1) set TMPDIR=</tmp/regression>
>>
>> (2) run your tests
>>
>> (3) rm -rf /tmp/regression/*
>>
>> That will ensure you only blow away the session dirs from your
>> regression tests. Hopefully, you'll find the directory empty more often
>> than not...
>
> Ok, I just added:
>
> find /tmp/openmpi-sessions-${USER}* -maxdepth 1 -type f -exec rm {} \;
>
> which should delete files that shouldn't exist... ;-)
>
> But, IMHO, I still think Open MPI should "choose" another directory name
> if it can't create the one it wants because a stray file exists!

We could do that - but then we get into the bottomless pit of trying every
possible combination of directory names, and of ensuring that every process
comes up with the same answer! Remember, the session dir is where the
shared-memory regions rendezvous, so every process on a node has to find
the same place.

> How can all users be aware that they have to clean up such files?

Given how long 1.6.x has been out there, and that this is about the only
time I've heard of a problem, I'm not sure this is a general enough issue
to merit the concern.

> Maybe a good compromise would be for the error message to say that a
> file with the same name as the chosen directory already exists?

I can make that change - good suggestion.

> Or add a new entry to the FAQ to help users find the workaround you
> proposed... ;-)

We can try to do that too.

> Thanks again!
>
> Eric
>
>> HTH
>> Ralph
>>
>> On Feb 3, 2014, at 12:31 PM, Eric Chamberland
>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>
>>> Hi,
>>>
>>> On 02/03/2014 03:09 PM, Ralph Castain wrote:
>>>> OMPI will error out in that case, as you originally reported. What
>>>> seems to be happening is that you have a bunch of stale session
>>>> directories, but I'm puzzled because the creation dates are so recent
>>>> - for whatever reason, OMPI seems to be getting the same jobid much
>>>> more often than it should. Can you tell me something about the
>>>> environment - e.g., is it managed, or are you just using a hostfile?
>>>
>>> This computer is used about 11 times a day to launch about 1500
>>> executions of our in-house (finite element) code.
>>>
>>> We launch at most 12 single-process executions at the same time, but
>>> we use PETSc, which always initializes the MPI environment...
>>>
>>> Also, we launch some tests which use between 2 and 128 processes (on
>>> the same computer) just to ensure proper code testing. In fact,
>>> performance is not really an issue in these 128-process tests, so we
>>> set the following environment variable:
>>>
>>> export OMPI_MCA_mpi_yield_when_idle=1
>>>
>>> because we had encountered timeout problems before...
>>>
>>> The whole test run lasts about 1 hour, and the result is used to give
>>> feedback to the users who "pushed" modifications to the code...
>>>
>>> So I would add: sometimes the tests may be interrupted by segfaults,
>>> "kill -TERM", or anything else you can imagine... The problem now is
>>> that a test won't even start if a mere file exists...
>>>
>>> I can flush those files right now, but I am almost sure they will
>>> reappear in the following days, leading to false "bad results" for the
>>> tests... and I will have to set up a cleanup procedure before
>>> launching all the tests... But even that will not prevent such files
>>> from being created while the first of the 1500 tests are running,
>>> causing one or more of the remaining tests to fail...
>>>
>>> I hope this is the information you wanted... Is it?
>>>
>>> Thanks,
>>>
>>> Eric
>>>
>>>> On Feb 3, 2014, at 12:00 PM, Eric Chamberland
>>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>
>>>>> On 02/03/2014 02:49 PM, Ralph Castain wrote:
>>>>>> Seems rather odd - is your /tmp by any chance network-mounted?
>>>>>
>>>>> No, it is a "normal" /tmp:
>>>>>
>>>>> "cd /tmp; df -h ." gives:
>>>>>
>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>> /dev/sda1        49G   17G   30G  37% /
>>>>>
>>>>> And there is plenty of disk space...
>>>>>
>>>>> I agree it is odd, but how should Open MPI react when trying to
>>>>> create a directory over an existing file name? I mean, what is it
>>>>> programmed to do?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Eric
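For reference, the underlying failure is easy to reproduce from a shell:
mkdir refuses to create a directory on top of an existing file, which is
exactly the condition the session-dir setup trips over. (The path below is
just an illustration, not the real session-dir name.)

    $ touch /tmp/openmpi-sessions-demo  # a stray file squatting on the name
    $ mkdir /tmp/openmpi-sessions-demo  # what the session-dir code attempts
    mkdir: cannot create directory '/tmp/openmpi-sessions-demo': File exists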