Hi Brian

On Aug 16, 2008, at 1:40 PM, Brian Dobbins wrote:

Hi guys,

I was hoping someone here could shed some light on OpenMPI's use of /tmp (or, I guess, TMPDIR) and save me from diving into the source.. ;)

The background is that I'm trying to run some applications on a system whose TMPDIR is mapped to a flaky parallel file system. On startup, OpenMPI creates its 'openmpi-sessions-<user>' directory there and, under that, a few files. Sometimes I see one subdirectory (usually a 0), sometimes a 0 and a 1, etc. In one of these, I sometimes see files such as 'shared_memory_pool.<host>' and 'shared_memory_module.<host>'.

My questions are, one, what are the various numbers / files for? (If there's a write-up on this somewhere, just point me towards it!)

The numbers correspond to the jobid and vpid of the processes on the node. We use them to ensure that each process has its own "trusted" location where it can store temporary files without the processes stepping on each other. These directories generally do not get used except for storing the shared memory files and for debugging output in the case of an internal OMPI error.

The shared_memory files are backing files for shared memory operations.
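
To see this layout on a given node, something like the following sketch may help. It only assumes what is described above: a directory whose name starts with 'openmpi-sessions-' under TMPDIR (falling back to /tmp), with numbered subdirectories and shared memory files beneath it. The function name list_ompi_sessions is made up for illustration, and the exact directory naming may differ between Open MPI versions.

```shell
# Hypothetical helper: print each Open MPI session directory found directly
# under the given base (default: $TMPDIR, else /tmp), followed by everything
# beneath it (the numbered subdirectories and any shared memory backing files).
list_ompi_sessions() {
    base=${1:-${TMPDIR:-/tmp}}
    find "$base" -maxdepth 1 -type d -name 'openmpi-sessions-*' 2>/dev/null |
    while IFS= read -r dir; do
        echo "$dir"
        find "$dir" -mindepth 1 | sort
    done
}

# Usage, on a compute node while a job is running:
#   list_ompi_sessions
#   list_ompi_sessions /some/other/tmpdir
```

Running it while a job is active should show which numbered subdirectories exist and whether the shared_memory files are sitting on the parallel file system.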



And two, the real question, are these 'files' used during runtime, or only upon startup / shutdown? I'm having issues with various codes, especially those heavy on messages and I/O, failing to complete a run, and haven't resorted to sifting through strace's output yet. This doesn't happen all the time, but I've seen it happen reliably now with one particular code - its success rate (it DOES succeed sometimes) is about 25% right now. My best guess is that this is because the file system is overloaded, thus not allowing timely I/O or access to OpenMPI's files, but I wanted to get a quick understanding of how these files are used by OpenMPI and whether the FS does indeed seem a likely culprit before going with that theory for sure.

I would guess that you are having a problem with shared memory operations. Try using "-mca btl ^sm" on your command line to turn off the shared memory BTL and see if your success rate goes up - if so, then you have identified the problem!
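
Since the failure is intermittent, a single run with and without the flag will not say much. A rough way to compare is to run the job several times each way and count the successes, as in the sketch below. The helper name count_successes, the process count, the number of runs, and ./my_app are all placeholders, not anything from Open MPI itself; only the "-mca btl ^sm" option comes from the suggestion above.

```shell
# Hypothetical helper: run a command N times and print how many of those
# runs exited with status 0. Output of the command itself is discarded.
count_successes() {
    runs=$1; shift
    ok=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok"
}

# Usage (assumes mpirun is on PATH; ./my_app and the counts are placeholders):
#   count_successes 20 mpirun -np 8 ./my_app
#   count_successes 20 mpirun -np 8 -mca btl '^sm' ./my_app
# The caret is quoted because some shells treat it specially.
```

If the second count is clearly higher than the first, that points at the shared memory backing files on the parallel file system rather than at the application.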

Ralph



  Thanks very much,
  - Brian


Brian Dobbins
Yale Engineering HPC