Hi Brian
On Aug 16, 2008, at 1:40 PM, Brian Dobbins wrote:
Hi guys,
I was hoping someone here could shed some light on OpenMPI's use
of /tmp (or, I guess, TMPDIR) and save me from diving into the
source.. ;)
The background is that I'm trying to run some applications on a
system which has a flaky parallel file system which TMPDIR is mapped
to - so, on startup, OpenMPI creates its 'openmpi-sessions-<user>'
directory there, and under that, a few files. Sometimes I see 1
subdirectory (usually a 0), sometimes a 0 and a 1, etc. In one of
these, sometimes I see files such as 'shared_memory_pool.<host>',
and 'shared_memory_module.<host>'.
My questions are, one, what are the various numbers / files for?
(If there's a write-up on this somewhere, just point me towards it!)
The numbers just correspond to the jobid and vpid of the processes on
the node. We use them to ensure that each process has its own
"trusted" location where it can store tmp files without concerns for
stepping on each other. These directories generally do not get used
except for storing the shared memory files and for debugging output in
the case of an internal OMPI error.
The shared_memory files are backing files for shared memory operations.
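To check which of these directories and files actually exist on a node, a minimal sketch along these lines may help. It assumes only what is described above: a directory whose name starts with 'openmpi-sessions-' under TMPDIR, with numbered subdirectories and the shared memory files below it. The exact naming and nesting can differ between Open MPI versions, and the function name list_session_files is made up for illustration.

```python
import os
import tempfile


def list_session_files(tmp_root=None, prefix="openmpi-sessions-"):
    """Return {session_dir_name: [file paths relative to that dir]}.

    tmp_root defaults to tempfile.gettempdir(), which honours TMPDIR.
    The prefix is an assumption taken from the directory name quoted above.
    """
    root = tmp_root or tempfile.gettempdir()
    found = {}
    for name in sorted(os.listdir(root)):
        top = os.path.join(root, name)
        # Only look at directories that match the session prefix.
        if not (name.startswith(prefix) and os.path.isdir(top)):
            continue
        files = []
        # Walk the numbered subdirectories and collect every file below them.
        for dirpath, _dirnames, filenames in os.walk(top):
            for fname in filenames:
                full = os.path.join(dirpath, fname)
                files.append(os.path.relpath(full, top))
        found[name] = sorted(files)
    return found


if __name__ == "__main__":
    for session, files in list_session_files().items():
        print(session)
        for path in files:
            print("   ", path)
```

Running this on a compute node while a job is active should show whether the shared_memory files are sitting on the parallel file system.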
And two, the real question, are these 'files' used during runtime,
or only upon startup / shutdown? I'm having issues with various
codes, especially those heavy on messages and I/O, failing to
complete a run, and haven't resorted to sifting through strace's
output yet. This doesn't happen all the time, but I've seen it
happen reliably now with one particular code - its success rate (it
DOES succeed sometimes) is about 25% right now. My best guess is
that this is because the file system is overloaded, thus not
allowing timely I/O or access to OpenMPI's files, but I wanted to
get a quick understanding of how these files are used by OpenMPI and
whether the FS does indeed seem a likely culprit before going with
that theory for sure.
I would guess that you are having a problem with shared memory
operations. Try using "-mca btl ^sm" on your command line to turn off
shared memory and see if your success rate goes up - if so, then you
have identified the problem!
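Since the failure is intermittent, a single run will not tell you much. One way to compare success rates is to run the job several times with and without the flag and count the clean exits. The following is only a rough sketch: the helper name success_rate, the process count, and ./my_app are placeholders, and it treats any non-zero exit status or timeout as a failure.

```python
import subprocess


def success_rate(cmd, runs=10, timeout=None):
    """Run cmd `runs` times; return the fraction that exit with status 0."""
    ok = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode == 0:
                ok += 1
        except subprocess.TimeoutExpired:
            # A run that hangs past the timeout counts as a failure.
            pass
    return ok / runs


# Example usage (hypothetical job; adjust the launcher and application):
#
#   launcher = ["mpirun", "-np", "4"]
#   app = ["./my_app"]
#   default_rate = success_rate(launcher + app)
#   no_sm_rate = success_rate(launcher + ["-mca", "btl", "^sm"] + app)
#   print(f"default: {default_rate:.0%}, without sm: {no_sm_rate:.0%}")
```

Because the command is passed as a list, no shell is involved and the "^" needs no quoting. If you type the command by hand in a shell that treats "^" specially, you may need to quote it.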
Ralph
Thanks very much,
- Brian
Brian Dobbins
Yale Engineering HPC