On Apr 14, 2011, at 9:22 AM, Rushton Martin wrote:

> For your information: we were supplied with a script when we bought the
> cluster, but the original script made the assumption that all processes
> and shm files belonging to a specific user ought to be deleted.  This is
> a problem if users submit jobs which only half fill a node and the
> second job starts on the same node as the first one.  The first job to
> finish causes the continuing job to stop dead.  We therefore had to
> disable any cleanup to allow jobs to run.  Now we are seeing a slow
> fill-up of shm files and I need to do something; at least now I
> have a way forward.
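
A safer cleanup than deleting everything a given user owns is to remove only
the shm files that no running process still has mapped.  A rough sketch of a
node epilog along those lines is below -- note that the /dev/shm location,
the one-hour age threshold, and the /proc/PID/maps check are my assumptions,
not anything from your vendor's script:

#!/usr/bin/env python
"""Sketch of a node-epilog cleanup that removes only orphaned shm files.

Assumptions (mine, not from the original script): files live in /dev/shm,
and a file is "in use" if any running process has it mapped according to
/proc/PID/maps.  Needs root to read every process's maps reliably.
"""
import os
import glob
import time

SHM_DIR = "/dev/shm"
MIN_AGE_SECS = 3600  # hypothetical threshold: leave very recent files alone

def mapped_paths():
    """Collect every /dev/shm path currently mapped by a running process."""
    paths = set()
    for maps in glob.glob("/proc/[0-9]*/maps"):
        try:
            with open(maps) as f:
                for line in f:
                    fields = line.split(None, 5)
                    if len(fields) == 6 and fields[5].startswith(SHM_DIR):
                        paths.add(fields[5].rstrip("\n"))
        except IOError:
            continue  # process exited while we were scanning; ignore it
    return paths

def cleanup():
    in_use = mapped_paths()
    now = time.time()
    for path in glob.glob(os.path.join(SHM_DIR, "*")):
        try:
            if path in in_use:
                continue  # some live process still has this file mapped
            if now - os.stat(path).st_mtime < MIN_AGE_SECS:
                continue  # recent file; a job may still be starting up
            os.unlink(path)
            print("removed orphaned shm file: %s" % path)
        except OSError:
            continue  # vanished or not removable; skip

if __name__ == "__main__":
    cleanup()

Because the check is per-file rather than per-user, a continuing job on a
half-filled node keeps its shm files even after the other job's epilog runs.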

Note that Open MPI v1.4.x is likely using mmap'ed backing files by default -- these 
should be under /tmp/ somewhere.  If they get left around, they can fill up that 
filesystem, but they should be unrelated to /dev/shm kinds of things.  If you're 
seeing /dev/shm fill up, that might be due to something else.
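
If you want to see what's been left behind, something like this can total up
leftover session directories.  The /tmp/openmpi-sessions-* naming is an
assumption on my part -- check what your install actually creates under /tmp:

#!/usr/bin/env python
"""List leftover Open MPI session directories and their sizes.

Assumption (mine): Open MPI 1.4.x puts its mmap backing files in per-user
session directories matching /tmp/openmpi-sessions-*; verify locally.
"""
import os
import glob

def du(path):
    """Total size in bytes of everything under path."""
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished mid-scan
    return total

for session in sorted(glob.glob("/tmp/openmpi-sessions-*")):
    print("%s: %.1f MB" % (session, du(session) / 1048576.0))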

Also, I'm a little confused by your reference to psm_shm... are you talking 
about the QLogic PSM device?  If that does some tomfoolery with /dev/shm 
somewhere, I'm unaware of it (i.e., I don't know much, if anything, about what 
that device does internally).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

