On Apr 14, 2011, at 9:22 AM, Rushton Martin wrote:

> For your information: we were supplied with a script when we bought the
> cluster, but the original script assumed that all processes and shm
> files belonging to a specific user ought to be deleted. This is a
> problem if users submit jobs which only half-fill a node and a second
> job starts on the same node as the first. The first job to finish then
> causes the continuing job to stop dead. We therefore had to disable
> all cleanup to allow jobs to run. Now we are seeing a slow fill-up of
> shm files and I need to do something about it; at least now I have a
> way forward.
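[A safer cleanup than "delete everything owned by the user" is to remove only the shm files that no live process still has mapped. A minimal sketch of that idea, in Python; the function names and the dry_run flag are illustrative, not part of any supplied script. It walks /proc/*/maps to find which /dev/shm files are still in use:

    #!/usr/bin/env python
    # Hedged sketch: delete only the /dev/shm files that no live process
    # still has mapped, instead of everything owned by a given user
    # (which is what killed the second job sharing a node).
    import glob
    import os

    def shm_files_in_use():
        """Return the set of /dev/shm paths mapped by any live process."""
        in_use = set()
        for maps in glob.glob("/proc/[0-9]*/maps"):
            try:
                with open(maps) as f:
                    for line in f:
                        # The backing file, if present, is the sixth field.
                        path = line.rstrip("\n").split(None, 5)[-1]
                        if path.startswith("/dev/shm/"):
                            in_use.add(path)
            except IOError:
                pass  # process exited between glob() and open()
        return in_use

    def cleanup(dry_run=True):
        live = shm_files_in_use()
        for path in glob.glob("/dev/shm/*"):
            if path not in live:
                print("stale: %s" % path)
                if not dry_run:
                    os.unlink(path)

    if __name__ == "__main__":
        cleanup(dry_run=True)  # flip to False once the output looks sane

Run it dry first. There is an inherent race if a new job maps a file between the scan and the unlink, so it is best run from the batch system's epilogue, when the node's job state is known.]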
Note that Open MPI v1.4.x is likely using mmap files by default -- these should be under /tmp/ somewhere. If they get left around, they can cause shared memory to fill up, but they should be unrelated to /dev/shm kinds of things. If you're seeing /dev/shm fill up, that might be due to something else.

Also, I'm a little confused by your reference to psm_shm... are you talking about the QLogic PSM device? If that does some tomfoolery with /dev/shm somewhere, I'm unaware of it (i.e., I don't know much/anything about what that device does internally).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
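[For the /tmp side, the same scan can flag Open MPI session directories whose files are no longer mapped by any process. A sketch under one assumption: the "openmpi-sessions-*" naming matches what Open MPI 1.4.x typically creates under /tmp, but verify it on your nodes before removing anything:

    #!/usr/bin/env python
    # Hedged sketch: flag Open MPI session directories under /tmp whose
    # files are no longer mapped by any live process. The directory
    # pattern is an assumption; check it against your installation.
    import glob
    import os

    def paths_in_use(prefix):
        """Paths under `prefix` still mapped by a live process."""
        used = set()
        for maps in glob.glob("/proc/[0-9]*/maps"):
            try:
                with open(maps) as f:
                    for line in f:
                        path = line.rstrip("\n").split(None, 5)[-1]
                        if path.startswith(prefix):
                            used.add(path)
            except IOError:
                pass  # process exited mid-scan
        return used

    live = paths_in_use("/tmp/openmpi-sessions-")
    for session in glob.glob("/tmp/openmpi-sessions-*"):
        files = [os.path.join(d, name)
                 for d, _, names in os.walk(session)
                 for name in names]
        if not any(f in live for f in files):
            print("candidate for removal: %s" % session)

This only prints candidates; actual removal is deliberately left out. Note that a freshly created, still-empty session directory would also be flagged, so pair this with an age check (or run it from the batch epilogue) in practice.]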