Hmm... if you are willing to keep trying, could you let it run for a brief 
time, ctrl-z it, and then do an ls on the directory of a process that has 
already terminated? The pids will be in order, so just look for an early number 
(not mpirun or the parent, of course).

It would help if you could give us the contents of a directory from a child 
process that has terminated - that would tell us which subsystem is failing to 
clean up properly.
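
If it is easier than doing the ls by hand, something like the following could 
collect the same information. This is only a rough sketch, not part of Open 
MPI: the function name oldest_session_subdirs is made up for illustration, and 
it assumes the leftover directories are the numerically named entries under the 
session directory shown in the error message.

```python
import os
import sys


def oldest_session_subdirs(session_root, how_many=3):
    """Count the numerically named subdirectories of session_root and
    return the contents of the lowest-numbered ones.

    Hypothetical helper for illustration only; it assumes stale entries
    are the subdirectories whose names are plain numbers.
    """
    numeric = []
    with os.scandir(session_root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False) and entry.name.isdigit():
                numeric.append(entry.name)
    numeric.sort(key=int)  # early numbers first

    report = []
    for name in numeric[:how_many]:
        top = os.path.join(session_root, name)
        contents = []
        # Record every file and directory below this subdirectory,
        # as paths relative to it.
        for dirpath, dirnames, filenames in os.walk(top):
            for item in dirnames + filenames:
                full = os.path.join(dirpath, item)
                contents.append(os.path.relpath(full, top))
        report.append((name, sorted(contents)))
    return len(numeric), report


if __name__ == "__main__":
    # Pass the session directory as the only argument.
    total, report = oldest_session_subdirs(sys.argv[1])
    print("numeric subdirectories:", total)
    for name, contents in report:
        print(name, contents)
```

Run it while the job is suspended, passing the directory from the error output 
(for example /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386). It only 
reads the tree, so it should be safe to run at any point.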

Thanks - and sorry for the problem.

On Dec 2, 2009, at 2:11 PM, Nicolas Bock wrote:

> 
> 
> On Wed, Dec 2, 2009 at 12:12, Ralph Castain <r...@open-mpi.org> wrote:
> 
> On Dec 2, 2009, at 10:24 AM, Nicolas Bock wrote:
> 
>> 
>> 
>> On Tue, Dec 1, 2009 at 20:58, Nicolas Bock <nicolasb...@gmail.com> wrote:
>> 
>> 
>> On Tue, Dec 1, 2009 at 18:03, Ralph Castain <r...@open-mpi.org> wrote:
>> You may want to check your limits as defined by the shell/system. I can also 
>> run this for as long as I'm willing to let it run, so something else appears 
>> to be going on.
>> 
>> 
>> 
>> Is that with 1.3.3? I found that with 1.3.4 I can run the example much 
>> longer until I hit this error message:
>> 
>> 
>> [master] (31996) forking processes
>> [mujo:14273] opal_os_dirpath_create: Error: Unable to create the 
>> sub-directory 
>> (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998) of 
>> (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998/0), mkdir 
>> failed [1]
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file 
>> util/session_dir.c at line 101
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file 
>> util/session_dir.c at line 425
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file 
>> base/ess_base_std_app.c at line 132
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> 
>>   orte_session_dir failed
>>   --> Returned value Error (-1) instead of ORTE_SUCCESS
>> 
>> 
>> After some googling I found that this is apparently an ext3 filesystem 
>> limitation, i.e. there can be only 31998 subdirectories in a directory. Why 
>> is openmpi creating all of these directories in the first place? Is there a 
>> way to "recycle" them?
> 
> The session directories are built to house shared memory backing files, plus 
> other potential entries depending upon options. They should be deleted upon 
> finalize of each process, so you shouldn't be running out of them.
> 
> I can check to see that the code is cleaning them out (or at least, 
> attempting to do so). Not sure if there is something about ext3 that might 
> retain the directory entries until the "parent" process terminates, even 
> though the files have been deleted.
> 
> If you do an ls on the directory tree, do you see 32k subdirectories? Or do 
> you only see the ones for the active processes?
> 
> That's a good point. As the master process is running I can see the directory 
> fill up. When I Ctrl-C the master, the directory completely disappears. When 
> I let it run all the way to 32K directories, the directory does not disappear 
> and still contains 32K directories even after the master gets killed by MPI.
> 
> Some process must not be closing some file in these directories, which would 
> prevent them from being unlinked, if I understand ext3 correctly.
> 
> nick
