Hooray! Glad we could help track this down - sorry it was so hard to find.

To answer your questions:

1. Yes - ORTE should bail out gracefully. It definitely should not hang. I
will log the problem and investigate. I believe I know where the problem
lies, and it may already be fixed on our trunk, but the fix may not get into
the 1.2 family (have to see what it would entail).

2. I will definitely commit that debug code and ensure it is in future
releases.

3. I'll see if we can add something about --debug-daemons to the FAQ -
thanks for pointing out that oversight.
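
For anyone who hits the same symptom: the root cause described in the quoted
message below was a locally mounted /tmp that lacked mode 1777. Here is a
minimal sketch of a standalone check that could be run on each node. It is
not ORTE code, and the function name dir_has_mode_1777 is hypothetical, used
only for illustration.

```c
#include <sys/stat.h>

/* Hypothetical helper (not part of Open MPI).
 * Returns  1 if `path` is a directory whose mode bits are exactly 1777
 *            (world-writable plus the sticky bit),
 *          0 if it exists but is not such a directory,
 *         -1 if stat() fails (for example, the path does not exist). */
int dir_has_mode_1777(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        return -1;
    }
    if (!S_ISDIR(st.st_mode)) {
        return 0;
    }
    /* Keep only the permission, setuid/setgid and sticky bits. */
    return ((st.st_mode & 07777) == 01777) ? 1 : 0;
}
```

Calling dir_has_mode_1777("/tmp") after the local partition is mounted should
return 1. A return of 0 would match the situation in the quoted message, where
running "chmod 1777 /tmp" on the mounted filesystem was the fix.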

Thanks
Ralph



On 7/18/07 12:19 PM, "Bill Johnstone" <beejsto...@yahoo.com> wrote:

> 
> --- Ralph Castain <r...@lanl.gov> wrote:
> 
>> Unfortunately, we don't have more debug statements internal to that
>> function. I'll have to create a patch for you that will add some so we
>> can better understand why it is failing - will try to send it to you
>> on Wed.
> 
> Thank you for the patch you sent.
> 
> I solved the problem.  It was a head-slapper of an error.  Turned out
> that I had forgotten -- the permissions on the filesystem override the
> permissions of the mount point.  As I mentioned, these machines have an
> NFS root filesystem.  In that filesystem, tmp has permissions 1777.
> However, when each node mounts its local temp partition to /tmp, the
> permissions on that filesystem are the permissions the mount point
> takes on.
> 
> In this case, I had forgotten to apply permissions 1777 to /tmp after
> mounting on each machine.  As a result, /tmp really did not have the
> appropriate permissions for mpirun to write to it as necessary.
> 
> Your patch helped me figure this out.  Technically, I should have been
> able to figure it out from the messages you'd already sent to the
> mailing list, but it wasn't until I saw the line in session_dir.c where
> the error was occurring that I realized it had to be some kind of
> permissions error.
> 
> I've attached the new debug output below:
> 
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> util/session_dir.c at line 108
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> util/session_dir.c at line 391
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> runtime/orte_init_stage1.c at line 626
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_session_dir failed
>   --> Returned value -1 instead of ORTE_SUCCESS
> 
> --------------------------------------------------------------------------
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> runtime/orte_system_init.c at line 42
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 52
> Open RTE was unable to initialize properly.  The error occured while
> attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
> 
> The code starting at line 108 of session_dir.c is:
> 
> if (ORTE_SUCCESS != (ret = opal_os_dirpath_create(directory, my_mode)))
> {
>         ORTE_ERROR_LOG(ret);
> }
> 
> Three further points:
> 
> -Is there some reason ORTE can't bail out gracefully upon this error,
> instead of hanging like it was doing for me?
> 
> -I think leaving in the extra debug logging code you sent me in the
> patch for future Open MPI versions would be a good idea to help
> troubleshoot problems like this.
> 
> -It would be nice to see "--debug-daemons" added to the Troubleshooting
> section of the FAQ on the web site.
> 
> Thank you very, very much for your help, Ralph and everyone else who replied.
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

