--- Ralph Castain <r...@lanl.gov> wrote:

> Unfortunately, we don't have more debug statements internal to that
> function. I'll have to create a patch for you that will add some so we can
> better understand why it is failing - will try to send it to you on Wed.

Thank you for the patch you sent.

I solved the problem. It was a head-slapper of an error. It turned out I had
forgotten that the permissions on a mounted filesystem override the
permissions of the mount point. As I mentioned, these machines have an NFS
root filesystem. In that filesystem, tmp has permissions 1777. However, when
each node mounts its local temp partition on /tmp, the mount point takes on
the permissions of the root directory of that local filesystem. In this case,
I had forgotten to apply permissions 1777 to /tmp after mounting on each
machine. As a result, /tmp really did not have the appropriate permissions
for mpirun to write to it as necessary.

Your patch helped me figure this out. Technically, I should have been able to
figure it out from the messages you'd already sent to the mailing list, but
it wasn't until I saw the line in session_dir.c where the error was occurring
that I realized it had to be some kind of permissions error.

I've attached the new debug output below:

[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 108
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 391
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file runtime/orte_init_stage1.c at line 626
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.
This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file runtime/orte_system_init.c at line 42
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 52
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.

The code starting at line 108 of session_dir.c is:

    if (ORTE_SUCCESS != (ret = opal_os_dirpath_create(directory, my_mode))) {
        ORTE_ERROR_LOG(ret);
    }

Three further points:

- Is there some reason ORTE can't bail out gracefully upon this error,
  instead of hanging like it was doing for me?

- I think leaving the extra debug logging code you sent me in the patch in
  future Open MPI versions would be a good idea, to help troubleshoot
  problems like this.

- It would be nice to see "--debug-daemons" added to the Troubleshooting
  section of the FAQ on the web site.

Thank you very, very much for your help, Ralph and everyone else who replied.
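
P.S. For anyone who hits the same symptom: checking the mode of /tmp after the
local partition is mounted would have caught this early. Below is a minimal
sketch in C, assuming a POSIX system. check_dir_mode() is a hypothetical
helper of my own naming, not part of Open MPI, and it only compares
permission bits; it does not prove that a given user can actually create the
session directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* All bits chmod can set: setuid, setgid, sticky, and rwx for u/g/o. */
#define PERM_MASK 07777

/*
 * check_dir_mode() is a hypothetical helper, not an Open MPI function.
 *
 * Returns  1 if `path` is a directory whose permission bits equal `want`,
 *          0 if it is a directory with different permission bits,
 *         -1 if it cannot be stat'ed or is not a directory.
 */
int check_dir_mode(const char *path, mode_t want)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror(path);
        return -1;
    }
    if (!S_ISDIR(st.st_mode)) {
        fprintf(stderr, "%s: not a directory\n", path);
        return -1;
    }
    if ((st.st_mode & PERM_MASK) != want) {
        /* Report both modes in octal so they compare directly with chmod. */
        fprintf(stderr, "%s: mode is %04o, expected %04o\n", path,
                (unsigned)(st.st_mode & PERM_MASK), (unsigned)want);
        return 0;
    }
    return 1;
}
```

Calling check_dir_mode("/tmp", 01777) on each node after mounting should
return 1. A return of 0 means the mount has replaced the mount point's mode
and it still needs fixing, for example with "chmod 1777 /tmp".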