Hi All,

We have bundled Open MPI with our product and shipped it to the customer. Following http://www.open-mpi.org/faq/?category=building#installdirs , below is the command we use to launch an MPI program:
env OPAL_PREFIX=/path/to/openmpi \
/path/to/openmpi/bin//orterun --prefix /path/to/openmpi -x PATH -x LD_LIBRARY_PATH -x OPAL_PREFIX -np 2 --host host1,host2 ring_c

Interestingly, it always works under csh/tcsh, but quite a few users have told us that they run into the errors below:

[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
  orte_init:startup:internal-failure
from the file:
  help-orte-runtime
But I couldn't find any file matching that name.  Sorry!
--------------------------------------------------------------------------
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
  orted:init-failure
from the file:
  help-orted.txt
But I couldn't find any file matching that name.  Sorry!


Jeff did mention in http://www.open-mpi.org/community/lists/users/2008/09/6582.php that OPAL_PREFIX was propagated for him automatically. I bet Jeff uses csh/tcsh.
Anyway, it can be traced back to how the daemon is launched.

sh/bash:

[xxxxx:25369] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh xxxxx
OPAL_PREFIX=/opt/openmpi-1.2.4 ;
PATH=/opt/openmpi-1.2.4/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/opt/openmpi-1.2.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;

csh/tcsh:
[xxxxx:09886] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh xxxxx
setenv OPAL_PREFIX /opt/openmpi-1.2.4 ;
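
The difference between the two traces comes down to shell semantics: in csh/tcsh, setenv always places the variable in the environment, while in sh/bash a plain assignment stays local to the shell until it is exported. Here is a minimal sketch that reproduces this locally, without ssh. The child `sh -c` stands in for the daemon, and the path is simply the one from the trace.

```shell
# Start clean so an inherited OPAL_PREFIX does not mask the effect.
unset OPAL_PREFIX

# Plain assignment, as in the sh/bash trace: not visible to child processes.
OPAL_PREFIX=/opt/openmpi-1.2.4
unexported=$(sh -c 'echo "${OPAL_PREFIX:-unset}"')

# After export, child processes inherit the variable.
export OPAL_PREFIX
exported=$(sh -c 'echo "${OPAL_PREFIX:-unset}"')

echo "without export: $unexported"   # without export: unset
echo "with export:    $exported"     # with export:    /opt/openmpi-1.2.4
```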


It seems to work after I patched pls_rsh_module.c:


--- pls_rsh_module.c.orig       2008-10-16 17:15:32.000000000 -0400
+++ pls_rsh_module.c    2008-10-16 17:15:51.000000000 -0400
@@ -989,7 +989,7 @@
                                  "%s/%s/%s",
                                  (opal_prefix != NULL ? "OPAL_PREFIX=" : ""),
                                  (opal_prefix != NULL ? opal_prefix : ""),
-                                  (opal_prefix != NULL ? " ;" : ""),
+                                  (opal_prefix != NULL ? " ; export OPAL_PREFIX ; " : ""),
                                  prefix_dir, bin_base,
                                  prefix_dir, lib_base,
                                  prefix_dir, bin_base,
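
To illustrate the effect of the patch, the sketch below runs a command prefix of the shape the patched launcher would presumably send. The string is reconstructed by hand from the sh/bash trace plus the added export, not captured from real output, and `env` stands in for the orted binary at the end of the command line.

```shell
prefix=/opt/openmpi-1.2.4

# Hand-built approximation of the patched remote command line.
# \$PATH and \$LD_LIBRARY_PATH are expanded by the child shell, not here.
remote_cmd="OPAL_PREFIX=$prefix ; export OPAL_PREFIX ; \
PATH=$prefix/bin:\$PATH ; export PATH ; \
LD_LIBRARY_PATH=$prefix/lib:\$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; \
env"

# Run it in a child sh that starts without OPAL_PREFIX, as a fresh ssh login would.
daemon_env=$( (unset OPAL_PREFIX; sh -c "$remote_cmd") )

# The stand-in daemon now sees OPAL_PREFIX in its environment.
printf '%s\n' "$daemon_env" | grep '^OPAL_PREFIX='
```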

Another workaround is to add
export OPAL_PREFIX
into $HOME/.bashrc.
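
The .bashrc workaround relies on the fact that `export NAME` marks the name for export even before it has a value, so the later plain assignment sent by the launcher ends up in the environment. A small sketch of that ordering follows. It assumes the remote bash actually reads $HOME/.bashrc for the non-interactive ssh command and that the file does not return early for non-interactive shells.

```shell
unset OPAL_PREFIX

# What the line in $HOME/.bashrc does: mark the (still unset) name for export.
export OPAL_PREFIX

# What the unpatched launcher sends afterwards: a plain assignment.
OPAL_PREFIX=/opt/openmpi-1.2.4

# The assignment is exported because the name already carries the export attribute.
seen_by_child=$(sh -c 'echo "${OPAL_PREFIX:-unset}"')
echo "child sees: $seen_by_child"
```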

Jeff, is this a bug in the code? Or is there a reason that OPAL_PREFIX is not exported for sh/bash?

Teng
