Hi Gilles,

> Could you please apply the attached patch and try again with and
> without --prefix ...

Everything works fine with your patch. Thank you very much for your help.
Even the Java problem, which I reported last Friday in a separate e-mail,
is solved with your patch. I assume that it originated from the faulty
environment as well.


tyr small_prog 110 mpiexec --prefix /usr/local/openmpi-1.8.5_64_cc \
  -np 5 --host sunpc1,linpc1,tyr,rs0 init_finalize
Hello!
Hello!
Hello!
Hello!
Hello!
tyr small_prog 111 mpiexec -np 5 --host sunpc1,linpc1,tyr,rs0 init_finalize
Hello!
Hello!
Hello!
Hello!
Hello!
tyr small_prog 112 


Kind regards and once more thank you very much

Siegmar




> It seems there was a mistake when the following commit was backported to
> the v1.8 branch:
> commit 10ff75e91c3f5dad18ea854fd0ee831b2ea066d7
> Author: Ralph Castain <r...@open-mpi.org>
> Date:   Fri Apr 17 19:35:34 2015 -0700
> 
>      Per request from Andy Rieb, add ability to pass PATH and
> LD_LIBRARY_PATH elements to ssh command
>      Per request from David Bigagli, add ability to pass ssh args
> 
>      Taken from open-mpi/ompi@12bfb27161fb2710d9b4327072776ff3333f0afc
> 
> Cheers,
> 
> Gilles
> 
> FWIW, here are the details:
> 
> the bug is in orte/mca/plm/rsh/plm_rsh_module.c:
> 
> static int setup_launch(...)
> {
> ...
>      char *lib_base=NULL, *bin_base=NULL;
> ...
>      lib_base = opal_basename(opal_install_dirs.libdir);
>      bin_base = opal_basename(opal_install_dirs.bindir);
> ...
>      if (NULL != prefix_dir) {
> ...
>          asprintf(&bin_base, "%s/%s", prefix_dir, value);
> ...
>      }
>      if (NULL != lib_base || NULL != bin_base) {
> ...
>         } else if (ORTE_PLM_RSH_SHELL_TCSH == remote_shell ||
>                     ORTE_PLM_RSH_SHELL_CSH == remote_shell) {
> ...
>              (void)asprintf (&final_cmd,
>                              "%s%s%s set path = ( %s $path ) ; "
> ...
>                              "setenv LD_LIBRARY_PATH %s ; "
> ...
>                              (NULL != bin_base ? bin_base : " "),
>                              (NULL != lib_base ? lib_base : " "),
> ...
> 
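> In your case, prefix_dir is NULL, so bin_base stays "bin" and lib_base
> stays "lib64" (the bare basenames). Since neither is NULL, these relative
> names end up in the path and LD_LIBRARY_PATH settings of the remote command.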
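
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual Open MPI code: last_component() is a hypothetical stand-in for opal_basename(), path_cmd_seeded() and path_cmd_guarded() are illustrative names, and the guard shown is only one possible way to avoid the problem (an assumption, not necessarily what the attached patch does).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for opal_basename(): last component of a path. */
static const char *last_component(const char *path)
{
    const char *slash;

    assert(path != NULL);
    slash = strrchr(path, '/');
    return (slash != NULL) ? slash + 1 : path;
}

/* Pattern from the excerpt: bin_base is seeded with the bare basename,
 * so it is non-NULL even when no prefix was given. */
static int path_cmd_seeded(const char *prefix_dir, const char *bindir,
                           char *out, size_t outlen)
{
    char full[512];
    const char *bin_base = last_component(bindir);

    if (NULL != prefix_dir) {
        snprintf(full, sizeof(full), "%s/%s",
                 prefix_dir, last_component(bindir));
        bin_base = full;
    }
    /* The NULL check below can never take the " " branch here. */
    return snprintf(out, outlen, "set path = ( %s $path ) ;",
                    (NULL != bin_base ? bin_base : " "));
}

/* One possible guard (an assumption): leave bin_base NULL unless a
 * prefix was given, so the " " fallback is actually reachable. */
static int path_cmd_guarded(const char *prefix_dir, const char *bindir,
                            char *out, size_t outlen)
{
    char full[512];
    const char *bin_base = NULL;

    if (NULL != prefix_dir) {
        snprintf(full, sizeof(full), "%s/%s",
                 prefix_dir, last_component(bindir));
        bin_base = full;
    }
    return snprintf(out, outlen, "set path = ( %s $path ) ;",
                    (NULL != bin_base ? bin_base : " "));
}
```

Calling path_cmd_seeded() with a NULL prefix_dir yields the same "set path = ( bin $path )" fragment that appears in the plm:rsh template in the log, while path_cmd_guarded() leaves the remote path untouched in that case.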
> 
> 
> 
> 
> On Mon, May 18, 2015 at 2:01 AM, Siegmar Gross <
> siegmar.gr...@informatik.hs-fulda.de> wrote:
> 
> > Hi Gilles,
> >
> > > I am having a hard time reading the logs on my tablet...
> > > bottom line, did using --prefix /usr/local/openmpi-1.8.5_64_cc fix all
> > > your issues?
> >
> > Yes, it did and the environment is also correct when I use --prefix.
> >
> > tyr small_prog 109 mpiexec --prefix /usr/local/openmpi-1.8.5_64_cc -np 5
> > --host sunpc1,linpc1,tyr,rs0 init_finalize
> > Hello!
> > Hello!
> > Hello!
> > Hello!
> > Hello!
> > tyr small_prog 110 mpiexec -np 5 --host sunpc1,linpc1,tyr,rs0 
> > init_finalize
> > ld.so.1: ssh: fatal: relocation error: file /usr/bin/ssh: symbol
> > SUNWcry_installed: referenced symbol not found
> > ...
> >
> >
> >
> > Without --prefix the following part goes wrong as far as I can see
> > and I get the wrong environment with "bin" and "lib64".
> >
> > ...
> >
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:base:setup_vm
> > assigning new daemon [[43332,0],2] to node linpc1
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:base:setup_vm add
> > new daemon [[43332,0],3]
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:base:setup_vm
> > assigning new daemon [[43332,0],3] to node rs0
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:rsh: launching vm
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:rsh: local shell: 2
> > (tcsh)
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:rsh: assuming same
> > remote shell as local shell
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:rsh: remote shell: 2
> > (tcsh)
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:rsh: final template
> > argv:
> >         /usr/local/bin/ssh <template>     set path = ( bin $path ) ; if (
> > $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if (
> > $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH lib64 ; if
> > ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH lib64:$LD_LIBRARY_PATH ;
> > if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp
> > ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH
> > lib64 ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH
> > lib64:$DYLD_LIBRARY_PATH ;   orted --hnp-topo-sig
> > 2N:2S:0L3:0L2:0L1:2C:2H:sun4u -mca ess "env" -mca orte_ess_jobid
> > "2839805952" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "4"
> > -mca orte_hnp_uri
> > "2839805952.0;tcp://193.174.24.39:34971" --tree-spawn --mca
> > plm_base_verbose "100" -mca plm
> > "rsh"
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:rsh:launch daemon 0
> > not a child of mine
> > [tyr.informatik.hs-fulda.de:03938] [[43332,0],0] plm:rsh: adding node
> > sunpc1 to launch list
> > ...
> >
> >
> > Do you know which file is responsible for the above part?
> >
> >
> > > if not, can you try adding the --hetero-nodes option to mpiexec?
> > >
> > > just to be sure, can you please confirm your login shell is csh/tcsh on
> > > all your boxes?
> >
> > It is and must be tcsh, because otherwise our environment wouldn't work.
> >
> >
> > Kind regards
> >
> > Siegmar
