Gah.  I didn't realize that my 1.4.x build was a *developer* build.  
*Developer* builds give a *lot* more detail with plm_base_verbose=100 
(including the specific rsh command being used).  You obviously didn't get that 
output because you don't have a developer build.  :-\

Just for reference, here's what plm_base_verbose=100 tells me for running an 
orted on a remote node, when I use the --prefix option to mpirun (I'm a tcsh 
user, so the syntax below will be a little different than what is running in 
your environment):

-----
[svbu-mpi:28527] [[20181,0],0] plm:rsh: executing: (//usr/bin/ssh) 
[/usr/bin/ssh svbu-mpi001  set path = ( /home/jsquyres/bogus/bin $path ) ; if ( 
$?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) 
setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib ; if ( $?OMPI_have_llp == 1 ) 
setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib:$LD_LIBRARY_PATH ;  
/home/jsquyres/bogus/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 
1322582016 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 
"1322582016.0;tcp://172.29.218.140:34815;tcp://10.148.255.1:34815" --mca 
plm_base_verbose 100]
-----
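
For a Bourne-style login shell such as bash, the environment setup in that command line would look roughly like the sketch below. This is my hand translation of the tcsh logic above, not the literal string that mpirun generates for sh/bash; the function name ompi_prefix_env is made up for illustration, and /home/jsquyres/bogus is just the example prefix from the log.

```shell
# Hand-written Bourne-shell rendering of the tcsh setup shown in the log.
# Assumption: this mirrors the logic only; the real command is built by mpirun.
ompi_prefix_env() {
    prefix=$1

    # tcsh: set path = ( <prefix>/bin $path )
    PATH="$prefix/bin:$PATH"
    export PATH

    # tcsh: $?LD_LIBRARY_PATH tests whether the variable is set at all
    if [ -n "${LD_LIBRARY_PATH+set}" ]; then
        # Already set: prepend <prefix>/lib to the existing value
        LD_LIBRARY_PATH="$prefix/lib:$LD_LIBRARY_PATH"
    else
        # Not set: use <prefix>/lib on its own
        LD_LIBRARY_PATH="$prefix/lib"
    fi
    export LD_LIBRARY_PATH
}

# Try it in a subshell so the current environment is left alone
( ompi_prefix_env /home/jsquyres/bogus; echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" )
```

After this setup, the real command goes on to run <prefix>/bin/orted by its absolute path, as in the log above.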

Ok, a few options here:

1. You can get a developer build if you use the --enable-debug option to 
configure.  Then plm_base_verbose=100 will give a lot more info.  Remember, the 
goal here is to see what's going wrong -- not to depend on having a developer 
build around.

2. If that isn't workable, put a short script named "orted" somewhere in your 
default path:

-----
:
echo ===========environment===========
env | sort
echo ===========environment end===========
sleep 10000000
-----

Then, when you run "mpirun", do a "ps" on both the node where mpirun was 
invoked and the node where orted is supposed to be running, to see exactly what 
was executed.  It's not quite as descriptive as the plm_base_verbose output 
because we run multiple shell commands, but it's something.  You'll also see 
the stdout from the local node.  To see the output from the remote nodes, add 
the --leave-session-attached option to mpirun.
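
Here is one way to set up the stand-in script, as a sketch.  STUB_DIR is a name I made up; by default the snippet writes into a scratch directory so it is safe to try, but for the real test it has to point at a directory that is in the default path of a non-interactive login on every node.  The ps invocation in the comments is only a suggestion.

```shell
# Directory that will hold the stand-in "orted" (STUB_DIR is hypothetical).
# For a real test, set it to a directory in the default PATH on each node.
stub_dir="${STUB_DIR:-$(mktemp -d)}"
mkdir -p "$stub_dir"

# Write the stand-in script; the quoted 'EOF' keeps its contents literal.
cat > "$stub_dir/orted" <<'EOF'
:
echo ===========environment===========
env | sort
echo ===========environment end===========
sleep 10000000
EOF
chmod +x "$stub_dir/orted"
echo "stand-in orted written to $stub_dir/orted"

# Next steps (not run here):
#   1. Launch your usual mpirun command with --leave-session-attached added.
#   2. While it sits there, look at the processes on each node, e.g.:
#        ps -ef | grep '[o]rted'
#   3. Kill mpirun and remove the stand-in script when you are done.
```

The long sleep is what keeps the process around long enough for you to inspect it with ps.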


On Feb 29, 2012, at 9:43 AM, Yiguang Yan wrote:

> Hi Jeff,
> 
> Thanks.
> 
> I tried what you suggested. Here is the output:
> 
>>>> 
> yiguang@gulftown testdmp]$ ./test.bash
> [gulftown:25052] mca: base: components_open: Looking for plm 
> components
> [gulftown:25052] mca: base: components_open: opening plm 
> components
> [gulftown:25052] mca: base: components_open: found loaded 
> component rsh
> [gulftown:25052] mca: base: components_open: component rsh 
> has no register function
> [gulftown:25052] mca: base: components_open: component rsh 
> open function successful
> [gulftown:25052] mca: base: components_open: found loaded 
> component slurm
> [gulftown:25052] mca: base: components_open: component slurm 
> has no register function
> [gulftown:25052] mca: base: components_open: component slurm 
> open function successful
> [gulftown:25052] mca: base: components_open: found loaded 
> component tm
> [gulftown:25052] mca: base: components_open: component tm 
> has no register function
> [gulftown:25052] mca: base: components_open: component tm 
> open function successful
> [gulftown:25052] mca:base:select: Auto-selecting plm components
> [gulftown:25052] mca:base:select:(  plm) Querying component [rsh]
> [gulftown:25052] mca:base:select:(  plm) Query of component [rsh] 
> set priority to 10
> [gulftown:25052] mca:base:select:(  plm) Querying component 
> [slurm]
> [gulftown:25052] mca:base:select:(  plm) Skipping component 
> [slurm]. Query failed to return a module
> [gulftown:25052] mca:base:select:(  plm) Querying component [tm]
> [gulftown:25052] mca:base:select:(  plm) Skipping component [tm]. 
> Query failed to return a module
> [gulftown:25052] mca:base:select:(  plm) Selected component [rsh]
> [gulftown:25052] mca: base: close: component slurm closed
> [gulftown:25052] mca: base: close: unloading component slurm
> [gulftown:25052] mca: base: close: component tm closed
> [gulftown:25052] mca: base: close: unloading component tm
> bash: orted: command not found
> bash: orted: command not found
> bash: orted: command not found
> <<<
> 
> 
> The following is the content of test.bash:
>>>> 
> yiguang@gulftown testdmp]$ ./test.bash
> #!/bin/sh -f
> #nohup
> #
> # >-------------------------------------------------------------------------------------------<
> adinahome=/usr/adina/system8.8dmp
> mpirunfile=$adinahome/bin/mpirun
> #
> # Set envars for mpirun and orted
> #
> export PATH=$adinahome/bin:$adinahome/tools:$PATH
> export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH
> #
> #
> # run DMP problem
> #
> mcaprefix="--prefix $adinahome"
> mcarshagent="--mca plm_rsh_agent rsh:ssh"
> mcatmpdir="--mca orte_tmpdir_base /tmp"
> mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0"
> mcaenvars="-x PATH -x LD_LIBRARY_PATH"
> mcabtlconn="--mca btl openib,sm,self"
> mcaplmbase="--mca plm_base_verbose 100"
> 
> mcaparams="$mcaprefix $mcaenvars $mcarshagent 
> $mcaopenibmsg $mcabtlconn $mcatmpdir $mcaplmbase"
> 
> $mpirunfile $mcaparams --app addmpw-hostname
> <<<
> 
> While the content of addmpw-hostname is:
>>>> 
> -n 1 -host gulftown hostname
> -n 1 -host ibnode001 hostname
> -n 1 -host ibnode002 hostname
> -n 1 -host ibnode003 hostname
> <<<
> 
> After this, I also tried to specify the orted through:
> 
> --mca orte_launch_agent $adinahome/bin/orted
> 
> then, orted could be found on slave nodes, but now the shared libs 
> in $adinahome/lib are not on the LD_LIBRARY_PATH.
> 
> Any comments?
> 
> Thanks,
> Yiguang
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

