We do assume that Open MPI is installed in the same location on all
execute nodes, and we point mpirun at that installation by passing
--prefix $OPEN_MPI_DIR. The ssh wrapper script still has ssh execute the
PATH, LD_LIBRARY_PATH, etc. definitions that mpirun feeds it. However, the
location of the mpicc-compiled executable varies by node/slot. It
seems that the head node process tells the worker node processes to
look for this executable in a static location, and if the worker node
processes don't find it there, they look for it in $HOME (*not* the
current working directory).
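
A minimal sketch of the setup described above (the wrapper name
ssh_wrapper.sh, the hostfile, and the SANDBOX_DIR lookup are placeholders
for illustration, not the actual HTCondor helper scripts):

    # Head node: launch against the common Open MPI install and use our
    # wrapper in place of ssh.
    mpirun --prefix "$OPEN_MPI_DIR" \
           --mca plm_rsh_agent ./ssh_wrapper.sh \
           -np 8 --hostfile hosts mpitest

    #!/bin/sh
    # ssh_wrapper.sh -- mpirun invokes this as: ssh_wrapper.sh <host> <orted cmd...>
    # The <orted cmd...> part already carries the PATH / LD_LIBRARY_PATH
    # exports that mpirun builds from --prefix; we only prepend an override
    # of HOME so that orted's fallback directory (the "default directory
    # determined by the starter") becomes the per-node sandbox.
    host="$1"; shift
    # SANDBOX_DIR stands in for however the sandbox path on $host is
    # discovered in the real helper scripts.
    exec ssh "$host" "export HOME=$SANDBOX_DIR ; cd \$HOME ; $*"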

On Mon, Nov 28, 2016 at 10:57 AM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> I'm not sure I understand your solution -- it sounds like you are overriding 
> $HOME for each process...?  If so, that's playing with fire.
>
> Is there a reason you can't set PATH / LD_LIBRARY_PATH in your ssh wrapper 
> script to point to the Open MPI installation that you want to use on each 
> node?
>
> To answer your question: yes, "rsh agent" MCA param has changed over time.  
> It's been plm_rsh_agent for a while, though.  I don't remember exactly when 
> it changed, but it's been that way since at least v1.8.0.
>
>
>> On Nov 23, 2016, at 5:04 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>
>> I think I may have solved this, in case anyone is curious or wants to
>> yell about how terrible it is :). In the ssh wrapper script, when
>> ssh-ing, before launching orted:
>>
>> export HOME=${your_working_directory} \;
>>
>> (If $HOME means something for your jobs, then maybe this isn't a good
>> solution.)
>>
>> Got this from connecting some dots from the man page:
>>
>> Under Current Working Directory (emphasis added):
>>
>> "If the -wdir option is not specified, Open MPI will send the
>> directory name where mpirun was invoked to each of the remote nodes.
>> The remote nodes will try to change to that directory. If they are
>> unable (e.g., if the directory does not exist on that node), then
>> **Open MPI will use the default directory determined by the
>> starter**."
>>
>> In this case the starter is ssh; under Locating Files:
>>
>> "For example when using the rsh or ssh starters, **the initial
>> directory is $HOME by default**."
>>
>> Hope this helps someone!
>>
>> Jason Patton
>>
>> On Wed, Nov 23, 2016 at 1:43 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>> I would like to mpirun across nodes that do not share a filesystem and
>>> might have the executable in different directories. For example, node0
>>> has the executable at /tmp/job42/mpitest and node1 has it at
>>> /tmp/job100/mpitest.
>>>
>>> If you can grant me that I have an ssh wrapper script (that gets set as
>>> the orte/plm_rsh_agent**) that cds to where the executable lies on
>>> each worker node before launching orted, is there a way to tell the
>>> worker node orted processes to run the executable from the current
>>> working directory rather than from the absolute path that (I presume)
>>> the head node process advertises? I've tried adding/changing
>>> orte_remote_tmpdir_base for each worker orted process, but then I get
>>> an error about having both global_tmpdir and remote_tmpdir set. Then
>>> if I set local_tmpdir to match the head node, I'm back at square one.
>>>
>>> I know this sounds fairly convoluted, but I'm updating helper scripts
>>> for HTCondor so that its parallel universe can work with newer MPI
>>> versions (dealing with similar headaches trying to get hydra to
>>> cooperate). The default behavior is for condor to place each "job"
>>> (i.e. sshd+orted process) in a sandbox, and we cannot know the name of
>>> the sandbox directories ahead of time or assume that they will have
>>> the same name across nodes. The easiest way to deal with this is if we
>>> can assume the executable lies on a shared fs, but the fewer
>>> assumptions from our POV the better. (Even better would be if someone
>>> /really/ wants to build in condor support like has been done for other
>>> launchers; that's beyond me right now.)
>>>
>>> **Also, what is the correct MCA parameter for setting the rsh agent? ompi_info
>>> (and mpirun) says orte_rsh_agent is deprecated, but online docs seem
>>> to suggest that plm_rsh_agent is deprecated. I'm using version 1.8.1.
>>>
>>> Thanks for any insight you can provide
>>>
>>> Jason Patton
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>