Passing --wdir to mpirun does not solve this particular case, I
believe. HTCondor sets up each worker slot with a uniquely named
sandbox, e.g. a 2-process job might have the user's executable copied
to /var/lib/condor/execute/dir_11955 on one machine and
/var/lib/condor/execute/dir_3484 on another machine. Unless there's a
shared filesystem where both machines can access the executable from a
common location (which we don't want to assume), --wdir won't help. We
need to set the working directory per worker process.

If --wdir were an option for the worker orted processes, then we could
implement that in the ssh wrapper script. Right now, it seems like the
best option is to set $HOME to the sandbox directory for when the
initial chdir fails.

On Mon, Nov 28, 2016 at 11:36 AM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> On Nov 28, 2016, at 12:16 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>
>> We do assume that Open MPI is installed in the same location on all
>> execute nodes, and we set that by passing --prefix $OPEN_MPI_DIR to
>> mpirun. The ssh wrapper script still passes through to ssh the PATH,
>> LD_LIBRARY_PATH, etc. definitions that mpirun feeds it. However, the
>> location of the mpicc-compiled executable varies by node/slot.
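
(Concretely, the invocation is of the form

    mpirun --prefix $OPEN_MPI_DIR ... ./mpitest

with the exact options elided; the Open MPI install itself sits at the
same prefix on every node, and only the user executable's location
differs per node.)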
>
> Ah -- you're trying to set the location of the user's executable, not the 
> Open MPI helpers/libraries/etc.  I missed that.
>
> In that case, --wdir should do what you want (even on a per app basis).  Let 
> us know if it does not.
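
(A per-app-context invocation along those lines, using the example
paths and node names from my first message further down, would
presumably look something like

    mpirun -np 1 -host node0 --wdir /tmp/job42  ./mpitest : \
           -np 1 -host node1 --wdir /tmp/job100 ./mpitest

-- but that still requires knowing every worker's directory when
composing the mpirun command line on the head node, which is exactly
what the HTCondor sandboxes keep us from doing.)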
>
>> It
>> seems that the head node process tells the worker node processes to
>> look for this executable in a static location, and if the worker node
>> processes don't find it there, they look for it in $HOME (*not* the
>> current working directory).
>
> IIRC, Open MPI will first try to chdir to the working dir from which mpirun was 
> launched.  If it can't (e.g., if that dir does not exist), it will just 
> chdir($HOME).
>
> If you specify --wdir, I believe it will try to chdir to that dir.  If that 
> dir does not exist, or it otherwise fails to chdir there, it should fail / 
> kill your job (on the rationale that you explicitly asked for something that 
> Open MPI couldn't do, vs. the implicit chdir to the working dir of mpirun).
>
>
>> On Mon, Nov 28, 2016 at 10:57 AM, Jeff Squyres (jsquyres)
>> <jsquy...@cisco.com> wrote:
>>> I'm not sure I understand your solution -- it sounds like you are 
>>> overriding $HOME for each process...?  If so, that's playing with fire.
>>>
>>> Is there a reason you can't set PATH / LD_LIBRARY_PATH in your ssh wrapper 
>>> script to point to the Open MPI installation that you want to use on each 
>>> node?
>>>
>>> To answer your question: yes, the "rsh agent" MCA param name has changed over time.  
>>> It's been plm_rsh_agent for a while, though.  I don't remember exactly when 
>>> it changed, but it's been that way since at least v1.8.0.
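
(For anyone reading this in the archive: with that parameter name,
pointing Open MPI at the wrapper looks something like

    mpirun --mca plm_rsh_agent /path/to/ssh_wrapper.sh ...

where /path/to/ssh_wrapper.sh stands in for wherever the wrapper
script actually lives.)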
>>>
>>>
>>>> On Nov 23, 2016, at 5:04 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>>>
>>>> I think I may have solved this, in case anyone is curious or wants to
>>>> yell about how terrible it is :). In the ssh wrapper script, when
>>>> ssh-ing, before launching orted:
>>>>
>>>> export HOME=${your_working_directory} \;
>>>>
>>>> (If $HOME means something for your jobs, then maybe this isn't a good 
>>>> solution.)
>>>>
>>>> Got this from connecting some dots from the man page:
>>>>
>>>> Under Current Working Directory (emphasis added):
>>>>
>>>> "If the -wdir option is not specified, Open MPI will send the
>>>> directory name where mpirun was invoked to each of the remote nodes.
>>>> The remote nodes will try to change to that directory. If they are
>>>> unable (e.g., if the directory does not exist on that node), then
>>>> **Open MPI will use the default directory determined by the
>>>> starter**."
>>>>
>>>> In this case the starter is ssh; under Locating Files:
>>>>
>>>> "For example when using the rsh or ssh starters, **the initial
>>>> directory is $HOME by default**."
>>>>
>>>> Hope this helps someone!
>>>>
>>>> Jason Patton
>>>>
>>>> On Wed, Nov 23, 2016 at 1:43 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>>>> I would like to mpirun across nodes that do not share a filesystem and
>>>>> might have the executable in different directories. For example, node0
>>>>> has the executable at /tmp/job42/mpitest and node1 has it at
>>>>> /tmp/job100/mpitest.
>>>>>
>>>>> If you can grant me that I have an ssh wrapper script (that gets set as
>>>>> the orte/plm_rsh_agent**) that cds to where the executable lies on
>>>>> each worker node before launching orted, is there a way to tell the
>>>>> worker node orted processes to run the executable from the current
>>>>> working directory rather than from the absolute path that (I presume)
>>>>> the head node process advertises? I've tried adding/changing
>>>>> orte_remote_tmpdir_base for each worker orted process, but then I get
>>>>> an error about having both global_tmpdir and remote_tmpdir set. Then
>>>>> if I set local_tmpdir to match the head node, I'm back at square one.
>>>>>
>>>>> I know this sounds fairly convoluted, but I'm updating helper scripts
>>>>> for HTCondor so that its parallel universe can work with newer MPI
>>>>> versions (dealing with similar headaches trying to get hydra to
>>>>> cooperate). The default behavior is for condor to place each "job"
>>>>> (i.e. sshd+orted process) in a sandbox, and we cannot know the name of
>>>>> the sandbox directories ahead of time or assume that they will have
>>>>> the same name across nodes. The easiest way to deal with this is if we
>>>>> can assume the executable lies on a shared fs, but the fewer
>>>>> assumptions from our POV the better. (Even better would be if someone
>>>>> /really/ wants to build in condor support as has been done for other
>>>>> launchers; that's beyond me right now.)
>>>>>
>>>>> **Also, what is the correct parameter for setting the rsh agent? ompi_info
>>>>> (and mpirun) says orte_rsh_agent is deprecated, but online docs seem
>>>>> to suggest that plm_rsh_agent is deprecated. I'm using version 1.8.1.
>>>>>
>>>>> Thanks for any insight you can provide
>>>>>
>>>>> Jason Patton