We do assume that Open MPI is installed in the same location on all execute nodes, and we point mpirun at that location by passing --prefix $OPEN_MPI_DIR. The ssh wrapper script still has ssh apply the PATH, LD_LIBRARY_PATH, etc. definitions that mpirun feeds it. However, the location of the mpicc-compiled executable varies by node/slot. It seems that the head node process tells the worker node processes to look for the executable at a fixed path, and if a worker node process doesn't find it there, it falls back to looking in $HOME (*not* the current working directory).
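
For reference, the pieces fit together roughly like this on the head node (the wrapper name, hostfile, process count, and executable name below are just placeholders for our setup, not anything Open MPI requires):

    # $OPEN_MPI_DIR is the (identical) Open MPI install location on every node;
    # ./ssh_wrapper is our ssh wrapper script used as the rsh/ssh launch agent
    mpirun --prefix $OPEN_MPI_DIR \
           --mca plm_rsh_agent ./ssh_wrapper \
           -n 8 --hostfile machines ./mpitest
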
On Mon, Nov 28, 2016 at 10:57 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> I'm not sure I understand your solution -- it sounds like you are overriding
> $HOME for each process...? If so, that's playing with fire.
>
> Is there a reason you can't set PATH / LD_LIBRARY_PATH in your ssh wrapper
> script to point to the Open MPI installation that you want to use on each
> node?
>
> To answer your question: yes, the "rsh agent" MCA param has changed over
> time. It's been plm_rsh_agent for a while, though. I don't remember exactly
> when it changed, but it's been that way since at least v1.8.0.
>
>
>> On Nov 23, 2016, at 5:04 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>
>> I think I may have solved this, in case anyone is curious or wants to
>> yell about how terrible it is :). In the ssh wrapper script, when
>> ssh-ing, before launching orted:
>>
>> export HOME=${your_working_directory} \;
>>
>> (If $HOME means something for your jobs, then maybe this isn't a good
>> solution.)
>>
>> Got this from connecting some dots in the man page.
>>
>> Under Current Working Directory (emphasis added):
>>
>> "If the -wdir option is not specified, Open MPI will send the
>> directory name where mpirun was invoked to each of the remote nodes.
>> The remote nodes will try to change to that directory. If they are
>> unable (e.g., if the directory does not exist on that node), then
>> **Open MPI will use the default directory determined by the
>> starter**."
>>
>> In this case the starter is ssh; under Locating Files:
>>
>> "For example when using the rsh or ssh starters, **the initial
>> directory is $HOME by default**."
>>
>> Hope this helps someone!
>>
>> Jason Patton
>>
>> On Wed, Nov 23, 2016 at 1:43 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>> I would like to mpirun across nodes that do not share a filesystem and
>>> that might have the executable in different directories. For example,
>>> node0 has the executable at /tmp/job42/mpitest and node1 has it at
>>> /tmp/job100/mpitest.
>>>
>>> If you grant me that I have an ssh wrapper script (set as the
>>> orte/plm_rsh_agent**) that cds to where the executable lies on each
>>> worker node before launching orted, is there a way to tell the worker
>>> node orted processes to run the executable from the current working
>>> directory rather than from the absolute path that (I presume) the head
>>> node process advertises? I've tried adding/changing
>>> orte_remote_tmpdir_base for each worker orted process, but then I get
>>> an error about having both global_tmpdir and remote_tmpdir set. Then,
>>> if I set local_tmpdir to match the head node, I'm back at square one.
>>>
>>> I know this sounds fairly convoluted, but I'm updating helper scripts
>>> for HTCondor so that its parallel universe can work with newer MPI
>>> versions (I'm dealing with similar headaches trying to get hydra to
>>> cooperate). The default behavior is for condor to place each "job"
>>> (i.e., sshd+orted process) in a sandbox, and we cannot know the names
>>> of the sandbox directories ahead of time or assume that they will be
>>> the same across nodes. The easiest way to deal with this is to assume
>>> the executable lies on a shared fs, but the fewer assumptions from our
>>> POV the better. (Even better would be if someone /really/ wants to
>>> build in condor support like has been done for other launchers; that's
>>> beyond me right now.)
>>>
>>> ** Also, what is the correct parameter for setting the rsh agent?
>>> ompi_info (and mpirun) says orte_rsh_agent is deprecated, but online
>>> docs seem to suggest that plm_rsh_agent is deprecated. I'm using
>>> version 1.8.1.
>>>
>>> Thanks for any insight you can provide
>>>
>>> Jason Patton
>
> --
> Jeff Squyres
> jsquy...@cisco.com
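
For anyone wiring this up themselves, a minimal sketch of the wrapper-script approach described above might look like the following. The lookup_sandbox_dir helper is hypothetical, standing in for however your batch system exposes the per-node sandbox directory; it is not part of Open MPI or HTCondor.

    #!/bin/sh
    # Sketch of an ssh wrapper used as the rsh/ssh launch agent, e.g.:
    #   mpirun --mca plm_rsh_agent ./ssh_wrapper ...
    # mpirun invokes it as: ssh_wrapper <hostname> <orted command ...>
    host=$1
    shift

    # Hypothetical helper: however you discover the per-node sandbox
    # directory that holds the executable on $host.
    sandbox=$(lookup_sandbox_dir "$host")

    # Point $HOME at the sandbox so that, when orted cannot chdir to the
    # head node's working directory, the ssh starter's fallback ($HOME)
    # is the sandbox; then launch orted from there.
    exec ssh "$host" "export HOME=$sandbox; cd $sandbox; $*"
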