Passing --wdir to mpirun does not solve this particular case, I believe. HTCondor sets up each worker slot with a uniquely named sandbox; e.g., a 2-process job might have the user's executable copied to /var/lib/condor/execute/dir_11955 on one machine and to /var/lib/condor/execute/dir_3484 on the other. Unless both machines can reach the executable at a common path on a shared filesystem (which we don't want to assume), --wdir won't help. We need to set the working directory per worker process.
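Concretely, the wrapper we hand to mpirun as the rsh agent ends up doing something like the sketch below on each slot. This is a rough outline rather than the real HTCondor script; the sandbox path is a per-slot placeholder, and how the wrapper discovers it is glossed over.

    #!/bin/sh
    # Sketch of an ssh wrapper used as the rsh agent (not the real HTCondor
    # script).  mpirun invokes the agent roughly as:
    #   wrapper <hostname> <remote orted command...>
    host="$1"; shift
    # Placeholder: the real per-slot sandbox path is discovered elsewhere.
    sandbox="/var/lib/condor/execute/dir_XXXX"
    exec ssh "$host" "cd $sandbox && export HOME=$sandbox && $*"

The cd starts orted in the sandbox, and overriding HOME covers Open MPI's fallback when it cannot chdir to mpirun's working directory (see the man page excerpts quoted further down).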
If --wdir were an option for the worker orted processes, we could implement that in the ssh wrapper script. Right now, the best option seems to be setting $HOME to the sandbox directory so that, when the initial chdir fails, the fallback lands in the right place.

On Mon, Nov 28, 2016 at 11:36 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> On Nov 28, 2016, at 12:16 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>
>> We do assume that Open MPI is installed in the same location on all
>> execute nodes, and we set that by passing --prefix $OPEN_MPI_DIR to
>> mpirun. The ssh wrapper script still tells ssh to execute the PATH,
>> LD_LIBRARY_PATH, etc. definitions that mpirun feeds it. However, the
>> location of the mpicc-compiled executable varies by node/slot.
>
> Ah -- you're trying to set the location of the user's executable, not the
> Open MPI helpers/libraries/etc. I missed that.
>
> In that case, --wdir should do what you want (even on a per app basis). Let
> us know if it does not.
>
>> It seems that the head node process tells the worker node processes to
>> look for this executable in a static location, and if the worker node
>> processes don't find it there, they look for it in $HOME (*not* the
>> current working directory).
>
> IIRC, Open MPI will first try to chdir to the working dir of where mpirun was
> launched. If it can't (e.g., if that dir does not exist), it will just
> chdir($HOME).
>
> If you specify --wdir, I believe it will try to chdir to that dir. If that
> dir does not exist, or it otherwise fails to chdir there, it should fail /
> kill your job (on the rationale that you explicitly asked for something that
> Open MPI couldn't do, vs. the implicit chdir to the working dir of mpirun).
>
>> On Mon, Nov 28, 2016 at 10:57 AM, Jeff Squyres (jsquyres)
>> <jsquy...@cisco.com> wrote:
>>> I'm not sure I understand your solution -- it sounds like you are
>>> overriding $HOME for each process...? If so, that's playing with fire.
>>>
>>> Is there a reason you can't set PATH / LD_LIBRARY_PATH in your ssh wrapper
>>> script to point to the Open MPI installation that you want to use on each
>>> node?
>>>
>>> To answer your question: yes, the "rsh agent" MCA param has changed over time.
>>> It's been plm_rsh_agent for a while, though. I don't remember exactly when
>>> it changed, but it's been that way since at least v1.8.0.
>>>
>>>> On Nov 23, 2016, at 5:04 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>>>
>>>> I think I may have solved this, in case anyone is curious or wants to
>>>> yell about how terrible it is :). In the ssh wrapper script, when
>>>> ssh-ing, before launching orted:
>>>>
>>>> export HOME=${your_working_directory} \;
>>>>
>>>> (If $HOME means something for your jobs, then maybe this isn't a good
>>>> solution.)
>>>>
>>>> Got this from connecting some dots from the man page:
>>>>
>>>> Under Current Working Directory (emphasis added):
>>>>
>>>> "If the -wdir option is not specified, Open MPI will send the
>>>> directory name where mpirun was invoked to each of the remote nodes.
>>>> The remote nodes will try to change to that directory. If they are
>>>> unable (e.g., if the directory does not exist on that node), then
>>>> **Open MPI will use the default directory determined by the
>>>> starter**."
>>>>
>>>> In this case the starter is ssh; under Locating Files:
>>>>
>>>> "For example when using the rsh or ssh starters, **the initial
>>>> directory is $HOME by default**."
>>>>
>>>> Hope this helps someone!
>>>>
>>>> Jason Patton
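[Aside for anyone finding this in the archives: combining the quoted workaround with the --prefix and plm_rsh_agent settings mentioned earlier in the thread, the head-node command ends up looking roughly like the line below. The wrapper path, machine file, and process count are placeholders, not the actual HTCondor setup.]

    # Illustrative only; wrapper path, machine file, and -np value are made up.
    mpirun --prefix $OPEN_MPI_DIR \
        -mca plm_rsh_agent /path/to/ssh_wrapper.sh \
        -np 2 -machinefile machines \
        ./mpitest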
>>>>
>>>> On Wed, Nov 23, 2016 at 1:43 PM, Jason Patton <jpat...@cs.wisc.edu> wrote:
>>>>> I would like to mpirun across nodes that do not share a filesystem and
>>>>> might have the executable in different directories. For example, node0
>>>>> has the executable at /tmp/job42/mpitest and node1 has it at
>>>>> /tmp/job100/mpitest.
>>>>>
>>>>> If you can grant me that I have a ssh wrapper script (that gets set as
>>>>> the orte/plm_rsh_agent**) that cds to where the executable lies on
>>>>> each worker node before launching orted, is there a way to tell the
>>>>> worker node orted processes to run the executable from the current
>>>>> working directory rather than from the absolute path that (I presume)
>>>>> the head node process advertises? I've tried adding/changing
>>>>> orte_remote_tmpdir_base per each worker orted process, but then I get
>>>>> an error about having both global_tmpdir and remote_tmpdir set. Then
>>>>> if I set local_tmpdir to match the head node, I'm back at square one.
>>>>>
>>>>> I know this sounds fairly convoluted, but I'm updating helper scripts
>>>>> for HTCondor so that its parallel universe can work with newer MPI
>>>>> versions (dealing with similar headaches trying to get hydra to
>>>>> cooperate). The default behavior is for condor to place each "job"
>>>>> (i.e. sshd+orted process) in a sandbox, and we cannot know the name of
>>>>> the sandbox directories ahead of time or assume that they will have
>>>>> the same name across nodes. The easiest way to deal with this is if we
>>>>> can assume the executable lies on a shared fs, but the fewer
>>>>> assumptions from our POV the better. (Even better would be if someone
>>>>> /really/ wants to build in condor support like has been done for other
>>>>> launchers; that's beyond me right now.)
>>>>>
>>>>> **Also, what is the correct parameter to set to rsh_agent? ompi_info
>>>>> (and mpirun) says orte_rsh_agent is deprecated, but online docs seem
>>>>> to suggest that plm_rsh_agent is deprecated. I'm using version 1.8.1.
>>>>>
>>>>> Thanks for any insight you can provide
>>>>>
>>>>> Jason Patton
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users