Ah - not good. It is clearly a programming error. I'll have to review the other launchers and consult the others in the project to decide on the proper course of action.
Thanks

On Nov 17, 2009, at 1:49 PM, David Singleton wrote:

>
> Hi Ralph,
>
> Now I'm in a quandary - if I show you that it's actually Open MPI that is
> propagating the environment then you are likely to "fix it" and then tm
> users will lose a nice feature. :-)
>
> Can I suggest that "least surprise" would require that MPI tasks get
> exactly the same environment/limits/... as mpirun so that "mpirun a.out"
> behaves just like "a.out". [Following this principle we modified
> tm_spawn to propagate the caller's rlimits to the spawned tasks.]
> A comment in orterun.c (see below) suggests that Open MPI is trying
> to distinguish between "local" and "remote" processes. I would have
> thought that distinction should be invisible to users as much as possible
> - a user asking for 4 cpus would like to see the same behaviour if all
> 4 are local or "2 local, 2 remote".
>
> As to why tm does "The Right Thing": in the case of rsh/ssh the full
> mpirun environment is given to the rsh/ssh process locally while in the tm
> case it is an argument to tm_spawn and so gets given to the process (in
> this case orted) being launched remotely. Relevant lines from 1.3.3 below.
> PBS just passes along the environment it is told to. We don't use Torque
> but as of 2.3.3, it was still the same as OpenPBS in this respect.
>
> Michael just pointed out the slight flaw. The environment should be
> somewhat selectively propagated (exclude HOSTNAME etc). I guess if you
> were to "fix" plm_tm_module I would put the propagation behaviour in
> tm_spawn and try to handle these exceptional cases.
>
> Cheers,
> David
>
>
> orterun.c:
>
> 510     /* save the environment for launch purposes. This MUST be
> 511      * done so that we can pass it to any local procs we
> 512      * spawn - otherwise, those local procs won't see any
> 513      * non-MCA envars were set in the enviro prior to calling
> 514      * orterun
> 515      */
> 516     orte_launch_environ = opal_argv_copy(environ);
>
>
> plm_rsh_module.c:
>
> 681 /* actually ssh the child */
> 682 static void ssh_child(int argc, char **argv,
> 683                       orte_vpid_t vpid, int proc_vpid_index)
> 684 {
>
> 694     /* setup environment */
> 695     env = opal_argv_copy(orte_launch_environ);
>
> 766     execve(exec_path, exec_argv, env);
>
>
> plm_tm_module.c:
>
> 128 static int plm_tm_launch_job(orte_job_t *jdata)
> 129 {
>
> 228     /* setup environment */
> 229     env = opal_argv_copy(orte_launch_environ);
>
> 311     rc = tm_spawn(argc, argv, env, node->launch_id, tm_task_ids +
>                       launched, tm_events + launched);
>
>
>
> Ralph Castain wrote:
>> Not exactly. It completely depends on how Torque was set up - OMPI isn't
>> forwarding the environment. Torque is.
>> We made a design decision at the very beginning of the OMPI project not to
>> forward non-OMPI envars unless directed to do so by the user. I'm afraid I
>> disagree with Michael's claim that other MPIs do forward them - yes, MPICH
>> does, but not all others do.
>> The world is bigger than MPICH and OMPI :-)
>> Since there is inconsistency in this regard between MPIs, we chose not to
>> forward. Reason was simple: there is no way to know what is safe to forward
>> vs what is not (e.g., what to do with DISPLAY), nor what the underlying
>> environment is trying to forward vs what it isn't. It is very easy to get
>> cross-wise and cause totally unexpected behavior, as users have complained
>> about for years.
>> First, if you are using a managed environment like Torque, we recommend that
>> you work with your sys admin to decide how to configure it. This is the best
>> way to resolve a problem.
>> Second, if you are not using a managed environment and/or decide not to have
>> that environment do the forwarding, you can tell OMPI to forward the envars
>> you need by specifying them via the -x cmd line option. We already have a
>> request to expand this capability, and I will be doing so as time permits.
>> One option I'll be adding is the reverse of -x - i.e., "forward all envars
>> -except- the specified one(s)".
>> HTH
>> ralph
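
The selective propagation David suggests above (copy the launch environment but leave out variables such as HOSTNAME) might be sketched in C roughly as follows. This is only an illustration under stated assumptions: `filter_env`, `free_env`, `is_excluded` and the `excluded_vars` list are hypothetical names invented for this sketch, not part of Open MPI or the tm API, and which variables are actually unsafe to forward is a site-specific decision.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical exclusion list: HOSTNAME and DISPLAY are the examples
 * mentioned in the thread; a real list would be site-specific. */
static const char *const excluded_vars[] = { "HOSTNAME", "DISPLAY", NULL };

/* Return 1 if a "NAME=value" entry names an excluded variable.
 * The whole name must match, so HOSTNAME_ALIAS is not excluded. */
int is_excluded(const char *entry)
{
    size_t name_len = strcspn(entry, "=");
    for (size_t i = 0; excluded_vars[i] != NULL; i++) {
        if (strlen(excluded_vars[i]) == name_len &&
            strncmp(entry, excluded_vars[i], name_len) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Free an array returned by filter_env(). */
void free_env(char **env)
{
    if (env == NULL) {
        return;
    }
    for (size_t i = 0; env[i] != NULL; i++) {
        free(env[i]);
    }
    free(env);
}

/* Build a NULL-terminated deep copy of env without the excluded
 * entries. Returns NULL on allocation failure. */
char **filter_env(char *const *env)
{
    size_t count = 0;
    while (env[count] != NULL) {
        count++;
    }

    /* calloc zero-fills, so the array stays NULL-terminated throughout. */
    char **out = calloc(count + 1, sizeof *out);
    if (out == NULL) {
        return NULL;
    }

    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (is_excluded(env[i])) {
            continue;
        }
        size_t len = strlen(env[i]) + 1;
        out[kept] = malloc(len);
        if (out[kept] == NULL) {
            free_env(out);
            return NULL;
        }
        memcpy(out[kept], env[i], len);
        kept++;
    }
    return out;
}
```

A launcher could pass the result of `filter_env(environ)` wherever it currently hands over a full copy of the environment, then release it with `free_env()` once the spawn call returns.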