Thanks for the quick reply.  This leads me to another issue I have
been having with Open MPI as it relates to SGE.  The "tight
integration" works in the sense that I do not have to give mpirun a
hostfile when I go through the scheduler, but it does not seem to be
passing on my environment variables.  Specifically, because I built
Open MPI with the Intel compilers, I have to set LD_LIBRARY_PATH
correctly in my job submission script or Open MPI will not run
(giving the error discussed in the FAQ).  What I am unsure about is
whether this is a problem with the way I built Open MPI or a
configuration problem with SGE.
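
In case it helps, my submission script looks roughly like the sketch
below; the Intel library path and the "orte" parallel environment
name are only illustrative of my setup, not exact values:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe orte 8
# Point the dynamic linker at the Intel runtime libraries that this
# Open MPI build needs (example path from my machine).
export LD_LIBRARY_PATH=/opt/intel/Compiler/11.1/lib/intel64:$LD_LIBRARY_PATH
# -x asks mpirun to forward the variable to the launched ranks too.
mpirun -np $NSLOTS -x LD_LIBRARY_PATH ./mpiprogram

Without that export, mpirun fails with the error from the FAQ.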

This may be unrelated to my previous problem, but the similarities
with the environment variables made me think of it.

Thanks for your consideration,
Luke Shulenburger
Geophysical Laboratory
Carnegie Institution of Washington

On Wed, Oct 28, 2009 at 3:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
> I'm afraid we have never really supported this kind of nested invocation of
> mpirun. If it works with any version of OMPI, it is totally a fluke - it
> might work one time, and then fail the next.
>
> The problem is that we pass envars to the launched processes to control
> their behavior, and these conflict with what mpirun needs. We have tried
> various scrubbing mechanisms (i.e., having mpirun start out by scrubbing the
> environment of envars that would have come from the initial mpirun), but they
> all have the unfortunate possibility of removing parameters provided by the
> user - and that can cause its own problems.
>
> I don't know if we will ever support nested operations - occasionally, I do
> give it some thought, but have yet to find a foolproof solution.
>
> Ralph
>
>
> On Wed, Oct 28, 2009 at 1:11 PM, Luke Shulenburger <lshulenbur...@gmail.com>
> wrote:
>>
>> Hello,
>> I am having trouble with a script that itself calls mpirun.  Basically
>> my problem boils down to wanting to invoke a script with:
>>
>> mpirun -np # ./script.sh
>>
>> where script.sh looks like:
>> #!/bin/bash
>> mpirun -np 2 ./mpiprogram
>>
>> Whenever I invoke script.sh normally (as ./script.sh for instance) it
>> works fine, but if I do mpirun -np 2 ./script.sh I get the following
>> error:
>>
>> [ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
>> attempting to be sent to a process whose contact information is
>> unknown in file rml_oob_send.c at line 105
>> [ppv.stanford.edu:08814] [[27860,1],0] could not get route to
>> [[INVALID],INVALID]
>> [ppv.stanford.edu:08814] [[27860,1],0] ORTE_ERROR_LOG: A message is
>> attempting to be sent to a process whose contact information is
>> unknown in file base/plm_base_proxy.c at line 86
>>
>> I have also tried running with mpirun -d to get some debugging info
>> and it appears that the proctable is not being created for the second
>> mpirun.  The command hangs like so:
>>
>> [ppv.stanford.edu:08823] procdir:
>> /tmp/openmpi-sessions-sluke@ppv.stanford.edu_0/27855/0/0
>> [ppv.stanford.edu:08823] jobdir:
>> /tmp/openmpi-sessions-sluke@ppv.stanford.edu_0/27855/0
>> [ppv.stanford.edu:08823] top: openmpi-sessions-sluke@ppv.stanford.edu_0
>> [ppv.stanford.edu:08823] tmp: /tmp
>> [ppv.stanford.edu:08823] [[27855,0],0] node[0].name ppv daemon 0 arch
>> ffc91200
>> [ppv.stanford.edu:08823] Info: Setting up debugger process table for
>> applications
>>  MPIR_being_debugged = 0
>>  MPIR_debug_state = 1
>>  MPIR_partial_attach_ok = 1
>>  MPIR_i_am_starter = 0
>>  MPIR_proctable_size = 1
>>  MPIR_proctable:
>>    (i, host, exe, pid) = (0, ppv.stanford.edu,
>> /home/sluke/maintenance/openmpi-1.3.3/examples/./shell.sh, 8824)
>> [ppv.stanford.edu:08825] procdir:
>> /tmp/openmpi-sessions-sluke@ppv.stanford.edu_0/27855/1/0
>> [ppv.stanford.edu:08825] jobdir:
>> /tmp/openmpi-sessions-sluke@ppv.stanford.edu_0/27855/1
>> [ppv.stanford.edu:08825] top: openmpi-sessions-sluke@ppv.stanford.edu_0
>> [ppv.stanford.edu:08825] tmp: /tmp
>> [ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
>> attempting to be sent to a process whose contact information is
>> unknown in file rml_oob_send.c at line 105
>> [ppv.stanford.edu:08825] [[27855,1],0] could not get route to
>> [[INVALID],INVALID]
>> [ppv.stanford.edu:08825] [[27855,1],0] ORTE_ERROR_LOG: A message is
>> attempting to be sent to a process whose contact information is
>> unknown in file base/plm_base_proxy.c at line 86
>> [ppv.stanford.edu:08825] Info: Setting up debugger process table for
>> applications
>>  MPIR_being_debugged = 0
>>  MPIR_debug_state = 1
>>  MPIR_partial_attach_ok = 1
>>  MPIR_i_am_starter = 0
>>  MPIR_proctable_size = 0
>>  MPIR_proctable:
>>
>>
>> In this case, it does not matter which MPI program I ultimately try
>> to run; the shell script fails in the same way regardless (I've
>> tried the hello_f90 executable from the Open MPI examples
>> directory).  Here are some details of my setup:
>>
>> I have built Open MPI 1.3.3 with the Intel Fortran and C compilers
>> (version 11.1).  The machine uses Rocks with the SGE scheduler, so I
>> ran configure as ./configure --prefix=/home/sluke --with-sge;
>> however, this problem persists even when I am running on the head
>> node outside of the scheduler.  I am attaching the resulting
>> config.log to this email along with the output of ompi_info --all
>> and ifconfig.  I hope this gives the experts on the list enough to
>> go on, but I will be happy to provide any further information that
>> might be helpful.
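>>
>> For completeness, the build went roughly along these lines (a
>> sketch; the compiler variables shown are just the usual Intel 11.1
>> names, and my exact invocation may have differed):
>>
>> CC=icc CXX=icpc F77=ifort FC=ifort ./configure --prefix=/home/sluke --with-sge
>> make
>> make install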
>>
>> Luke Shulenburger
>> Geophysical Laboratory
>> Carnegie Institution of Washington
>>
>>
>> PS: I have tried this on a machine with openmpi-1.2.6 and cannot
>> reproduce the error; however, on a second machine with openmpi-1.3.2
>> I see the same problem.
>>