On 25.07.2019 at 23:00, David Laidlaw wrote:

> Here is most of the command output when run on a grid machine:
> 
> dblade65.dhl(101) mpiexec --version
> mpiexec (OpenRTE) 2.0.2

This version is quite old. I would suggest installing a fresh one. You can even 
compile one in your home directory and install it e.g. in 
$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared (by passing --prefix=…intended path… 
to configure) and then use this installation for all your jobs (adjust for your 
version of gcc). In your ~/.bash_profile and in the job script:

DEFAULT_MANPATH="$(manpath -q)"
MY_OMPI="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared"
export PATH="$MY_OMPI/bin:$PATH"
export LD_LIBRARY_PATH="$MY_OMPI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export MANPATH="$MY_OMPI/share/man${DEFAULT_MANPATH:+:$DEFAULT_MANPATH}"
unset MY_OMPI
unset DEFAULT_MANPATH
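
For the build itself, a minimal sketch (assuming the openmpi-3.1.4 tarball was 
unpacked under $HOME/build and gcc 7.4.0 is first in the PATH; adjust names and 
paths to your setup):

cd $HOME/build/openmpi-3.1.4
./configure --prefix=$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared --with-sge
make -j 4
make install

The --with-sge option requests the gridengine support (the MCA ras component 
your ompi_info output already shows), so the tight integration via qrsh is 
available.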

Installed this way, it will not conflict with the already installed system version.


> dblade65.dhl(102) ompi_info | grep grid
>                  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
> dblade65.dhl(103) c
> denied: host "dblade65.cs.brown.edu" is neither submit nor admin host
> dblade65.dhl(104) 

On a compute node this output is fine; the nodes are usually neither submit nor admin hosts.


> Does that suggest anything?
> 
> qconf is restricted to sysadmins, which I am not.

What error do you get if you try it anyway? Usually viewing the configuration is 
possible for any user; only changing it is restricted to the admins.
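
For example, these read-only queries should normally work for a plain user on a 
submit host (orte_fill is the PE name taken from your job script):

qconf -sconf | egrep "(command|daemon)"    # remote startup settings (qrsh vs. ssh)
qconf -spl                                 # list the defined parallel environments
qconf -sp orte_fill                        # show the settings of this PE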


> I would note that we are running debian stretch on the cluster machines.  On 
> some of our other (non-grid) machines, running debian buster, the output is:
> 
> cslab3d.dhl(101) mpiexec --version
> mpiexec (OpenRTE) 3.1.3
> Report bugs to http://www.open-mpi.org/community/help/
> cslab3d.dhl(102) ompi_info | grep grid
>                  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v3.1.3)

If you compile on such a machine and intend to run the binary in the cluster, it 
won't work, as the Open MPI versions don't match. Hence the suggestion above: use 
a personal installation in your $HOME for both compiling and running the 
applications.
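
A quick check that the personal installation is the one actually being used (a 
sketch, using the prefix assumed above):

which mpicc mpiexec       # both should resolve to $HOME/local/openmpi-3.1.4_gcc-7.4.0_shared/bin
mpicc hello.c -o hello
ldd ./hello | grep mpi    # libmpi should come from the lib64 directory of the same prefix
mpiexec --version         # should now report 3.1.4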

Side note: Open MPI binds processes to cores by default. If more than one MPI job 
runs on a node you will have to use `mpiexec --bind-to none …`, as otherwise all 
jobs on that node will be bound starting at core 0 and compete for the same cores.
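
Putting the pieces together, the job script could then look like this (a sketch 
based on your runme; the slot count 4 is only an example, and with working tight 
integration mpiexec takes the number of processes from the granted slots, so no 
-np is needed):

#!/bin/bash
#$ -cwd
#$ -pe orte_fill 4
MY_OMPI="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared"
export PATH="$MY_OMPI/bin:$PATH"
export LD_LIBRARY_PATH="$MY_OMPI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
mpiexec --bind-to none ./hello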

-- Reuti


> Thanks!
> 
> -David Laidlaw
> 
> On Thu, Jul 25, 2019 at 2:13 PM Reuti <re...@staff.uni-marburg.de> wrote:
> 
> On 25.07.2019 at 18:59, David Laidlaw via users wrote:
> 
> > I have been trying to run some MPI jobs under SGE for almost a year without 
> > success.  What seems like a very simple test program fails; the ingredients 
> > of it are below.  Any suggestions on any piece of the test, reasons for 
> > failure, requests for additional info, configuration thoughts, etc. would 
> > be much appreciated.  I suspect the linkage between SGE and MPI, but can't 
> > identify the problem.  We do have SGE support built into MPI.  We also have 
> > the SGE parallel environment (PE) set up as described in several places on 
> > the web.
> > 
> > Many thanks for any input!
> 
> Did you compile Open MPI on your own or was it delivered with the Linux 
> distribution? That it tries to use `ssh` is quite strange, as nowadays Open 
> MPI and others have built-in support to detect that they are running under 
> the control of a queuing system. It should use `qrsh` in your case.
> 
> What does:
> 
> mpiexec --version
> ompi_info | grep grid
> 
> reveal? What does:
> 
> qconf -sconf | egrep "(command|daemon)"
> 
> show?
> 
> -- Reuti
> 
> 
> > Cheers,
> > 
> > -David Laidlaw
> > 
> > 
> > 
> > 
> > Here is how I submit the job:
> > 
> >    /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
> > 
> > 
> > Here is what is in runme:
> > 
> >   #!/bin/bash
> >   #$ -cwd
> >   #$ -pe orte_fill 1
> >   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
> > 
> > 
> > Here is hello.c:
> > 
> > #include <mpi.h>
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <stdlib.h>
> > 
> > int main(int argc, char** argv) {
> >     // Initialize the MPI environment
> >     MPI_Init(NULL, NULL);
> > 
> >     // Get the number of processes
> >     int world_size;
> >     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> > 
> >     // Get the rank of the process
> >     int world_rank;
> >     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> > 
> >     // Get the name of the processor
> >     char processor_name[MPI_MAX_PROCESSOR_NAME];
> >     int name_len;
> >     MPI_Get_processor_name(processor_name, &name_len);
> > 
> >     // Print off a hello world message
> >     printf("Hello world from processor %s, rank %d out of %d processors\n",
> >            processor_name, world_rank, world_size);
> >     // system("printenv");
> > 
> >     sleep(15); // sleep for 15 seconds
> > 
> >     // Finalize the MPI environment.
> >     MPI_Finalize();
> > }
> > 
> > 
> > This command will build it:
> > 
> >      mpicc hello.c -o hello
> > 
> > 
> > Running produces the following:
> > 
> > /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> > dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> > --------------------------------------------------------------------------
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> > 
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> > 
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> > 
> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to use.
> > 
> > *  compilation of the orted with dynamic libraries when static are required
> >   (e.g., on Cray). Please check your configure cmd line and consider using
> >   one of the contrib/platform definitions for your system type.
> > 
> > * an inability to create a connection back to mpirun due to a
> >   lack of common network interfaces and/or no route found between
> >   them. Please check network connectivity (including firewalls
> >   and network routing requirements).
> > --------------------------------------------------------------------------
> > 
> > 
> > and:
> > 
> > [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
> >         /usr/bin/ssh <template>  set path = ( /usr/bin $path ) ;
> >         if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ;
> >         if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /usr/lib ;
> >         if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ;
> >         if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ;
> >         if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /usr/lib ;
> >         if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;
> >         /usr/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:4L1:4C:4H:x86_64
> >         -mca ess "env" -mca ess_base_jobid "2446000128" -mca ess_base_vpid "<template>"
> >         -mca ess_base_num_procs "2" -mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
> >         --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca pmix "^s1,s2,cray"
> > ssh_exchange_identification: read: Connection reset by peer
> > 
> > 
> > 
> 

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
