Am 25.07.2019 um 18:59 schrieb David Laidlaw via users:

> I have been trying to run some MPI jobs under SGE for almost a year without 
> success.  What seems like a very simple test program fails; the ingredients 
> of it are below.  Any suggestions on any piece of the test, reasons for 
> failure, requests for additional info, configuration thoughts, etc. would be 
> much appreciated.  I suspect the linkage between SGE and MPI, but can't 
> identify the problem.  We do have SGE support built into MPI.  We also have 
> the SGE parallel environment (PE) set up as described in several places on 
> the web.
> 
> Many thanks for any input!

Did you compile Open MPI on your own or was it delivered with the Linux 
distribution? That it tries to use `ssh` is quite strange, as nowadays Open MPI 
and others have built-in support to detect that they are running under the 
control of a queuing system. It should use `qrsh` in your case.
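
A quick way to see what the job environment looks like from inside the job script is sketched below. It rests on an assumption on my side: that the variables Open MPI's gridengine support looks for are SGE_ROOT, ARC, PE_HOSTFILE and JOB_ID. The function name `check_sge_env` is made up for this illustration.

```shell
#!/bin/sh
# Report which SGE-related variables are set in the current environment.
# Assumption: these four are the ones Open MPI checks to detect SGE.
check_sge_env() {
    missing=0
    for var in SGE_ROOT ARC PE_HOSTFILE JOB_ID; do
        # Indirect lookup of the variable named in $var (POSIX sh).
        eval "val=\${$var:-}"
        if [ -n "$val" ]; then
            echo "$var=$val"
        else
            echo "$var is not set"
            missing=1
        fi
    done
    # Returns 0 only when all four variables are set.
    return "$missing"
}

check_sge_env || echo "not all SGE variables are set in this environment" >&2
```

Placed in `runme` ahead of the `mpirun` line, this would show whether the job actually sees the SGE environment at the moment Open MPI starts.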

What does:

mpiexec --version
ompi_info | grep grid

reveal? What does:

qconf -sconf | egrep "(command|daemon)"

show?
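
For the `ompi_info` part, the check can be wrapped roughly as follows. This is only a sketch: the helper name `has_gridengine` is hypothetical, and I am assuming that a build with SGE support lists a component whose name contains "gridengine" (typically on a line starting with "MCA ras:").

```shell
#!/bin/sh
# Read `ompi_info` output on stdin; succeed if a gridengine component is listed.
# Assumption: an SGE-enabled build reports a component named "gridengine".
has_gridengine() {
    grep -q 'gridengine'
}

# Only run the check when ompi_info is actually in PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_gridengine; then
        echo "gridengine component found"
    else
        echo "no gridengine component listed; this build probably lacks SGE support"
    fi
fi
```

If no gridengine component shows up, that would fit the `plm:rsh` and `/usr/bin/ssh` lines in the verbose output quoted below.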

-- Reuti


> Cheers,
> 
> -David Laidlaw
> 
> 
> 
> 
> Here is how I submit the job:
> 
>    /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
> 
> 
> Here is what is in runme:
> 
>   #!/bin/bash
>   #$ -cwd
>   #$ -pe orte_fill 1
>   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
> 
> 
> Here is hello.c:
> 
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> 
> int main(int argc, char** argv) {
>     // Initialize the MPI environment
>     MPI_Init(NULL, NULL);
> 
>     // Get the number of processes
>     int world_size;
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> 
>     // Get the rank of the process
>     int world_rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> 
>     // Get the name of the processor
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>     int name_len;
>     MPI_Get_processor_name(processor_name, &name_len);
> 
>     // Print off a hello world message
>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>            processor_name, world_rank, world_size);
>     // system("printenv");
> 
>     sleep(15); // sleep for 15 seconds
> 
>     // Finalize the MPI environment.
>     MPI_Finalize();
> }
> 
> 
> This command will build it:
> 
>      mpicc hello.c -o hello
> 
> 
> Running produces the following:
> 
> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> 
> 
> and:
> 
> [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
>         /usr/bin/ssh <template>     set path = ( /usr/bin $path ) ;
>         if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ;
>         if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /usr/lib ;
>         if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ;
>         if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ;
>         if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /usr/lib ;
>         if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;
>         /usr/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:4L1:4C:4H:x86_64
>         -mca ess "env" -mca ess_base_jobid "2446000128"
>         -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2"
>         -mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
>         --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1"
>         -mca pmix "^s1,s2,cray"
> ssh_exchange_identification: read: Connection reset by peer
> 
> 
> 
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
