Thanks for the input, John.  Here are some responses (inline):

On Thu, Jul 25, 2019 at 1:21 PM John Hearns via users <
users@lists.open-mpi.org> wrote:

> Have you checked your ssh between nodes?
>

ssh is not allowed between nodes, but my understanding is that processes
should be getting set up and run by SGE, since it handles the queuing.
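
For what it's worth, a quick sanity check of the SGE integration might look
like this (assuming ompi_info and qconf are available on the grid nodes, and
that orte_fill is the PE in question):

   ompi_info | grep gridengine   # should list the gridengine component if support was built in
   qconf -sp orte_fill           # control_slaves should be TRUE so SGE can start the remote daemons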


> Also how is your Path set up?
>

It should be using the same startup scripts as I use on other machines
within our dept, since the filesystem and home directories are shared
across both grid and non-grid machines.  In any case, I have put in fully
qualified pathnames for everything that I start up.
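
Just to be explicit about it, the job script could also pin the environment
down directly; a minimal sketch, assuming the Open MPI install lives under
/usr (that prefix is a guess on my part):

   #$ -V                                              # export the submission environment to the job
   export PATH=/usr/bin:$PATH                         # assumed Open MPI prefix; adjust as needed
   export LD_LIBRARY_PATH=/usr/lib:$LD_LIBRARY_PATH   # likewise an assumed library path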


> A. Construct a hosts file and mpirun by hand
>

I have looked at the hosts file, and it seems correct.  I don't know that I
can pass a hosts file to mpirun directly, since SGE queues things and
determines what hosts will be assigned.
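
If I understand the suggestion, the by-hand variant would look something like
the sketch below (the hostnames are just examples, and I would expect it to
fail for us since it needs ssh between the nodes):

   cat > myhosts <<EOF
   dblade01 slots=1
   dblade02 slots=1
   EOF
   /usr/bin/mpirun --hostfile myhosts -np 2 ./hello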


>
> B. Use modules rather than .bashrc files
>

Hmm.  I don't really understand this one.  (I know what both are, but I
don't see what problem converting to modules would solve.)
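
If I follow, the modules route would amount to something like this in the job
script, assuming a module named openmpi exists on the grid (that name is a
guess):

   module load openmpi   # hypothetical module name on our systems
   which mpirun          # confirm which wrapper ends up on PATH
   mpicc hello.c -o hello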


> C. Slurm
>

I don't run the grid/cluster, so I can't choose which queuing tools are
used.  There are plans to migrate to Slurm at some point in the future, but
that doesn't help me now...

Thanks!

-David Laidlaw


>
> On Thu, 25 Jul 2019, 18:00 David Laidlaw via users, <
> users@lists.open-mpi.org> wrote:
>
>> I have been trying to run some MPI jobs under SGE for almost a year
>> without success.  What seems like a very simple test program fails; the
>> ingredients of it are below.  Any suggestions on any piece of the test,
>> reasons for failure, requests for additional info, configuration thoughts,
>> etc. would be much appreciated.  I suspect the linkage between SGE and
>> MPI, but can't identify the problem.  We do have SGE support built into
>> Open MPI.  We also have the SGE parallel environment (PE) set up as
>> described in several places on the web.
>>
>> Many thanks for any input!
>>
>> Cheers,
>>
>> -David Laidlaw
>>
>>
>>
>>
>> Here is how I submit the job:
>>
>>    /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
>>
>>
>> Here is what is in runme:
>>
>>   #!/bin/bash
>>   #$ -cwd
>>   #$ -pe orte_fill 1
>>   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
>>
>>
>> Here is hello.c:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char** argv) {
>>     // Initialize the MPI environment
>>     MPI_Init(NULL, NULL);
>>
>>     // Get the number of processes
>>     int world_size;
>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>
>>     // Get the rank of the process
>>     int world_rank;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>
>>     // Get the name of the processor
>>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>>     int name_len;
>>     MPI_Get_processor_name(processor_name, &name_len);
>>
>>     // Print off a hello world message
>>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>>            processor_name, world_rank, world_size);
>>     // system("printenv");
>>
>>     sleep(15); // sleep for 15 seconds so the processes stay alive long enough to observe
>>
>>     // Finalize the MPI environment.
>>     MPI_Finalize();
>> }
>>
>>
>> This command will build it:
>>
>>      mpicc hello.c -o hello
>>
>>
>> Running produces the following:
>>
>> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
>> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on
>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>   Please check with your sys admin to determine the correct location to use.
>>
>> * compilation of the orted with dynamic libraries when static are required
>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>   one of the contrib/platform definitions for your system type.
>>
>> * an inability to create a connection back to mpirun due to a
>>   lack of common network interfaces and/or no route found between
>>   them. Please check network connectivity (including firewalls
>>   and network routing requirements).
>> --------------------------------------------------------------------------
>>
>>
>> and:
>>
>> [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
>>         /usr/bin/ssh <template>  set path = ( /usr/bin $path ) ;
>>         if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ;
>>         if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /usr/lib ;
>>         if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ;
>>         if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ;
>>         if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /usr/lib ;
>>         if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;
>>         /usr/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:4L1:4C:4H:x86_64
>>         -mca ess "env" -mca ess_base_jobid "2446000128" -mca ess_base_vpid "<template>"
>>         -mca ess_base_num_procs "2" -mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
>>         --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca pmix "^s1,s2,cray"
>> ssh_exchange_identification: read: Connection reset by peer
>>
>>
>>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
