On 25.07.2019 at 18:59, David Laidlaw via users wrote:
> I have been trying to run some MPI jobs under SGE for almost a year without
> success. What seems like a very simple test program fails; the ingredients
> of it are below. Any suggestions on any piece of the test, reasons for
> failure, requests for additional info, configuration thoughts, etc. would be
> much appreciated. I suspect the linkage between SGE and MPI, but can't
> identify the problem. We do have SGE support built into MPI. We also have
> the SGE parallel environment (PE) set up as described in several places on
> the web.
>
> Many thanks for any input!

Did you compile Open MPI yourself, or was it delivered with the Linux
distribution? That it tries to use `ssh` is quite strange, as nowadays Open
MPI and others have built-in support to detect that they are running under
the control of a queuing system. It should use `qrsh` in your case.

What does:

mpiexec --version
ompi_info | grep grid

reveal? What does:

qconf -sconf | egrep "(command|daemon)"

show? (Illustrative examples of what these typically report on a working
tight-integration setup are appended after the quoted output below.)

-- Reuti

> Cheers,
>
> -David Laidlaw
>
>
> Here is how I submit the job:
>
> /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
>
>
> Here is what is in runme:
>
> #!/bin/bash
> #$ -cwd
> #$ -pe orte_fill 1
> env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
>
>
> Here is hello.c:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
>
> int main(int argc, char** argv) {
>     // Initialize the MPI environment
>     MPI_Init(NULL, NULL);
>
>     // Get the number of processes
>     int world_size;
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>
>     // Get the rank of the process
>     int world_rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>
>     // Get the name of the processor
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>     int name_len;
>     MPI_Get_processor_name(processor_name, &name_len);
>
>     // Print off a hello world message
>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>            processor_name, world_rank, world_size);
>     // system("printenv");
>
>     sleep(15); // sleep for 15 seconds
>
>     // Finalize the MPI environment.
>     MPI_Finalize();
>
>     return 0;
> }
>
>
> This command will build it:
>
> mpicc hello.c -o hello
>
>
> Running produces the following:
>
> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
>
>
> and:
>
> [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
>     /usr/bin/ssh <template> set path = ( /usr/bin $path ) ;
>     if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ;
>     if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /usr/lib ;
>     if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ;
>     if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ;
>     if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /usr/lib ;
>     if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;
>     /usr/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:4L1:4C:4H:x86_64
>     -mca ess "env" -mca ess_base_jobid "2446000128"
>     -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2"
>     -mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
>     --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1"
>     -mca pmix "^s1,s2,cray"
> ssh_exchange_identification: read: Connection reset by peer

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
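
For reference, a build with SGE support normally lists a gridengine component
in ompi_info, and raising the launcher verbosity shows which remote agent
mpirun selects. This is only an illustrative sketch; the component version
numbers shown are assumptions, not output from this cluster:

    ompi_info | grep gridengine
    #            MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v3.1.2)

    # Re-run the job with a higher verbosity level; under a working tight
    # integration the final template argv should name qrsh rather than
    # /usr/bin/ssh:
    env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 10 -display-allocation ./hello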
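
Similarly, a sketch of the Grid Engine side of a tight integration. The
values below are typical settings, not queried from this cluster; orte_fill
is the PE name taken from the job script above:

    qconf -sconf | egrep "(command|daemon)"
    #   qlogin_command               builtin
    #   qlogin_daemon                builtin
    #   rlogin_command               builtin
    #   rlogin_daemon                builtin
    #   rsh_command                  builtin
    #   rsh_daemon                   builtin

    qconf -sp orte_fill
    #   pe_name            orte_fill
    #   slots              9999
    #   start_proc_args    NONE
    #   stop_proc_args     NONE
    #   allocation_rule    $fill_up
    #   control_slaves     TRUE
    #   job_is_first_task  FALSE

    # control_slaves TRUE is what permits qrsh -inherit to start the orted
    # daemons on the granted nodes under execd control.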