I have been trying to run MPI jobs under SGE for almost a year without success. Even a very simple test program fails; all the pieces of the test are below. Any suggestions on any part of the test, possible reasons for the failure, requests for additional information, configuration thoughts, etc. would be much appreciated. I suspect the linkage between SGE and Open MPI, but I can't pin down the problem. We do have SGE support built into Open MPI, and we have the SGE parallel environment (PE) set up as described in several places on the web.
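Concretely, the PE definition those guides recommend looks something like the following. This is an illustrative sketch of "qconf -sp orte_fill" output with the settings usually suggested for Open MPI tight integration, not a verbatim dump of our cluster's configuration:

$ qconf -sp orte_fill
pe_name            orte_fill
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
# control_slaves TRUE and job_is_first_task FALSE are the values normally
# recommended so that mpirun can launch its daemons through qrsh.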
Many thanks for any input!

Cheers,
-David Laidlaw

Here is how I submit the job:

/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme

Here is what is in runme:

#!/bin/bash
#$ -cwd
#$ -pe orte_fill 1
env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello

Here is hello.c:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // system("printenv");
    sleep(15);  // sleep for 15 seconds

    // Finalize the MPI environment.
    MPI_Finalize();
}

This command builds it:

mpicc hello.c -o hello

Running produces the following:

/var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on one or more
  nodes. Please check your PATH and LD_LIBRARY_PATH settings, or
  configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
  (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

and:

[dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template> set path = ( /usr/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ; /usr/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jobid "2446000128" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791" --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca pmix "^s1,s2,cray"
ssh_exchange_identification: read: Connection reset by peer
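For completeness, the way one can verify the "SGE support built in" claim is to ask ompi_info whether the gridengine component is present. The output below is a sketch of what we would expect to see; the version strings are illustrative and will differ on our install:

$ ompi_info | grep gridengine
# expected if Open MPI was built with SGE support:
                 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.1.1)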