Thanks for the input, John. Here are some responses (inline):

On Thu, Jul 25, 2019 at 1:21 PM John Hearns via users
<users@lists.open-mpi.org> wrote:
> Have you checked your ssh between nodes?

ssh is not allowed between nodes, but my understanding is that processes
should be getting set up and run by SGE, since it handles the queuing.

> Also how is your Path set up?

It should be using the same startup scripts as I use on other machines
within our dept, since the filesystem and home directories are shared
across both grid and non-grid machines. In any case, I have put in fully
qualified pathnames for everything that I start up.

> A. Construct a hosts file and mpirun by hand

I have looked at the hosts file, and it seems correct. I don't know that
I can pass a hosts file to mpirun directly, since SGE queues things and
determines what hosts will be assigned. (I sketch what I plan to try by
hand just below, after these responses.)

> B. Use modules rather than .bashrc files

Hmm. I don't really understand this one. (I know what both are, but I
don't understand what problem would be solved by converting to modules.)

> C. Slurm

I don't run the grid/cluster, so I can't choose the queuing tools that
are run. There are plans to migrate to Slurm at some point in the future,
but that doesn't help me now...
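Following up on suggestion A, here is roughly what I plan to try by hand
from inside a job script, to separate SGE's allocation from the launch
itself. This is a sketch I have not yet verified on our grid: the
hostfile format is the one I understand mpirun to accept, and
$PE_HOSTFILE and $NSLOTS are the variables SGE sets inside a
parallel-environment job.

#!/bin/bash
#$ -cwd
#$ -pe orte_fill 2

# SGE's pe_hostfile lines look like "host slots queue flags"; turn them
# into mpirun's "host slots=N" hostfile format.
awk '{print $1, "slots=" $2}' "$PE_HOSTFILE" > my_hosts

# Launch with an explicit hostfile rather than relying on the SGE/ORTE
# integration; $NSLOTS is the total slot count SGE granted.
/usr/bin/mpirun --hostfile my_hosts -np "$NSLOTS" ./hello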
Thanks!

-David Laidlaw


> On Thu, 25 Jul 2019, 18:00 David Laidlaw via users,
> <users@lists.open-mpi.org> wrote:
>
>> I have been trying to run some MPI jobs under SGE for almost a year
>> without success. What seems like a very simple test program fails; the
>> ingredients of it are below. Any suggestions on any piece of the test,
>> reasons for failure, requests for additional info, configuration
>> thoughts, etc. would be much appreciated. I suspect the linkage between
>> SGE and MPI, but can't identify the problem. We do have SGE support
>> built into MPI. We also have the SGE parallel environment (PE) set up
>> as described in several places on the web.
>>
>> Many thanks for any input!
>>
>> Cheers,
>>
>> -David Laidlaw
>>
>>
>> Here is how I submit the job:
>>
>> /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
>>
>> Here is what is in runme:
>>
>> #!/bin/bash
>> #$ -cwd
>> #$ -pe orte_fill 1
>> env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
>>
>> Here is hello.c:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char** argv) {
>>     // Initialize the MPI environment
>>     MPI_Init(NULL, NULL);
>>
>>     // Get the number of processes
>>     int world_size;
>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>
>>     // Get the rank of the process
>>     int world_rank;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>
>>     // Get the name of the processor
>>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>>     int name_len;
>>     MPI_Get_processor_name(processor_name, &name_len);
>>
>>     // Print off a hello world message
>>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>>            processor_name, world_rank, world_size);
>>     // system("printenv");
>>
>>     sleep(15); // sleep for 15 seconds
>>
>>     // Finalize the MPI environment.
>>     MPI_Finalize();
>> }
>>
>> This command will build it:
>>
>> mpicc hello.c -o hello
>>
>> Running produces the following:
>>
>> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
>> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on
>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp
>>   (--tmpdir/orte_tmpdir_base).
>>   Please check with your sys admin to determine the correct location
>>   to use.
>>
>> * compilation of the orted with dynamic libraries when static are
>>   required (e.g., on Cray). Please check your configure cmd line and
>>   consider using one of the contrib/platform definitions for your
>>   system type.
>>
>> * an inability to create a connection back to mpirun due to a
>>   lack of common network interfaces and/or no route found between
>>   them. Please check network connectivity (including firewalls
>>   and network routing requirements).
>> --------------------------------------------------------------------------
>>
>> and:
>>
>> [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
>>   /usr/bin/ssh <template> set path = ( /usr/bin $path ) ; if (
>>   $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH
>>   == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_llp == 1 )
>>   setenv LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ; if (
>>   $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if (
>>   $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /usr/lib ; if (
>>   $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH
>>   /usr/lib:$DYLD_LIBRARY_PATH ; /usr/bin/orted --hnp-topo-sig
>>   0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jobid
>>   "2446000128" -mca ess_base_vpid "<template>" -mca ess_base_num_procs
>>   "2" -mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
>>   --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1"
>>   -mca pmix "^s1,s2,cray"
>> ssh_exchange_identification: read: Connection reset by peer
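P.S. Rereading the verbose output above: the final launch template is
using /usr/bin/ssh, but ssh is blocked between our nodes, so Open MPI
would have to launch its daemons through SGE's qrsh instead. Two checks
I plan to run, based on my reading of the Open MPI FAQ (the expected
output is from memory, so treat it as approximate):

# 1. Confirm the gridengine component really is built into this Open MPI
#    install; if this prints nothing, the SGE integration is absent and
#    a fallback to ssh would be expected.
ompi_info | grep gridengine
# expected, roughly:
#   MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.1.1)

# 2. As an experiment, point the rsh launcher at qrsh explicitly. Open
#    MPI is supposed to pick qrsh automatically when it detects SGE, so
#    needing this would itself be a clue, and it may take more than just
#    the agent name to work.
env PATH="$PATH" /usr/bin/mpirun --mca plm_rsh_agent qrsh \
    -display-allocation ./hello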
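P.P.S. For completeness, the PE configuration we tried to follow "as
described in several places on the web" should have the shape the Open
MPI FAQ recommends. The listing below is that recommended shape with our
PE name substituted in, not a verified dump of our cluster's settings:

qconf -sp orte_fill
# pe_name            orte_fill
# slots              999
# user_lists         NONE
# xuser_lists        NONE
# start_proc_args    /bin/true
# stop_proc_args     /bin/true
# allocation_rule    $fill_up
# control_slaves     TRUE
# job_is_first_task  FALSE
# urgency_slots      min
#
# control_slaves TRUE is the critical line: without it SGE refuses the
# "qrsh -inherit" calls that Open MPI uses to start orted on the other
# nodes of an allocation.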