On 25.07.2019 at 23:00, David Laidlaw wrote:
> Here is most of the command output when run on a grid machine:
>
> dblade65.dhl(101) mpiexec --version
> mpiexec (OpenRTE) 2.0.2

This is quite old. I would suggest installing a fresh one. You can even compile one in your home directory and install it e.g. in $HOME/local/openmpi-3.1.4_gcc-7.4.0_shared (by --prefix=…intended path…) and then access this for all your jobs (adjust for your version of gcc); a rough build recipe is sketched at the end of this mail. In your ~/.bash_profile and the job script:

DEFAULT_MANPATH="$(manpath -q)"
MY_OMPI="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared"
export PATH="$MY_OMPI/bin:$PATH"
export LD_LIBRARY_PATH="$MY_OMPI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export MANPATH="$MY_OMPI/share/man${DEFAULT_MANPATH:+:$DEFAULT_MANPATH}"
unset MY_OMPI
unset DEFAULT_MANPATH

Essentially there is no conflict with the already installed version.

> dblade65.dhl(102) ompi_info | grep grid
>                  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
> dblade65.dhl(103) c
> denied: host "dblade65.cs.brown.edu" is neither submit nor admin host
> dblade65.dhl(104)

On a node it's OK this way.

> Does that suggest anything?
>
> qconf is restricted to sysadmins, which I am not.

What error is output if you try it anyway? Usually viewing the configuration is always possible.

> I would note that we are running debian stretch on the cluster machines. On
> some of our other (non-grid) machines, running debian buster, the output is:
>
> cslab3d.dhl(101) mpiexec --version
> mpiexec (OpenRTE) 3.1.3
> Report bugs to http://www.open-mpi.org/community/help/
> cslab3d.dhl(102) ompi_info | grep grid
>                  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v3.1.3)

If you compile on such a machine and intend to run the binary in the cluster it won't work, as the versions don't match. Hence the suggestion above: use a personal version in your $HOME for both compiling and running the applications.

Side note: Open MPI binds the processes to cores by default. In case more than one MPI job is running on a node, one will have to use `mpiexec --bind-to none …`, as otherwise all jobs on this node will use core 0 upwards (see the example job script at the end of this mail).

-- Reuti

> Thanks!
>
> -David Laidlaw
>
> On Thu, Jul 25, 2019 at 2:13 PM Reuti <re...@staff.uni-marburg.de> wrote:
>
> > On 25.07.2019 at 18:59, David Laidlaw via users wrote:
> >
> > > I have been trying to run some MPI jobs under SGE for almost a year without
> > > success. What seems like a very simple test program fails; the ingredients
> > > of it are below. Any suggestions on any piece of the test, reasons for
> > > failure, requests for additional info, configuration thoughts, etc. would
> > > be much appreciated. I suspect the linkage between SGE and MPI, but can't
> > > identify the problem. We do have SGE support built into MPI. We also have
> > > the SGE parallel environment (PE) set up as described in several places on
> > > the web.
> > >
> > > Many thanks for any input!
> >
> > Did you compile Open MPI on your own or was it delivered with the Linux
> > distribution? That it tries to use `ssh` is quite strange, as nowadays Open
> > MPI and others have built-in support to detect that they are running under
> > the control of a queuing system. It should use `qrsh` in your case.
> >
> > What does:
> >
> > mpiexec --version
> > ompi_info | grep grid
> >
> > reveal? What does:
> >
> > qconf -sconf | egrep "(command|daemon)"
> >
> > show?
> > -- Reuti
> >
> > > Cheers,
> > >
> > > -David Laidlaw
> > >
> > >
> > > Here is how I submit the job:
> > >
> > > /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
> > >
> > >
> > > Here is what is in runme:
> > >
> > > #!/bin/bash
> > > #$ -cwd
> > > #$ -pe orte_fill 1
> > > env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
> > >
> > >
> > > Here is hello.c:
> > >
> > > #include <mpi.h>
> > > #include <stdio.h>
> > > #include <unistd.h>
> > > #include <stdlib.h>
> > >
> > > int main(int argc, char** argv) {
> > >     // Initialize the MPI environment
> > >     MPI_Init(NULL, NULL);
> > >
> > >     // Get the number of processes
> > >     int world_size;
> > >     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> > >
> > >     // Get the rank of the process
> > >     int world_rank;
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> > >
> > >     // Get the name of the processor
> > >     char processor_name[MPI_MAX_PROCESSOR_NAME];
> > >     int name_len;
> > >     MPI_Get_processor_name(processor_name, &name_len);
> > >
> > >     // Print off a hello world message
> > >     printf("Hello world from processor %s, rank %d out of %d processors\n",
> > >            processor_name, world_rank, world_size);
> > >     // system("printenv");
> > >
> > >     sleep(15); // sleep for 15 seconds
> > >
> > >     // Finalize the MPI environment.
> > >     MPI_Finalize();
> > > }
> > >
> > >
> > > This command will build it:
> > >
> > > mpicc hello.c -o hello
> > >
> > >
> > > Running produces the following:
> > >
> > > /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> > > dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> > > --------------------------------------------------------------------------
> > > ORTE was unable to reliably start one or more daemons.
> > > This usually is caused by:
> > >
> > > * not finding the required libraries and/or binaries on
> > >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> > >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> > >
> > > * lack of authority to execute on one or more specified nodes.
> > >   Please verify your allocation and authorities.
> > >
> > > * the inability to write startup files into /tmp
> > >   (--tmpdir/orte_tmpdir_base).
> > >   Please check with your sys admin to determine the correct location to use.
> > >
> > > * compilation of the orted with dynamic libraries when static are required
> > >   (e.g., on Cray). Please check your configure cmd line and consider using
> > >   one of the contrib/platform definitions for your system type.
> > >
> > > * an inability to create a connection back to mpirun due to a
> > >   lack of common network interfaces and/or no route found between
> > >   them. Please check network connectivity (including firewalls
> > >   and network routing requirements).
> > > --------------------------------------------------------------------------
> > >
> > >
> > > and:
> > >
> > > [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
> > >     /usr/bin/ssh <template> set path = ( /usr/bin $path ) ;
> > >     if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ;
> > >     if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /usr/lib ;
> > >     if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ;
> > >     if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ;
> > >     if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /usr/lib ;
> > >     if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;
> > >     /usr/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:4L1:4C:4H:x86_64
> > >     -mca ess "env" -mca ess_base_jobid "2446000128" -mca ess_base_vpid "<template>"
> > >     -mca ess_base_num_procs "2"
> > >     -mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
> > >     --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1"
> > >     -mca pmix "^s1,s2,cray"
> > > ssh_exchange_identification: read: Connection reset by peer
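
P.S. Here are the sketches mentioned above. First, roughly how a personal Open MPI could be built and installed in $HOME. The version, compiler and install path are only examples (they match the path used above), and adding --with-sge should ensure that the gridengine support is built:

# Rough sketch only: adjust version, compiler and install path to your setup.
cd "$HOME/src"                     # example location; tarball from www.open-mpi.org
tar xf openmpi-3.1.4.tar.bz2
cd openmpi-3.1.4
./configure --prefix="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared" \
            --with-sge CC=gcc CXX=g++ FC=gfortran
make -j 4
make install
# afterwards, with the new PATH active:
ompi_info | grep gridengine        # should now report the 3.1.4 components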
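
A job script (like the `runme` quoted above) could then look roughly like this. The PE name and slot count are only examples, and --bind-to none is only needed when several MPI jobs may share a node:

#!/bin/bash
#$ -cwd
#$ -pe orte_fill 4

# Put the personal Open MPI first in the search paths (same example path as above).
MY_OMPI="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared"
export PATH="$MY_OMPI/bin:$PATH"
export LD_LIBRARY_PATH="$MY_OMPI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# No hostfile and no -np: the gridengine support reads the allocation from the
# PE and starts the remote daemons via `qrsh` instead of `ssh`.
mpiexec --bind-to none ./hello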
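
For reference, a parallel environment for a tight integration typically looks something like the `qconf -sp orte_fill` output below (the values are only a typical example). The decisive entry is control_slaves TRUE, as otherwise the qrsh calls used for the tight integration are refused:

pe_name            orte_fill
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE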
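
Finally, to check on a compute node that a job really picks up the intended installation and not the distribution's 2.0.2, a few quick checks could be put at the top of the job script (the exact output will of course differ):

type mpiexec                  # which mpiexec is found first in PATH?
mpiexec --version             # should report 3.1.4, not 2.0.2
ompi_info | grep gridengine   # gridengine support of exactly this installation
ldd ./hello | grep -i libmpi  # the binary should resolve to the matching libmpi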