I will try building a newer ompi version in my home directory, but that
will take me some time.
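
Roughly what I have in mind, in case it helps (the version, compiler, and
install path below are just placeholders I picked; I have not actually run
this yet):

  # download and unpack a current Open MPI release
  wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.4.tar.bz2
  tar xjf openmpi-3.1.4.tar.bz2
  cd openmpi-3.1.4
  # build with SGE support and install under my home directory
  ./configure --prefix=$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared --with-sge
  make -j4 && make install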

qconf is not available to me on any machine.  It produces the same error
wherever I am able to try it:

> denied: host "dblade65.cs.brown.edu" is neither submit nor admin host


Here is what it produces when I have a sysadmin run it:

$ qconf -sconf | egrep "(command|daemon)"
qlogin_command               /sysvol/sge.test/bin/qlogin-wrapper
qlogin_daemon                /sysvol/sge.test/bin/grid-sshd -i
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin


Does that suggest anything?
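
For what it's worth, my (possibly wrong) understanding is that with
rsh_command/rsh_daemon set to builtin, Open MPI's gridengine support should
start the remote daemons through SGE rather than ssh, roughly in this form
(a hypothetical invocation, not something taken from our logs; the hostname
is made up):

  qrsh -inherit -nostdin -V dblade02.cs.brown.edu orted <args...>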

Thanks!

-David Laidlaw




On Thu, Jul 25, 2019 at 5:21 PM Reuti <re...@staff.uni-marburg.de> wrote:

>
> On 25.07.2019 at 23:00, David Laidlaw wrote:
>
> > Here is most of the command output when run on a grid machine:
> >
> > dblade65.dhl(101) mpiexec --version
> > mpiexec (OpenRTE) 2.0.2
>
> This is quite old. I would suggest installing a fresh one. You can
> even compile one in your home directory and install it e.g. in
> $HOME/local/openmpi-3.1.4_gcc-7.4.0_shared (by --prefix=…intended path…),
> then use this for all your jobs (adjust for your version of gcc). In
> your ~/.bash_profile and in the job script:
>
> DEFAULT_MANPATH="$(manpath -q)"
> MY_OMPI="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared"
> export PATH="$MY_OMPI/bin:$PATH"
> export LD_LIBRARY_PATH="$MY_OMPI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
> export MANPATH="$MY_OMPI/share/man${DEFAULT_MANPATH:+:$DEFAULT_MANPATH}"
> unset MY_OMPI
> unset DEFAULT_MANPATH
>
> This way there is no conflict with the version that is already installed.
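>
> For example, a job script using such a personal installation could look
> like this (a sketch; the PE name and slot count are only examples):
>
> #!/bin/bash
> #$ -cwd
> #$ -pe orte_fill 4
> # same PATH/LD_LIBRARY_PATH settings as above, so the personal mpiexec is found
> export PATH="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared/bin:$PATH"
> export LD_LIBRARY_PATH="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
> mpiexec ./hello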
>
>
> > dblade65.dhl(102) ompi_info | grep grid
> >                  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
> > dblade65.dhl(103) c
> > denied: host "dblade65.cs.brown.edu" is neither submit nor admin host
> > dblade65.dhl(104)
>
> On a node it's OK this way (compute nodes are usually neither submit nor admin hosts).
>
>
> > Does that suggest anything?
> >
> > qconf is restricted to sysadmins, which I am not.
>
> What error is output if you try it anyway? Usually viewing the
> configuration is possible even for ordinary users.
>
>
> > I would note that we are running Debian stretch on the cluster
> > machines.  On some of our other (non-grid) machines, running Debian
> > buster, the output is:
> >
> > cslab3d.dhl(101) mpiexec --version
> > mpiexec (OpenRTE) 3.1.3
> > Report bugs to http://www.open-mpi.org/community/help/
> > cslab3d.dhl(102) ompi_info | grep grid
> >                  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v3.1.3)
>
> If you compile on such a machine and intend to run the result in the
> cluster, it won't work, as the versions don't match. Hence the suggestion
> above: use a personal installation in your $HOME for both compiling and
> running the applications.
>
> Side note: Open MPI binds processes to cores by default. In case more
> than one MPI job is running on a node, one will have to use `mpiexec
> --bind-to none …`, as otherwise all jobs on that node will bind to core 0
> upwards and oversubscribe the same cores.
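>
> For example (the slot count here is only illustrative):
>
> mpiexec --bind-to none -np 4 ./hello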
>
> -- Reuti
>
>
> > Thanks!
> >
> > -David Laidlaw
> >
> > On Thu, Jul 25, 2019 at 2:13 PM Reuti <re...@staff.uni-marburg.de> wrote:
> >
> > On 25.07.2019 at 18:59, David Laidlaw via users wrote:
> >
> > > I have been trying to run some MPI jobs under SGE for almost a year
> > > without success.  What seems like a very simple test program fails; the
> > > ingredients of it are below.  Any suggestions on any piece of the test,
> > > reasons for failure, requests for additional info, configuration
> > > thoughts, etc. would be much appreciated.  I suspect the linkage between
> > > SGE and MPI, but can't identify the problem.  We do have SGE support
> > > built into MPI.  We also have the SGE parallel environment (PE) set up
> > > as described in several places on the web.
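> > >
> > > For reference, the recipe we followed defines the PE roughly like this
> > > (a sketch based on those web pages; our exact values may differ):
> > >
> > >   $ qconf -sp orte_fill
> > >   pe_name            orte_fill
> > >   slots              9999
> > >   user_lists         NONE
> > >   xuser_lists        NONE
> > >   start_proc_args    /bin/true
> > >   stop_proc_args     /bin/true
> > >   allocation_rule    $fill_up
> > >   control_slaves     TRUE
> > >   job_is_first_task  FALSE
> > >   urgency_slots      min
> > >   accounting_summary TRUE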
> > >
> > > Many thanks for any input!
> >
> > Did you compile Open MPI on your own or was it delivered with the Linux
> > distribution? That it tries to use `ssh` is quite strange, as nowadays
> > Open MPI and others have built-in support to detect that they are running
> > under the control of a queuing system. It should use `qrsh` in your case.
> >
> > What does:
> >
> > mpiexec --version
> > ompi_info | grep grid
> >
> > reveal? What does:
> >
> > qconf -sconf | egrep "(command|daemon)"
> >
> > show?
> >
> > -- Reuti
> >
> >
> > > Cheers,
> > >
> > > -David Laidlaw
> > >
> > >
> > >
> > >
> > > Here is how I submit the job:
> > >
> > >    /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
> > >
> > >
> > > Here is what is in runme:
> > >
> > >   #!/bin/bash
> > >   #$ -cwd
> > >   #$ -pe orte_fill 1
> > >   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
> > >
> > >
> > > Here is hello.c:
> > >
> > > #include <mpi.h>
> > > #include <stdio.h>
> > > #include <unistd.h>
> > > #include <stdlib.h>
> > >
> > > int main(int argc, char** argv) {
> > >     // Initialize the MPI environment
> > >     MPI_Init(NULL, NULL);
> > >
> > >     // Get the number of processes
> > >     int world_size;
> > >     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> > >
> > >     // Get the rank of the process
> > >     int world_rank;
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> > >
> > >     // Get the name of the processor
> > >     char processor_name[MPI_MAX_PROCESSOR_NAME];
> > >     int name_len;
> > >     MPI_Get_processor_name(processor_name, &name_len);
> > >
> > >     // Print off a hello world message
> > >     printf("Hello world from processor %s, rank %d out of %d processors\n",
> > >            processor_name, world_rank, world_size);
> > >     // system("printenv");
> > >
> > >     sleep(15); // sleep for 15 seconds
> > >
> > >     // Finalize the MPI environment.
> > >     MPI_Finalize();
> > > }
> > >
> > >
> > > This command will build it:
> > >
> > >      mpicc hello.c -o hello
> > >
> > >
> > > Running produces the following:
> > >
> > > /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> > > dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> > >
> > > --------------------------------------------------------------------------
> > > ORTE was unable to reliably start one or more daemons.
> > > This usually is caused by:
> > >
> > > * not finding the required libraries and/or binaries on
> > >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> > >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> > >
> > > * lack of authority to execute on one or more specified nodes.
> > >   Please verify your allocation and authorities.
> > >
> > > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
> > >   Please check with your sys admin to determine the correct location to use.
> > >
> > > * compilation of the orted with dynamic libraries when static are required
> > >   (e.g., on Cray). Please check your configure cmd line and consider using
> > >   one of the contrib/platform definitions for your system type.
> > >
> > > * an inability to create a connection back to mpirun due to a
> > >   lack of common network interfaces and/or no route found between
> > >   them. Please check network connectivity (including firewalls
> > >   and network routing requirements).
> > > --------------------------------------------------------------------------
> > >
> > >
> > > and:
> > >
> > > [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
> > >         /usr/bin/ssh <template>     set path = ( /usr/bin $path ) ;
> > > if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ;
> > > if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /usr/lib ;
> > > if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ;
> > > if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ;
> > > if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /usr/lib ;
> > > if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;
> > > /usr/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:4L1:4C:4H:x86_64
> > > -mca ess "env" -mca ess_base_jobid "2446000128" -mca ess_base_vpid "<template>"
> > > -mca ess_base_num_procs "2" -mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
> > > --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1"
> > > -mca pmix "^s1,s2,cray"
> > > ssh_exchange_identification: read: Connection reset by peer