As a suggestion, can we see the configuration of your Parallel Environment?

qconf -spl

qconf -sp orte2
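
In particular, for tight integration with Open MPI the PE normally needs
control_slaves set to TRUE (so mpirun can launch its daemons via qrsh
-inherit) and job_is_first_task set to FALSE. As a rough sketch, a working
PE often looks something like the following; the name orte2 is taken from
your qsub line, and the slot count and allocation rule are just
illustrative:

    pe_name            orte2
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    $round_robin
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE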

On Mon, 1 Jun 2020 at 22:20, Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> Afraid I have no real ideas here. The best I can suggest is taking the
> qrsh command line from the prior debug output and trying to run it
> manually. That would give you a chance to manipulate it and see if you can
> identify what, if anything, is causing it a problem. Without mpirun
> executing, the daemons will bark about being unable to connect back, so you
> might need to use some other test program for this purpose.
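>
> As a rough sketch of what I mean, from a shell on the job's master host
> you could try something like the line below; the host name cod4 is just
> borrowed from the output further down, and the exact flags to use are
> whatever appears in your plm debug output:
>
>     qrsh -inherit -nostdin -V cod4 hostname
>
> If even a trivial command like hostname trips the same errors, the problem
> is in the qrsh/remote-shell layer rather than in orted itself.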
>
> I agree with Jeff - you should check to see where these messages are
> coming from:
>
>
> >> Server daemon successfully started with task id "1.cod4"
> >> Server daemon successfully started with task id "1.cod5"
> >> Server daemon successfully started with task id "1.cod6"
> >> Server daemon successfully started with task id "1.has6"
> >> Server daemon successfully started with task id "1.hpb12"
> >> Server daemon successfully started with task id "1.has4"
> >
> >> Unmatched ".
> >> Unmatched ".
> >> Unmatched ".
> >
>
>
> Could be a clue as to what is actually happening.
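>
> One way to track that down (assuming $SGE_ROOT points at your Grid Engine
> installation) would be to grep the install tree for the message text:
>
>     grep -r "Server daemon successfully started" $SGE_ROOT 2>/dev/null
>
> If the string turns up in an SGE binary or wrapper script, then qrsh itself
> is injecting those lines into the job's output stream.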
>
>
> > On Jun 1, 2020, at 1:57 PM, Kulshrestha, Vipul via users <
> users@lists.open-mpi.org> wrote:
> >
> > Thanks, Jeff & Ralph, for your responses.
> >
> > I tried changing the verbose level to 5 using the option suggested by
> > Ralph, but there was no difference in the output (no additional
> > information appeared).
> >
> > I also tried replacing the grid submission script with a command-line
> > qsub job submission, but hit the same issue. Without the job submission
> > script, the qsub command looks like the one below. It uses the mpirun
> > option "--N 1" to ensure that mpirun launches only one process on each
> > host.
> >
> > Do you have any suggestions on how I can go about investigating the root
> > cause of the problem I am facing? I am able to run mpirun successfully if
> > I specify the same set of hosts (as allocated by grid) in an mpirun host
> > file. I have also pasted the verbose output from the host-file run; the
> > orted command looks very similar to the one generated for the grid
> > submission (except that it uses /usr/bin/ssh instead of
> > /grid2/sge/bin/lx-amd64/qrsh).
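> >
> > For reference, the host.txt file I mention above is just a plain-text
> > list of the allocated hosts, one per line (hypothetical names shown):
> >
> >     nodeA slots=1
> >     nodeB slots=1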
> >
> > Thanks,
> > Vipul
> >
> >
> > qsub -N velsyn -pe orte2 10 -V -b y -cwd -j y -o $cwd/a \
> >     -l "os=redhat6.7*" -q all \
> >     /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --N 1 \
> >     -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib \
> >     -x PATH=$PATH --merge-stderr-to-stdout \
> >     --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy \
> >     -np 5 --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5 \
> >     --mca plm_rsh_no_tree_spawn 1 <application with arguments>
> >
> >
> > $ /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --hostfile host.txt \
> >     -x VMW_HOME=$VMW_HOME -x VMW_BIN=$VMW_BIN \
> >     -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib \
> >     -x PATH=$PATH --merge-stderr-to-stdout \
> >     --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy \
> >     -np 5 --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5 \
> >     --mca plm_rsh_no_tree_spawn 1 <application with arguments>
> >
> > [sox3:24416] [[26562,0],0] plm:rsh: final template argv:
> >        /usr/bin/ssh <template>     set path = (
> /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH ==
> 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv
> LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if (
> $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH
> /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if (
> $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH ==
> 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if (
> $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH
> /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;
>  /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca
> ess_base_jobid "1740767232" -mca ess_base_vpid "<template>" -mca
> ess_base_num_procs "6" -mca orte_node_regex
> "sox[1:3],bos[1:3],bos[2:15],bos[1:9],bos[2:12],bos[1:7]@0(6)" -mca
> orte_hnp_uri "1740767232.0;tcp://147.34.216.21:54496" --mca
> orte_base_help_aggregate "0" --mca plm_base_verbose "5" --mca
> plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename
> "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca pmix
> "^s1,s2,cray,isolated"
> > [sox3:24416] [[26562,0],0] complete_setup on job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command
> from [[26562,0],5]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for
> job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command
> from [[26562,0],4]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for
> job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command
> from [[26562,0],1]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for
> job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command
> from [[26562,0],2]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for
> job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command
> from [[26562,0],3]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for
> job [26562,1]
> >
> > -----Original Message-----
> > From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> > Sent: Monday, June 1, 2020 4:15 PM
> > To: Open MPI User's List <users@lists.open-mpi.org>
> > Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
> > Subject: Re: [OMPI users] Running mpirun with grid
> >
> > On top of what Ralph said, I think that this output is unexpected:
> >
> >> Starting server daemon at host "cod5"Starting server daemon at host
> >> "cod6"Starting server daemon at host "has4"Starting server daemon at
> host "co d4"
> >>
> >>
> >>
> >> Starting server daemon at host "hpb12"Starting server daemon at host
> "has6"
> >>
> >> Server daemon successfully started with task id "1.cod4"
> >> Server daemon successfully started with task id "1.cod5"
> >> Server daemon successfully started with task id "1.cod6"
> >> Server daemon successfully started with task id "1.has6"
> >> Server daemon successfully started with task id "1.hpb12"
> >> Server daemon successfully started with task id "1.has4"
> >
> > I don't think that's coming from Open MPI.
> >
> > My guess is that something is trying to parse (or run?) that output,
> > getting confused because that output is unexpected, and then you get
> > these errors:
> >
> >> Unmatched ".
> >> Unmatched ".
> >> Unmatched ".
> >
> > And the Open MPI helper daemon doesn't actually start.  Therefore you
> get this error:
> >
> >> --------------------------------------------------------------------------
> >> ORTE was unable to reliably start one or more daemons.
> >> This usually is caused by:
> > ...etc.
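> >
> > For what it's worth, "Unmatched "." is what classic csh prints when a
> > double quote is left unbalanced; assuming csh is installed, you can
> > reproduce it with:
> >
> >     csh -c 'echo "hello'
> >
> > That would be consistent with something mangling the (csh-syntax) launch
> > command from the verbose output before the remote shell runs it.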
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
>
>
>
