Thanks.

$  qconf -spl
OpenMP
dist
make
mc4
oneper
orte
orte2
perf
run15
run25
run5
run50
thread
turbo
$  qconf -sp orte2
pe_name                orte2
slots                  99999
used_slots             0
bound_slots            0
user_lists             NONE
xuser_lists            NONE
start_proc_args        NONE
stop_proc_args         NONE
per_pe_task_prolog     NONE
per_pe_task_epilog     NONE
allocation_rule        2
control_slaves         TRUE
job_is_first_task      FALSE
urgency_slots          min
accounting_summary     FALSE
daemon_forks_slaves    FALSE
master_forks_slaves    FALSE

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of John Hearns 
via users
Sent: Tuesday, June 2, 2020 2:25 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: John Hearns <hear...@gmail.com>
Subject: Re: [OMPI users] Running mpirun with grid

As a suggestion, can we see the configuration of your Parallel Environment?

qconf -spl

qconf -sp orte2

On Mon, 1 Jun 2020 at 22:20, Ralph Castain via users 
<users@lists.open-mpi.org> wrote:
Afraid I have no real ideas here. Best I can suggest is taking the qrsh cmd 
line from the prior debug output and trying to run it manually. This might give 
you a chance to manipulate it and see what, if anything, is causing it an 
issue. Without mpirun executing, the daemons will bark about being unable to 
connect back, so you might need to use some other test program for this 
purpose.
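
For example, something along these lines as a rough sketch -- copy the exact 
qrsh arguments from your plm_base_verbose output (the flags and host name below 
are only placeholders), with /bin/hostname standing in for orted as the 
harmless test program:

    /grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V cod4 /bin/hostname

If that much works, keep adding back pieces of the real orted command line 
until something breaks.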

I agree with Jeff - you should check to see where these messages are coming 
from:


>> Server daemon successfully started with task id "1.cod4"
>> Server daemon successfully started with task id "1.cod5"
>> Server daemon successfully started with task id "1.cod6"
>> Server daemon successfully started with task id "1.has6"
>> Server daemon successfully started with task id "1.hpb12"
>> Server daemon successfully started with task id "1.has4"
>
>> Unmatched ".
>> Unmatched ".
>> Unmatched ".
>


Could be a clue as to what is actually happening.
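
If it helps to track them down, and assuming standard gridengine tooling, 
something like the following might show which remote-startup commands/daemons 
SGE is configured to use and whether that message text lives in the SGE 
installation (the paths and patterns here are just guesses):

    qconf -sconf | egrep 'rsh_command|rsh_daemon|qlogin_daemon'
    grep -r "Server daemon successfully started" $SGE_ROOT/bin $SGE_ROOT/util 2>/dev/null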


> On Jun 1, 2020, at 1:57 PM, Kulshrestha, Vipul via users 
> <users@lists.open-mpi.org> wrote:
>
> Thank Jeff & Ralph for your responses.
>
> I tried changing the verbose level to 5 using the option suggested by Ralph, 
> but there was no difference in the output (no additional information 
> appeared).
>
> I also tried replacing the grid submission script with a command-line qsub 
> job submission, but got the same issue. Without the job submission script, 
> the qsub command looks like the one below. It uses the mpirun option "--N 1" 
> to ensure that mpirun launches only 1 process on each host.
>
> Do you have some suggestion on how I can go about investigating the root 
> cause of the problem I am facing? I am able to run mpirun successfully if I 
> specify the same set of hosts (as allocated by grid) using an mpirun host 
> file. I have also pasted the verbose output with the host file; the orted 
> command looks very similar to the one generated for grid submission (except 
> that it uses /usr/bin/ssh instead of /grid2/sge/bin/lx-amd64/qrsh).
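>
> For what it's worth, host.txt just lists the grid-allocated hosts, something 
> like the sketch below (host names here are only placeholders):
>
>   # hypothetical host names, one slot per host
>   cod4 slots=1
>   cod5 slots=1
>   cod6 slots=1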
>
> Thanks,
> Vipul
>
>
> qsub -N velsyn -pe orte2 10 -V -b y -cwd -j y -o $cwd/a -l "os=redhat6.7*" -q 
> all /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --N 1  -x 
> LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH 
> --merge-stderr-to-stdout --output-filename 
> ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca 
> orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca 
> plm_rsh_no_tree_spawn 1 <application with arguments>
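>
> (For reference, my reading of the slot arithmetic here, assuming 
> allocation_rule 2 in the orte2 PE means two slots per execution host:
>
>   -pe orte2 10         -> grid allocates 10 slots, 2 per host, i.e. 5 hosts
>   mpirun --N 1 -np 5   -> mpirun places 1 process per host, 5 processes total)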
>
>
> $ /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --hostfile host.txt -x 
> VMW_HOME=$VMW_HOME -x VMW_BIN=$VMW_BIN -x 
> LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH 
> --merge-stderr-to-stdout --output-filename 
> ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca 
> orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca 
> plm_rsh_no_tree_spawn 1 <application with arguments>
>
> [sox3:24416] [[26562,0],0] plm:rsh: final template argv:
>        /usr/bin/ssh <template>     set path = ( 
> /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 
> ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH 
> /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv 
> LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if 
> ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 
> 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
> $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH 
> /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;   
> /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca 
> ess_base_jobid "1740767232" -mca ess_base_vpid "<template>" -mca 
> ess_base_num_procs "6" -mca orte_node_regex 
> "sox[1:3],bos[1:3],bos[2:15],bos[1:9],bos[2:12],bos[1:7]@0(6)" -mca 
> orte_hnp_uri 
> "1740767232.0;tcp://147.34.216.21:54496<http://147.34.216.21:54496>" --mca 
> orte_base_help_aggregate "0" --mca plm_base_verbose "5" --mca 
> plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename 
> "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca pmix "^s1,s2,cray,isolated"
> [sox3:24416] [[26562,0],0] complete_setup on job [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],5]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],4]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],1]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],2]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],3]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> [26562,1]
>
> -----Original Message-----
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Sent: Monday, June 1, 2020 4:15 PM
> To: Open MPI User's List <users@lists.open-mpi.org>
> Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
> Subject: Re: [OMPI users] Running mpirun with grid
>
> On top of what Ralph said, I think that this output is unexpected:
>
>> Starting server daemon at host "cod5"Starting server daemon at host
>> "cod6"Starting server daemon at host "has4"Starting server daemon at host 
>> "co d4"
>>
>>
>>
>> Starting server daemon at host "hpb12"Starting server daemon at host "has6"
>>
>> Server daemon successfully started with task id "1.cod4"
>> Server daemon successfully started with task id "1.cod5"
>> Server daemon successfully started with task id "1.cod6"
>> Server daemon successfully started with task id "1.has6"
>> Server daemon successfully started with task id "1.hpb12"
>> Server daemon successfully started with task id "1.has4"
>
> I don't think that's coming from Open MPI.
>
> My guess is that something is trying to parse (or run?) that output, getting 
> confused because the output is unexpected, and that is when you get these 
> errors:
>
>> Unmatched ".
>> Unmatched ".
>> Unmatched ".
>
> And the Open MPI helper daemon doesn't actually start.  Therefore you get 
> this error:
>
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
> ...etc.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
