Hi Reuti, Thanks a lot for your help.
The 'openmp' PE in our clusters has the allocation rule '$pe_slots'. But it seems I can only use a limited number of slots for my job under this PE. The command 'qacct -j jobID' gives the information below; it turns out the job probably exceeded its memory allocation and was killed (exit_status 137 corresponds to a kill -9, as you suggested). After setting a larger h_vmem (5G), it works now.

$ qacct -j jobID
...
failed       100 : assumedly after job
exit_status  137
...
maxvmem      4.003G

However, the number of slots my job can use is still limited. For example, on one cluster the job runs for a few seconds with 10 slots; then its state (qstat) changes to 'dr' and the job is deleted without any error message. On another cluster, the error below appears whenever I request more than 8 slots:

[cl093:30366] mca_btl_mx_init: mx_open_endpoint() failed with status=20
--------------------------------------------------------------------------
[0,3,0]: Myrinet/MX on host cl093 was unable to find any endpoints.
Another transport will be used instead, although this may result in
lower performance.

For now it works with 8 slots, especially when those 8 slots happen to land on the same machine, which gives the job a larger total virtual memory limit. It would be better if it could run with more slots to save computation time.

Best regards,
Pengcheng

On Mon, Aug 25, 2014 at 1:00 PM, <users-requ...@open-mpi.org> wrote: > Send users mailing list submissions to > us...@open-mpi.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.open-mpi.org/mailman/listinfo.cgi/users > or, via email, send a message with subject or body 'help' to > users-requ...@open-mpi.org > > You can reach the person managing the list at > users-ow...@open-mpi.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of users digest..." > > > Today's Topics: > > 1. Re: Running a hybrid MPI+openMP program (Reuti) > 2. Re: A daemon on node cl231 failed to start as expected > (Pengcheng) (Pengcheng Wang) > 3. Re: A daemon on node cl231 failed to start as expected > (Pengcheng) (Reuti) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 25 Aug 2014 11:51:35 +0200 > From: Reuti <re...@staff.uni-marburg.de> > To: Open MPI Users <us...@open-mpi.org> > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program > Message-ID: > <9eae85f0-5479-45af-a8f1-14519216b...@staff.uni-marburg.de> > Content-Type: text/plain; charset=us-ascii > > Am 21.08.2014 um 16:50 schrieb Reuti: > > > Am 21.08.2014 um 16:00 schrieb Ralph Castain: > > > >> > >> On Aug 21, 2014, at 6:54 AM, Reuti <re...@staff.uni-marburg.de> wrote: > >> > >>> Am 21.08.2014 um 15:45 schrieb Ralph Castain: > >>> > >>>> On Aug 21, 2014, at 2:51 AM, Reuti <re...@staff.uni-marburg.de> > wrote: > >>>> > >>>>> Am 20.08.2014 um 23:16 schrieb Ralph Castain: > >>>>> > >>>>>> > >>>>>> On Aug 20, 2014, at 11:16 AM, Reuti <re...@staff.uni-marburg.de> > wrote: > >>>>>> > >>>>>>> Am 20.08.2014 um 19:05 schrieb Ralph Castain: > >>>>>>> > >>>>>>>>> <snip> > >>>>>>>>> Aha, this is quite interesting - how do you do this: scanning > the /proc/<pid>/status or alike? What happens if you don't find enough free > cores as they are used up by other applications already? > >>>>>>>>> > >>>>>>>> > >>>>>>>> Remember, when you use mpirun to launch, we launch our own > daemons using the native launcher (e.g., qsub). So the external RM will > bind our daemons to the specified cores on each node. 
We use hwloc to > determine what cores our daemons are bound to, and then bind our own child > processes to cores within that range. > >>>>>>> > >>>>>>> Thx for reminding me of this. Indeed, I mixed up two different > aspects in this discussion. > >>>>>>> > >>>>>>> a) What will happen in case no binding was done by the RM (hence > Open MPI could use all cores) and two Open MPI jobs (or something > completely different besides one Open MPI job) are running on the same node > (due to the Tight Integration with two different Open MPI directories in > /tmp and two `orted`, unique for each job)? Will the second Open MPI job > know what the first Open MPI job used up already? Or will both use the same > set of cores as "-bind-to none" can't be set in the given `mpiexec` command > because of "-map-by slot:pe=$OMP_NUM_THREADS" was used - which triggers > "-bind-to core" indispensable and can't be switched off? I see the same > cores being used for both jobs. > >>>>>> > >>>>>> Yeah, each mpirun executes completely independently of the other, > so they have no idea what the other is doing. So the cores will be > overloaded. Multi-pe's requires bind-to-core otherwise there is no way to > implement the request > >>>>> > >>>>> Yep, and so it's no option in a mixed cluster. Why would it hurt to > allow "-bind-to none" here? > >>>> > >>>> Guess I'm confused here - what does pe=N mean if we bind-to none?? If > you are running on a mixed cluster and don't want binding, then just say > bind-to none and leave the pe argument out entirely as it wouldn't mean > anything unless you are bound > >>> > >>> I would mean: divide the overall number of slots/cores in the > machinefile by N (i.e. $OMP_NUM_THREADS). > >>> > >>> - Request made to the queuing system: I need 80 cores in total. > >>> - The machinefile will contain 80 cores > >>> - Open MPI will divide it by N, i.e. 8 here > >>> - Open MPI will start only 10 processes, one on each node > >>> - The application will use 8 threads per started MPI process > >> > >> I see - so you were talking about the case where the user doesn't > provide the -np N option > > > > Yes. Even if -np is specified: AFAICS Open MPI fills up the given slots > in the machinefile from the beginning (first nodes get all the processes, > remaining nodes are free). Making it in a round-robin way would work better > for this case. > > Could this be an option which include all cases: > > >> and we need to compute the number of procs to start. Okay, the change > you requested below will fix that one too. I can make that easily enough. > > > > Therefore I wanted to start a discussion about it (at that time I wasn't > aware of the "-map-by slot:pe=N" option), as I have no final syntax which > would cover all cases. Someone may want the binding by the "-map-by > slot:pe=N". How can this be specified, while keeping an easy > tight-integration for users who don't want any binding at all. 
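For illustration, a minimal sketch of the hybrid launch being discussed, for SGE plus the Open MPI 1.8 series. The PE name "orte" is only a placeholder, $NSLOTS is the slot count SGE exports to the job script, and whether you actually want the binding that pe=N implies is exactly the open question above:

#$ -pe orte 80
export OMP_NUM_THREADS=8
# start NSLOTS/OMP_NUM_THREADS MPI ranks, each mapped to (and bound across) 8 cores,
# with 8 OpenMP threads per rank
mpirun -np $((NSLOTS / OMP_NUM_THREADS)) -map-by slot:pe=$OMP_NUM_THREADS ./mpihello
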
> > > > The boundary conditions are: > > > > - the job is running inside a queuingsystem > > - the user requests the overall amount of slots to the queuingsystem > > - hence the machinefile has entries for all slots > > - the user sets OMP_NUM_THREADS > > - $OMP_NUM_THREADS set > - -np set on the command line > > => change from fill-up to one per OMP_NUM_THREADS on each machine > (including more than one) > > Using both as a trigger, it wouldn't touch case 2), which I don't want to > be removed of course (as dividing all the time if $OMP_NUM_THREADS is used > would do) > > -- Reuti > > > > case 1) no interest in any binding, other jobs may exist on the nodes > > > > case 2) user wants binding: i.e. $OMP_NUM_THREADS cores assigned to each > MPI process, maybe with "-map-by slot:pe=N" > > > > In both cases only (overall amount of slots) / ($OMP_NUM_THREADS) MPI > processes should be started, not (overall amount of slots) processes AFAICS. > > > > -- Reuti > > > > > >>> -- Reuti > >>> > >>> > >>>>> > >>>>> > >>>>>>> Altering the machinefile instead: the processes are not bound to > any core, and the OS takes care of a proper assignment. > >>>>> > >>>>> Here the ordinary user has to mangle the hostfile, this is not good > (but allows several jobs per node as the OS shift the processes around). > Could/should it be put into the "gridengine" module in OpenMPI, to divide > the slot count per node automatically when $OMP_NUM_THREADS is found, or > generate an error if it's not divisible? > >>>> > >>>> Sure, that could be done - but it will only have if OMP_NUM_THREADS > is set when someone spins off threads. So far as I know, that's only used > for OpenMP - so we'd get a little help, but it wouldn't be full coverage. > >>>> > >>>> > >>>>> > >>>>> === > >>>>> > >>>>>>>> If the cores we are bound to are the same on each node, then we > will do this with no further instruction. However, if the cores are > different on the individual nodes, then you need to add --hetero-nodes to > your command line (as the nodes appear to be heterogeneous to us). > >>>>>>> > >>>>>>> b) Aha, it's not about different type CPU types, but also same CPU > type but different allocations between the nodes? It's not in the `mpiexec` > man-page of 1.8.1 though. I'll have a look at it. > >>>>> > >>>>> I tried: > >>>>> > >>>>> $ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q > parallel@node0[1-4] test_openmpi.sh > >>>>> Your job 247109 ("test_openmpi.sh") has been submitted > >>>>> $ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q > parallel@node0[1-4] test_openmpi.sh > >>>>> Your job 247110 ("test_openmpi.sh") has been submitted > >>>>> > >>>>> > >>>>> Getting on node03: > >>>>> > >>>>> > >>>>> 6733 ? Sl 0:00 \_ sge_shepherd-247109 -bg > >>>>> 6734 ? SNs 0:00 | \_ > /usr/sge/utilbin/lx24-amd64/qrsh_starter > /var/spool/sge/node03/active_jobs/247109.1/1.node03 > >>>>> 6741 ? SN 0:00 | \_ orted -mca orte_hetero_nodes 1 > -mca ess env -mca orte_ess_jobid 1493303296 -mca orte_ess_vpid > >>>>> 6742 ? RNl 0:31 | \_ ./mpihello > >>>>> 6745 ? Sl 0:00 \_ sge_shepherd-247110 -bg > >>>>> 6746 ? SNs 0:00 \_ > /usr/sge/utilbin/lx24-amd64/qrsh_starter > /var/spool/sge/node03/active_jobs/247110.1/1.node03 > >>>>> 6753 ? SN 0:00 \_ orted -mca orte_hetero_nodes 1 > -mca ess env -mca orte_ess_jobid 1506607104 -mca orte_ess_vpid > >>>>> 6754 ? 
RNl 0:25 \_ ./mpihello > >>>>> > >>>>> > >>>>> reuti@node03:~> cat /proc/6741/status | grep Cpus_ > >>>>> Cpus_allowed: > > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003 > >>>>> Cpus_allowed_list: 0-1 > >>>>> reuti@node03:~> cat /proc/6753/status | grep Cpus_ > >>>>> Cpus_allowed: > > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000030 > >>>>> Cpus_allowed_list: 4-5 > >>>>> > >>>>> Hence, "orted" got two cores assigned for each of them. But: > >>>>> > >>>>> > >>>>> reuti@node03:~> cat /proc/6742/status | grep Cpus_ > >>>>> Cpus_allowed: > > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003 > >>>>> Cpus_allowed_list: 0-1 > >>>>> reuti@node03:~> cat /proc/6754/status | grep Cpus_ > >>>>> Cpus_allowed: > > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003 > >>>>> Cpus_allowed_list: 0-1 > >>>>> > >>>>> What I see here (and in `top` + pressing "1") that only two cores > are used, and Open MPI assigns 0-1 to both jobs. The information in > "status" is not the one OpenMPI gets from hwloc? > >>>>> > >>>>> -- Reuti > >>>>> > >>>>> > >>>>>> The man page is probably a little out-of-date in this area - but > yes, --hetero-nodes is required for *any* difference in the way the nodes > appear to us (cpus, slot assignments, etc.). The 1.9 series may remove that > requirement - still looking at it. > >>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> So it is up to the RM to set the constraint - we just live within > it. > >>>>>>> > >>>>>>> Fine. 
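As a quick cross-check of what each launched process really inherited — the kernel's view from /proc alongside hwloc's view — something along these lines can be run as the job itself (a sketch; it assumes the hwloc command-line tools are installed on the nodes):

mpirun -np 2 -map-by slot:pe=$OMP_NUM_THREADS \
    sh -c 'echo "$(hostname) pid $$: hwloc=$(hwloc-bind --get) /proc=$(grep Cpus_allowed_list /proc/self/status)"'
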
> >>>>>>> > >>>>>>> -- Reuti > >>>>>>> _______________________________________________ > >>>>>>> users mailing list > >>>>>>> us...@open-mpi.org > >>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>>>>> Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25097.php > >>>>>> > >>>>>> _______________________________________________ > >>>>>> users mailing list > >>>>>> us...@open-mpi.org > >>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>>>> Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25098.php > >>>>> > >>>>> _______________________________________________ > >>>>> users mailing list > >>>>> us...@open-mpi.org > >>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>>> Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25106.php > >>>> > >>>> _______________________________________________ > >>>> users mailing list > >>>> us...@open-mpi.org > >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>> Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25111.php > >>> > >>> _______________________________________________ > >>> users mailing list > >>> us...@open-mpi.org > >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >>> Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25112.php > >> > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >> Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25113.php > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25114.php > > > > ------------------------------ > > Message: 2 > Date: Mon, 25 Aug 2014 08:23:47 -0300 > From: Pengcheng Wang <wpc...@gmail.com> > To: us...@open-mpi.org > Subject: Re: [OMPI users] A daemon on node cl231 failed to start as > expected (Pengcheng) > Message-ID: > <CAPdTcQhcsLhRoeowmC9RwhYGB2--JL0Zo2Ccj= > had1s8l9l...@mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > Hi Reuti, > > A simple hello_world program works without the h_vmem limit. Honestly, I am > not familiar with Open MPI. The command qconf -spl and qconf -sp ompi give > the information below. But strangely, it begins to work after I insert > *unset > SGE_ROOT* in my job script. I don't know why. > > However, it still cannot work smoothly through 60hrs I setup. After running > for about two hours, it stops without any error messages. Is this related > to the h_vemem limit? 
> > *$ qconf -spl* > 16per > 1per > 2per > 4per > hadoop > make > ompi > openmp > > *$ qconf -sp ompi* > pe_name ompi > slots 9999 > user_lists NONE > xuser_lists NONE > start_proc_args /bin/true > stop_proc_args /bin/true > allocation_rule $fill_up > control_slaves TRUE > job_is_first_task FALSE > urgency_slots min > > SGE version: 6.1u6 > Open MPI version: 1.2.9 > > *Job script updated:* > #$ -S /bin/bash > #$ -N couple > #$ -cwd > #$ -j y > #$ -R y > #$ -l h_rt=62:00:00 > #$ -l h_vmem=2G > #$ -o couple.out > #$ -e couple.err > #$ -pe ompi* 8 > *unset SGE_ROOT* > ./app > > Thanks, > Pengcheng > > On Sun, Aug 24, 2014 at 1:00 PM, <users-requ...@open-mpi.org> wrote: > > > Send users mailing list submissions to > > us...@open-mpi.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > or, via email, send a message with subject or body 'help' to > > users-requ...@open-mpi.org > > > > You can reach the person managing the list at > > users-ow...@open-mpi.org > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of users digest..." > > > > > > Today's Topics: > > > > 1. Re: A daemon on node cl231 failed to start as expected (Reuti) > > > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > Date: Sat, 23 Aug 2014 18:49:38 +0200 > > From: Reuti <re...@staff.uni-marburg.de> > > To: Open MPI Users <us...@open-mpi.org> > > Subject: Re: [OMPI users] A daemon on node cl231 failed to start as > > expected > > Message-ID: > > <8f21a4d9-9e8d-4e20-9ae6-04a495a33...@staff.uni-marburg.de> > > Content-Type: text/plain; charset=windows-1252 > > > > Hi, > > > > Am 23.08.2014 um 16:09 schrieb Pengcheng Wang: > > > > > I need to run a single driver program that only require one proc with > > the command mpirun -np 1 ./app or ./app. But it will schedule the launch > of > > other executable files including parallel and sequential computing. So I > > require more than one proc to run it. It can be run smoothly as an > > interactive job with the command below. > > > > > > qrsh -cwd -pe "ompi*" 6 -l h_rt=00:30:00,test=true ./app > > > > > > But after I submitted the job, a strange error occurred and it > > stopped... Please find the job script and error message below: > > > > > > ? job submission script: > > > #$ -S /bin/bash > > > #$ -N couple > > > #$ -cwd > > > #$ -j y > > > #$ -l h_rt=05:00:00 > > > #$ -l h_vmem=2G > > > > Is a simple hello_world program listing the threads working? Does it work > > without the h_vmem limit? > > > > > > > #$ -o couple.out > > > #$ -pe ompi* 6 > > > > Which PEs can be addressed here? What are their allocation rules (looks > > like you need "$pe_slots"). > > > > What version of SGE? > > What version of Open MPI? > > Compiled with --with-sge? > > > > For me it's working in either way. > > > > -- Reuti > > > > > > > ./app > > > > > > error message: > > > error: executing task of job 6777095 failed: > > > [cl231:23777] ERROR: A daemon on node cl231 failed to start as > expected. > > > [cl231:23777] ERROR: There may be more information available from > > > [cl231:23777] ERROR: the 'qstat -t' command on the Grid Engine tasks. > > > [cl231:23777] ERROR: If the problem persists, please restart the > > > [cl231:23777] ERROR: Grid Engine PE job > > > [cl231:23777] ERROR: The daemon exited unexpectedly with status 1. > > > > > > Thanks for any help! 
> > > > > > Pengcheng > > > _______________________________________________ > > > users mailing list > > > us...@open-mpi.org > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > > > Link to this post: > > http://www.open-mpi.org/community/lists/users/2014/08/25141.php > > > > > > > > ------------------------------ > > > > Subject: Digest Footer > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > ------------------------------ > > > > End of users Digest, Vol 2966, Issue 1 > > ************************************** > -------------- next part -------------- > HTML attachment scrubbed and removed > > ------------------------------ > > Message: 3 > Date: Mon, 25 Aug 2014 14:16:35 +0200 > From: Reuti <re...@staff.uni-marburg.de> > To: Open MPI Users <us...@open-mpi.org> > Subject: Re: [OMPI users] A daemon on node cl231 failed to start as > expected (Pengcheng) > Message-ID: > <e4e52447-76db-4564-b20a-cb42eeb34...@staff.uni-marburg.de> > Content-Type: text/plain; charset=us-ascii > > Am 25.08.2014 um 13:23 schrieb Pengcheng Wang: > > > Hi Reuti, > > > > A simple hello_world program works without the h_vmem limit. Honestly, I > am not familiar with Open MPI. The command qconf -spl and qconf -sp ompi > give the information below. > > Thx. > > > > But strangely, it begins to work after I insert unset SGE_ROOT in my job > script. I don't know why. > > Unsetting this variable will make Open MPI unaware that it runs under SGE. > Hence it will use `ssh` to reach other machines. These `ssh` calls will > have no memory or time limit set then. > > As you run a singleton this shouldn't matter though. But: when you want to > start additional threads (according to your "#$ -pe ompi* 6") you should > use a PE with allocation rule "$pe_slots" so that all slots which SGE > grants to your task are on one and the same machine. > > SGE will multiply the limit with the number of slots, but only with the > count granted on the master node of the parallel job (resp. for each > slave). How the other treads or tasks started is something you might look > at. > > > > However, it still cannot work smoothly through 60hrs I setup. After > running for about two hours, it stops without any error messages. Is this > related to the h_vemem limit? > > You can have a look in $SGE_ROOT/spool/<exechost>/messages (resp. your > actual location of the spool directories) whether any limit was passed and > triggered an abortion of the job (for all granted machines for this job). > Also `qacct -j <job_id>` might give some hint whether the was an exitcode > of 137 due to a kill -9. > > > > $ qconf -spl > > 16per > > 1per > > 2per > > 4per > > hadoop > > make > > ompi > > openmp > > > > $ qconf -sp ompi > > pe_name ompi > > slots 9999 > > user_lists NONE > > xuser_lists NONE > > start_proc_args /bin/true > > stop_proc_args /bin/true > > allocation_rule $fill_up > > This will allow to collect the slots from several machines, not > necessarily all will be on one and the same machine where the jobscript > runs. > > > > control_slaves TRUE > > job_is_first_task FALSE > > urgency_slots min > > > > SGE version: 6.1u6 > > Open MPI version: 1.2.9 > > Both are really old versions. I fear I can't help much here as many things > changed compared to the actual version 1.8.1 of Open MPI, while also SGE's > latest version is 6.2u5 with SoGE being now at 8.1.7. 
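To make the two suggestions above concrete — keep all slots on one host via a $pe_slots PE and size h_vmem per slot — a submission along these lines should give the job enough headroom. This is a sketch only: the 'openmp' PE with the $pe_slots rule is the one mentioned at the top of this mail, and with 8 slots granted on one host SGE allows about 8 x 2G = 16G in total, well above the 4.003G maxvmem reported by qacct:

#$ -S /bin/bash
#$ -cwd
#$ -l h_rt=62:00:00
#$ -l h_vmem=2G       # per slot; multiplied by the slots granted on the master host
#$ -pe openmp 8       # $pe_slots rule: all 8 slots on the same machine
./app
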
> > -- Reuti > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ------------------------------ > > End of users Digest, Vol 2967, Issue 1 > ************************************** >