Am 25.08.2014 um 13:23 schrieb Pengcheng Wang:

> Hi Reuti,
>
> A simple hello_world program works without the h_vmem limit. Honestly, I am
> not familiar with Open MPI. The commands qconf -spl and qconf -sp ompi give
> the information below.

Thx.

> But strangely, it begins to work after I insert unset SGE_ROOT in my job
> script. I don't know why.

Unsetting this variable will make Open MPI unaware that it runs under SGE. Hence it will use `ssh` to reach the other machines, and these `ssh` calls will then have no memory or time limit set. As you run a singleton this shouldn't matter, though. But: when you want to start additional threads (according to your "#$ -pe ompi* 6"), you should use a PE with allocation rule "$pe_slots", so that all slots which SGE grants to your job are on one and the same machine. SGE will multiply the h_vmem limit by the number of slots, but only by the count granted on the master node of the parallel job (resp. by the count granted on each slave). How the other threads or tasks are started is something you might want to look at.
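For illustration, a PE using "$pe_slots" could look like the sketch below. It just copies your "ompi" PE (quoted further down) with the allocation rule changed; the name "ompi_smp" is made up here and the slot count is only an example:

$ qconf -sp ompi_smp
pe_name            ompi_smp
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min

Requesting "#$ -pe ompi_smp 6" would then keep all six slots on the node where the jobscript runs, and the h_vmem limit would be multiplied by all six slots there.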
> However, it still cannot run smoothly through the 60 hrs I set up. After
> running for about two hours, it stops without any error messages. Is this
> related to the h_vmem limit?

You can have a look in $SGE_ROOT/spool/<exechost>/messages (resp. your actual location of the spool directories) whether any limit was exceeded and triggered an abortion of the job (for all machines granted to this job). Also `qacct -j <job_id>` might give a hint whether there was an exit code of 137 due to a kill -9.
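As a sketch on the command line (fill in the actual job id and execution host; qacct's accounting record contains the fields failed, exit_status and maxvmem):

$ qacct -j <job_id> | grep -E 'failed|exit_status|maxvmem'   # 137 = 128 + 9, i.e. killed by SIGKILL
$ grep <job_id> $SGE_ROOT/spool/<exechost>/messages          # resp. your actual spool location

The maxvmem value also shows how much memory the job actually used compared to the requested h_vmem.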
> $ qconf -spl
> 16per
> 1per
> 2per
> 4per
> hadoop
> make
> ompi
> openmp
>
> $ qconf -sp ompi
> pe_name            ompi
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up

This will allow the slots to be collected from several machines; not necessarily all of them will be on one and the same machine where the jobscript runs.

> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
>
> SGE version: 6.1u6
> Open MPI version: 1.2.9

Both are really old versions. I fear I can't help much here, as many things have changed compared to the current version 1.8.1 of Open MPI, while SGE's latest version is 6.2u5, with SoGE now at 8.1.7.

-- Reuti

> Job script updated:
>
> #$ -S /bin/bash
> #$ -N couple
> #$ -cwd
> #$ -j y
> #$ -R y
> #$ -l h_rt=62:00:00
> #$ -l h_vmem=2G
> #$ -o couple.out
> #$ -e couple.err
> #$ -pe ompi* 8
>
> unset SGE_ROOT
> ./app
>
> Thanks,
> Pengcheng
>
> On Sun, Aug 24, 2014 at 1:00 PM, <users-requ...@open-mpi.org> wrote:
>
> Date: Sat, 23 Aug 2014 18:49:38 +0200
> From: Reuti <re...@staff.uni-marburg.de>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] A daemon on node cl231 failed to start as expected
>
> Hi,
>
> Am 23.08.2014 um 16:09 schrieb Pengcheng Wang:
>
> > I need to run a single driver program that only requires one proc, with the
> > command mpirun -np 1 ./app or ./app. But it will schedule the launch of
> > other executables, including parallel and sequential computations. So I
> > require more than one proc to run it. It runs smoothly as an interactive
> > job with the command below:
> >
> > qrsh -cwd -pe "ompi*" 6 -l h_rt=00:30:00,test=true ./app
> >
> > But after I submitted it as a batch job, a strange error occurred and it
> > stopped. Please find the job script and error message below:
> >
> > job submission script:
> >
> > #$ -S /bin/bash
> > #$ -N couple
> > #$ -cwd
> > #$ -j y
> > #$ -l h_rt=05:00:00
> > #$ -l h_vmem=2G
>
> Is a simple hello_world program listing the threads working? Does it work
> without the h_vmem limit?
>
> > #$ -o couple.out
> > #$ -pe ompi* 6
>
> Which PEs can be addressed here? What are their allocation rules? (It looks
> like you need "$pe_slots".)
>
> What version of SGE?
> What version of Open MPI?
> Compiled with --with-sge?
>
> For me it's working either way.
>
> -- Reuti
>
> > ./app
> >
> > error message:
> >
> > error: executing task of job 6777095 failed:
> > [cl231:23777] ERROR: A daemon on node cl231 failed to start as expected.
> > [cl231:23777] ERROR: There may be more information available from
> > [cl231:23777] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> > [cl231:23777] ERROR: If the problem persists, please restart the
> > [cl231:23777] ERROR: Grid Engine PE job
> > [cl231:23777] ERROR: The daemon exited unexpectedly with status 1.
> >
> > Thanks for any help!
> >
> > Pengcheng
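P.S. Regarding the question "Compiled with --with-sge?" in the quoted message: a quick check on an existing installation is to look for the gridengine components in ompi_info (just a sketch; which frameworks list a gridengine component depends on the Open MPI version):

$ ompi_info | grep gridengine

If nothing is printed, the installation was built without SGE support and mpirun will fall back to ssh for starting its daemons.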