Hi,

On 14.01.2013 at 23:08, Jake Carroll wrote:

> So we tested out hard-setting a different wall-time for the specific
> user who's experiencing the Exit 137 issue. We noticed the jobs are still
> failing, however.

is there any message about the kill signal in the messages file in the
node's spool directory, i.e.:

/opt/gridengine/default/spool/compute-0-4/messages (search for the job id)
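
For example (a minimal sketch; the node path is the one above, and the
qmaster path assumes the standard layout of a default cell):

grep 1325823 /opt/gridengine/default/spool/compute-0-4/messages
grep 1325823 /opt/gridengine/default/spool/qmaster/messages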

-- Reuti


> Here is one job that was killed even with the wall-time setting included.
> Obviously the job did not run for 24h; input and output are shown below.
> 
> --------
> - qsub b5_set11_2.sh
> 
> 
> 
> - b5_set11_2.sh:
> 
> #$ -cwd                # run in the submission directory
> #$ -l h_rt=24:00:00    # hard wall-clock limit of 24 hours
> 
> #$ -l vf=20G           # request 20 GB virtual_free
> #$ -N b5_set11_2       # job name
> #$ -m eas              # mail on end, abort, suspend
> #$ -M someguy@somewhere
> /blah/blah/blah/bayesRsim <b5_set11_2.par
> 
> 
> - cat b5_set11_2.e1325823:
> /opt/gridengine/default/spool/compute-0-4/job_scripts/1325823: line 7:
> 8117 Killed                  /blah/blah/blah/bayesRsim < b5_set11_2.par
> 
> 
> - qacct -j 1325823
> ==============================================================
> qname        medium.q
> hostname     compute-0-4.local
> group        users 
> owner        someguy
> project      NONE  
> department   defaultdepartment
> jobname      b5_set11_2
> jobnumber    1325823
> taskid       undefined
> account      sge   
> priority     0     
> qsub_time    Mon Jan 14 15:36:49 2013
> start_time   Mon Jan 14 15:36:55 2013
> end_time     Mon Jan 14 18:11:56 2013
> granted_pe   NONE  
> slots        1     
> failed       0    
> exit_status  137   
> ru_wallclock 9301  
> ru_utime     9262.906
> ru_stime     7.916 
> ru_maxrss    13820636
> ru_ixrss     0     
> ru_ismrss    0     
> ru_idrss     0     
> ru_isrss     0     
> ru_minflt    46056 
> ru_majflt    26    
> ru_nswap     0     
> ru_inblock   392840
> ru_oublock   32    
> ru_msgsnd    0     
> ru_msgrcv    0     
> ru_nsignals  0     
> ru_nvcsw     536   
> ru_nivcsw    30791 
> cpu          9270.822
> mem          61688.906
> io           0.430 
> iow          0.000 
> maxvmem      13.302G
> arid         undefined
> 
> So, you mentioned the "default time limit of your shell". My googling
> suggested setting a wall-time limit, or having the user specify the wall
> time, but that did not help. A few searches show the use of a global time
> limit for jobs in general, but make no reference to a default time limit
> of the shell. Am I supposed to be looking at limits such as s_rt and h_rt?
> If so, how do I manipulate these for the specific user? The queue_conf man
> page makes some reference to this, but it doesn't explain explicitly how
> to manipulate them globally or on a per-user basis, nor what the defaults
> or the "shell" have to do with it.
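> 
> For instance, is it something like this (a sketch pieced together from the
> man pages; medium.q is just our queue, and ~/.sge_request is my guess at
> the per-user mechanism)?
> 
> qconf -sq medium.q | egrep 'h_rt|s_rt'   # show the queue's run-time limits
> qconf -mq medium.q                       # edit them interactively
> 
> # per user: qsub also reads ~/.sge_request for default options
> echo '-l h_rt=24:00:00' >> ~/.sge_request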
> 
> Sorry - just stumbling through this and not finding it too intuitive.
> 
> 
> --JC
> 
> 
> 
> 
> On 14/01/13 10:34 AM, "Ron Chen" <[email protected]> wrote:
> 
>> Exit code 137 = 128 + 9, i.e. the process was killed with SIGKILL, which
>> usually means it exceeded a limit. Google is your best friend if you have
>> similar issues - and the solution is to check the default time limit of
>> your shell.
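>> 
>> A quick way to see where 137 comes from (plain shell, nothing SGE-specific):
>> 
>> sleep 60 &
>> kill -9 $!         # SIGKILL the background job
>> wait $!; echo $?   # prints 137 = 128 + 9 (SIGKILL)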
>> 
>> -Ron
>> 
>> 
>> ************************************************************************
>> 
>> Open Grid Scheduler - the official open source Grid Engine:
>> http://gridscheduler.sourceforge.net/
>> 
>> ________________________________
>> From: Jake Carroll <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Sunday, January 13, 2013 6:56 PM
>> Subject: [gridengine users] Error 137 - trying to figure out what it
>> means.
>> 
>> 
>> Hi all.
>> 
>> We're trying to figure out the answer to a problem that is escaping us.
>> We can usually solve most of these issues ourselves, but this one we're
>> having trouble trapping, and we can't find any solid answers after a lot
>> of looking around at online resources.
>> 
>> One of our quite capable users [read: he rarely needs our help with grid
>> engine] has an unusual issue with certain jobs crashing out, seemingly at
>> random, with error 137. The code is predominantly C++ based, running
>> atop SGE 6.2u5 on the ROCKS cluster platform. What is making it hard for
>> us is that these array-based jobs (no PEs/parallel environments and no
>> MPI/MPICH explicitly in use) crash only sometimes - some runs and not
>> others. It seems almost quasi-random.
>> 
>> The code is written in Fortran compiled with Intel's ifort, using standard
>> code optimisation (compile flag -O2). However, the code was also compiled
>> with optimisation turned off and traceback and error reporting turned on,
>> and in both cases the programs failed and no run-time error was printed.
>> The same code was also compiled with gfortran and likewise produced exit
>> status 137.
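>> 
>> For reference, the two builds described would look roughly like this (the
>> source file name and the exact error-reporting switch are assumptions;
>> -O2, -O0, -g, -traceback and -check are standard ifort options):
>> 
>> ifort -O2 bayesRsim.f90 -o bayesRsim                           # optimised
>> ifort -O0 -g -traceback -check all bayesRsim.f90 -o bayesRsim  # debug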
>> 
>> The code has run successfully numerous times, but it does something
>> slightly different each time due to random sampling and different model
>> specifications. There are 20 jobs because the analyses are run across 20
>> replicates of a simulation. Previously our user had no problems running
>> these 20 replicates across 11 different models (20x11=220 runs).
>> 
>> Some specifics:
>> 
>> Array job; memory allocation is 20GB, and the job uses less than 14GB.
>> 
>> Submitted through a shell script, qsub test.sh, where test.sh looks like:
>> 
>> -------------------------------------------------------
>> #$ -cwd 
>> #$ -l vf=20G
>> #$ -N b1_set12_1
>> #$ -m eas
>> #$ -M [email protected]
>> /path/to/some/stuff/here/bayesRsim <b1_set12_1.par
>> 
>> -------------------------------------------------------
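>> 
>> As an array job the same script would just gain a task directive, roughly
>> like this (the range and the parameter-file naming are assumptions based
>> on the 20 replicates):
>> 
>> #$ -t 1-20    # one task per replicate
>> /path/to/some/stuff/here/bayesRsim <b1_set12_${SGE_TASK_ID}.par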
>> 
>> Intel's default is static linking, from what we understand; in any case
>> no external libraries are used (although Intel uses its own MKL library).
>> 
>> 
>> We can't see any obvious memory starvation issues or resource contention
>> problems. Do you have any suggestions on things we could look at to trap
>> this? The error 137 material online, after looking around a little, seems
>> sparse at best.
>> 
>> Any help would be appreciated.
>> 
>> --JC
> 
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
