Re: [gridengine users] jobs randomly die

2019-05-17 Thread Hay, William
> It's a limit being reached, of some sort. Do you have a RQS of any kind (qconf -srqs)? We see this for job-requested, or system set RAM exhaustion (OOM killer, as mentioned 'dmesg -T' on compute nodes often useful), as well as time lim

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Daniel Povey
I have observed apparently random failures when users had GIDs inside the range given by `gid_range` (see below; gid_range should lie outside the range where users have GIDs). But usually this kind of thing would be due to OOM.

qconf -sconf | grep gid_range
gid_range 5-51000

On Tue, M
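The gid_range check described above can be sketched as follows. This is a minimal illustration, not part of Grid Engine: the function names `parse_gid_range`, `gids_in_range`, and `system_gids` are hypothetical, the range is assumed to be a single `low-high` pair, and the example values are made up.

```python
def parse_gid_range(text):
    """Parse a 'low-high' string into an inclusive (low, high) tuple.

    Assumption: the configured value is one 'low-high' pair.
    """
    low_text, high_text = text.strip().split("-", 1)
    low, high = int(low_text), int(high_text)
    if low > high:
        raise ValueError(f"invalid range: {text!r}")
    return low, high


def gids_in_range(gids, gid_range):
    """Return the sorted, de-duplicated GIDs that fall inside gid_range."""
    low, high = gid_range
    return sorted({gid for gid in gids if low <= gid <= high})


def system_gids():
    """Collect the GIDs the local group database enumerates (Unix only).

    Groups served by a directory service may not be listed here.
    """
    import grp
    return [entry.gr_gid for entry in grp.getgrall()]


# Hypothetical values: a range of 20000-20100 and three group IDs.
conflicts = gids_in_range([100, 1000, 20050], parse_gid_range("20000-20100"))
print(conflicts)
```

On an execution host, `gids_in_range(system_gids(), parse_gid_range(value))` with the value reported by `qconf -sconf` would list the group IDs that collide with the configured range; an empty list means no overlap was found among the enumerated groups.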

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Reuti
AFAICS the kill sent by SGE happens after a task has already returned with an error. In this case SGE would use the kill signal to be sure to kill all child processes. Hence the question would be: what was the initial command in the job script, and what output/error did it generate? -- Reuti > Am

Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
-----Original Message-----
From: users-boun...@gridengine.org On Behalf Of hiller
Sent: Tuesday, May 14, 2019 9:52 AM
To: users@gridengine.org
Subject: Re: [gridengine users] jobs randomly die

~> qconf -srqs
No resource quota set found

'dmesg -T' does not give an oom or other weird mess

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Feng Zhang
> > well as time limits reached. What is the whole output from 'qacct -j JOBID'?
> >
> > Cheers,
> > -Hugh
> >
> > -----Original Message-----
> > From: users-boun...@gridengine.org On Behalf Of hiller
> > Sent: Tuesday, May 14, 2019

Re: [gridengine users] jobs randomly die

2019-05-14 Thread hiller
> Sent: Tuesday, May 14, 2019 9:02 AM
> To: users@gridengine.org
> Subject: Re: [gridengine users] jobs randomly die
>
> Hi,
> nope, there are no oom messages in the journal.
> Regards, ulrich
>
>
> On 5/14/19 12:49 PM, Arnau wrote:
>> Hi,
>>
>> _m

Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
'qacct -j JOBID'?

Cheers,
-Hugh

-----Original Message-----
From: users-boun...@gridengine.org On Behalf Of hiller
Sent: Tuesday, May 14, 2019 9:02 AM
To: users@gridengine.org
Subject: Re: [gridengine users] jobs randomly die

Hi,
nope, there are no oom messages in the journal.
Regards,

Re: [gridengine users] jobs randomly die

2019-05-14 Thread hiller
Hi,
nope, there are no oom messages in the journal.
Regards, ulrich

On 5/14/19 12:49 PM, Arnau wrote:
> Hi,
>
> _maybe_ the OOM killer killed the job? A look at the messages will give you an
> answer (I've seen this in my cluster).
>
> HTH,
> Arnau
>
> On Tue, May 14, 2019 at 12:37, hiller (

[gridengine users] jobs randomly die

2019-05-14 Thread hiller
Dear all,
I have a problem: jobs sent to gridengine randomly die.
The gridengine version is 8.1.9.
The OS is openSUSE 15.0.
The gridengine messages file says:
05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
05/13/2019 18:31:46|worker|karun|W|job 635659.1 faile
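When scanning a messages file for entries like the ones quoted above, it can help to split each line into fields. The sketch below assumes a pipe-separated layout of timestamp, component, host, severity letter, and free text, inferred only from the quoted sample; the field names and the `parse_message_line` helper are hypothetical, not part of Grid Engine.

```python
from datetime import datetime
from typing import NamedTuple


class MessageLine(NamedTuple):
    timestamp: datetime
    component: str  # assumed meaning of the second field ("worker")
    host: str       # assumed meaning of the third field ("karun")
    level: str      # single letter such as "E" or "W" in the sample
    text: str


def parse_message_line(line):
    """Split one messages-file line into its five pipe-separated fields."""
    # Split at most four times so any '|' inside the message text survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected line format: {line!r}")
    stamp, component, host, level, text = parts
    # The sample uses month/day/year followed by a 24-hour time.
    timestamp = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return MessageLine(timestamp, component, host, level, text)


sample = ("05/13/2019 18:31:45|worker|karun|E|"
          "master task of job 635659.1 failed - killing job")
entry = parse_message_line(sample)
print(entry.level, entry.timestamp.isoformat(), entry.text)
```

Filtering on `entry.level == "E"` and on a job id inside `entry.text` is one way to collect everything logged for a single failing job before comparing it with the `qacct -j JOBID` output.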