Re: [gridengine users] integrate BLCR and SGE

2012-07-09 Thread mahbube rustaee
On Mon, Jul 9, 2012 at 3:58 AM, Reuti wrote: > Am 08.07.2012 um 06:01 schrieb mahbube rustaee: > > > > > > > On Sat, Jul 7, 2012 at 6:02 PM, Reuti > wrote: > > Am 07.07.2012 um 15:18 schrieb mahbube rustaee: > > > > > On Wed, Jul 4, 2012 at 2:23 PM, Reuti > wrote: > > > Am 04.07.2012 um 05:59 s

[gridengine users] Default Shell bash not always found

2012-07-09 Thread Joseph A. Farran
Hello. I have a cluster with Rocks 5.4.3 (SL) and I believe this is a Rocks issue, but I am not sure. My "sge_request" sets the default shell with: -S /bin/bash. My OGE scripts that start with "#!/bin/bash" work on some compute

Re: [gridengine users] integrate BLCR and SGE

2012-07-09 Thread mahbube rustaee
On Mon, Jul 9, 2012 at 4:59 PM, Reuti wrote: > Am 09.07.2012 um 11:53 schrieb mahbube rustaee: > > > Yes, I mean that the state is "Rr" forever. The job is not completed from SGE's point of view! > > Aha, I got it the wrong way round, sorry. > > How does the job complete - a normal exit from the script? What's its > state

Re: [gridengine users] Stop executing jobs on job error?

2012-07-09 Thread Rayson Ho
1) There's the "exit_status" file in the job's spool directory, and you can just go to the active_jobs directory in the execd's spool, and in there you will find a subdirectory for each job. So you can just parse the file to get the exit status of the job. Note that you can do this with practically

Re: [gridengine users] Stop executing jobs on job error?

2012-07-09 Thread David Erickson
Hi Rayson- Looks like I am using 6.2u4-2ubuntu1. Thanks for the pointer on epilog scripts, I was unaware of them. Is there any documentation on what environment variables are available to them? And can you clarify what you mean by parsing the exit status file directly? I'm aware of using qacct

Re: [gridengine users] Stop executing jobs on job error?

2012-07-09 Thread Rayson Ho
If you are using Open Grid Scheduler/Grid Engine 2011.11 or later, then there is the $SGE_JOBEXIT_STAT variable set in the epilog, so you can check the exit status of the job. And then in the epilog, you can close the queue or in fact do whatever you want - like disable scheduling new jobs, etc. T
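The epilog approach described above can be sketched roughly as follows. This is a minimal illustration, not a tested epilog: it assumes the `SGE_JOBEXIT_STAT` variable mentioned in the message is present in the epilog's environment, that the queue name is handed to the script by the caller (how it gets there is an assumption), and that disabling the queue with `qmod -d` is the desired reaction. The function name `queue_action` is hypothetical.

```python
def queue_action(exit_status, queue_name):
    """Decide what to do after a job, based on its exit status.

    exit_status: the string found in SGE_JOBEXIT_STAT, or None if unset.
    queue_name:  the queue to disable on failure (supplied by the caller).
    Returns the command as a list, or None when nothing should be done.
    """
    if exit_status is None:
        # Variable not set (e.g. older Grid Engine versions): do nothing.
        return None
    try:
        code = int(exit_status)
    except ValueError:
        # Unexpected content: do not guess, leave the queue alone.
        return None
    if code == 0:
        # Job succeeded, so the next job may run.
        return None
    # Job failed: disable the queue so no further jobs are started.
    return ["qmod", "-d", queue_name]
```

An epilog script could call `queue_action(os.environ.get("SGE_JOBEXIT_STAT"), queue)` and pass a non-None result to `subprocess.call`. Keeping the decision in a pure function makes it easy to check without a running cluster.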

Re: [gridengine users] Stop executing jobs on job error?

2012-07-09 Thread David Erickson
On Mon, Jul 9, 2012 at 3:43 PM, Reuti wrote: > Am 10.07.2012 um 00:24 schrieb David Erickson: > >> Hi I have configured GE with a single queue and a single slot, jobs >> submitted are actually controlling a single fixed resource and must be >> run serially. Occasionally there is a possibility tha

Re: [gridengine users] Stop executing jobs on job error?

2012-07-09 Thread Reuti
Am 10.07.2012 um 00:24 schrieb David Erickson: > Hi I have configured GE with a single queue and a single slot, jobs > submitted are actually controlling a single fixed resource and must be > run serially. Occasionally there is a possibility that a job will > error out Error out in what way - ju

[gridengine users] Stop executing jobs on job error?

2012-07-09 Thread David Erickson
Hi, I have configured GE with a single queue and a single slot; the jobs submitted control a single fixed resource and must be run serially. Occasionally a job may error out due to a hardware or other problem on the fixed resource. Is there some way to set

Re: [gridengine users] Java out of memory errors with gridengine on Ubuntu

2012-07-09 Thread Peter van Heusden
Thanks for the replies I got off list. Since the problem seems to come from default heap limits set by Java, my solution has been to set an alias java='java -Xmx256m' cluster-wide. My experimentation has revealed that Java honours the last -Xmx option, so users can set their own -Xmx without any d
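The "last -Xmx wins" behaviour reported above can be modelled with a small sketch. It only mirrors what the poster observed experimentally and is not a statement about the JVM specification; the helper name `effective_xmx` and the example arguments are made up for illustration, and the sketch does not distinguish JVM options from application arguments.

```python
def effective_xmx(args):
    """Return the value of the last -Xmx option in args, or None if absent."""
    value = None
    for arg in args:
        if arg.startswith("-Xmx"):
            # A later -Xmx overrides any earlier one.
            value = arg[len("-Xmx"):]
    return value


# Options contributed by the cluster-wide alias come first ...
alias_args = ["-Xmx256m"]
# ... followed by whatever the user typed after "java".
user_args = ["-Xmx2g", "-jar", "app.jar"]

print(effective_xmx(alias_args + user_args))  # prints 2g
```

Under this model the alias supplies a default heap limit, while a user who passes their own -Xmx still gets the value they asked for.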

Re: [gridengine users] Using Galaxy with SGE: Job output not returned from Cluster

2012-07-09 Thread Reuti
Hi, Am 09.07.2012 um 17:37 schrieb Sascha Kastens: > I have a very special problem and posted it to the Galaxy mailing list > already. > Unfortunately I have not received any response yet. So this list is my last > hope. > > Maybe I can find somebody in this list who uses Galaxy too or at lea

[gridengine users] Using Galaxy with SGE: Job output not returned from Cluster

2012-07-09 Thread Sascha Kastens
Hi all, I have a very special problem and posted it to the Galaxy mailing list already. Unfortunately I have not received any response yet. So this list is my last hope. Maybe I can find somebody on this list who uses Galaxy too or at least knows where I can look. Thank you for y

Re: [gridengine users] Java out of memory errors with gridengine on Ubuntu

2012-07-09 Thread Mazouzi
Hi, It's about virtual memory and Java heap size. Check this: http://stackoverflow.com/questions/561245/virtual-memory-usage-from-java-under-linux-too-much-memory-used On Mon, Jul 9, 2012 at 5:05 PM, Rayson Ho wrote: > See this blog post I wrote back in 2005: > > > http://web.archive.org/we

Re: [gridengine users] Java out of memory errors with gridengine on Ubuntu

2012-07-09 Thread Rayson Ho
See this blog post I wrote back in 2005: http://web.archive.org/web/20051219011530/http://gridengine.info/articles/2005/10/10/unlimited-stack-limit-on-solaris Rayson On Mon, Jul 9, 2012 at 10:41 AM, Peter van Heusden wrote: > Hi there > > I'm using gridengine 6.2u5-4 on Ubuntu 12.04. I've se

Re: [gridengine users] execd load sensors timing

2012-07-09 Thread Reuti
Am 09.07.2012 um 16:16 schrieb William Hay: >> >> That the values are still reported from the last run in `qhost -F ...`. But >> when the reboot is taking only a few minutes the load sensor would report >> the same value as before. Or do you upgrade the OS in just a load_report >> interval, so

Re: [gridengine users] execd load sensors timing

2012-07-09 Thread William Hay
On 9 July 2012 15:56, Reuti wrote: > Am 09.07.2012 um 16:16 schrieb William Hay: > >>> >>> That the values are still reported from the last run in `qhost -F ...`. But >>> when the reboot is taking only a few minutes the load sensor would report >>> the same value as before. Or do you upgrade th

[gridengine users] Java out of memory errors with gridengine on Ubuntu

2012-07-09 Thread Peter van Heusden
Hi there, I'm using gridengine 6.2u5-4 on Ubuntu 12.04. I've set my cluster to have h_vmem as a consumable with a default of 4G (the compute nodes in my cluster have at least 64G of RAM each). I'm getting a really strange problem running Java though. My JDK is OpenJDK 7 (package 7~u3-2.1.1~pre1-1u

Re: [gridengine users] execd load sensors timing

2012-07-09 Thread Reuti
Am 09.07.2012 um 15:54 schrieb William Hay: > On 9 July 2012 14:08, Reuti wrote: >> Am 09.07.2012 um 14:51 schrieb William Hay: >> >>> On 9 July 2012 12:50, Reuti wrote: Am 09.07.2012 um 11:42 schrieb William Hay: > When execd starts is it safe to assume that the load sensors wil

Re: [gridengine users] execd load sensors timing

2012-07-09 Thread William Hay
On 9 July 2012 14:08, Reuti wrote: > Am 09.07.2012 um 14:51 schrieb William Hay: > >> On 9 July 2012 12:50, Reuti wrote: >>> Am 09.07.2012 um 11:42 schrieb William Hay: >>> When execd starts is it safe to assume that the load sensors will be run and reported back to the qmaster/schedule

Re: [gridengine users] integrate BLCR and SGE

2012-07-09 Thread Dave Love
Reuti writes: > Am 09.07.2012 um 06:28 schrieb mahbube rustaee: > >> When the job restarts and is rescheduled, it remains in the "Rr" state after the job has >> completed! >> The top command shows some processes are still running too. >> I checked ckpt.log for the clean_method script and it shows clean_method doesn't >> ru

Re: [gridengine users] execd load sensors timing

2012-07-09 Thread Reuti
Am 09.07.2012 um 14:51 schrieb William Hay: > On 9 July 2012 12:50, Reuti wrote: >> Am 09.07.2012 um 11:42 schrieb William Hay: >> >>> When execd starts is it safe to assume that the load sensors will be >>> run and reported back to the qmaster/scheduler before the node is >>> declared >>> conta

Re: [gridengine users] execd load sensors timing

2012-07-09 Thread William Hay
On 9 July 2012 12:50, Reuti wrote: > Am 09.07.2012 um 11:42 schrieb William Hay: > >> When execd starts is it safe to assume that the load sensors will be >> run and reported back to the qmaster/scheduler before the node is >> declared >> contactable/eligible for scheduling again? >> >> I have a l

Re: [gridengine users] integrate BLCR and SGE

2012-07-09 Thread Reuti
Am 09.07.2012 um 11:53 schrieb mahbube rustaee: > Yes, I mean that the state is "Rr" forever. The job is not completed from SGE's point of view! Aha, I got it the wrong way round, sorry. How does the job complete - a normal exit from the script? What's its state in `qacct`? -- Reuti > On Mon, Jul 9, 2012 at 1:48 PM,

Re: [gridengine users] execd load sensors timing

2012-07-09 Thread Reuti
Am 09.07.2012 um 11:42 schrieb William Hay: > When execd starts is it safe to assume that the load sensors will be > run and reported back to the qmaster/scheduler before the node is > declared > contactable/eligible for scheduling again? > > I have a load sensor that reports when the node was la

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-09 Thread William Hay
On 9 July 2012 11:09, Dave Love wrote: > William Hay writes: > >>> In the case of OpenMP jobs using the #$ -v and #$ -V prefix would be >>> superfluous as gridengine will start only one process. So the jsv method >>> appears the only way? >> Not sure what you mean about -v and -V. They contro

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-09 Thread Dave Love
William Hay writes: > On 6 July 2012 12:43, Mark Dixon wrote: ... >> 1) Tightly-integrated openmpi jobs started up orted on slave nodes with a >> $1 equal to " orted". I added something to strip initial spaces out to >> fix. Noticed with openmpi 1.4.0. > > That looks like an OpenMPI bug to me.

Re: [gridengine users] DRMAA timeout with gridMathematica

2012-07-09 Thread Dave Love
"MacMullan, Hugh" writes: > Thanks! That's what I thought. Bah. I'll try to push this with them ... it > really would be nice if it was a more flexible implementation anyway. Like > logging ... it opens a .o and .e file for each kernel, which gets pretty ugly > when launching 65 kernels over a

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-09 Thread Reuti
Am 09.07.2012 um 12:03 schrieb Dave Love: > Reuti writes: > >> Oh, right. The shell is missing, so it's the executable and not the >> shell builtin. Interesting effect. I wonder whether this is done in the >> kernel or qrsh_starter. > > I think it's the builtin_starter. You'd need to read the code fo

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-09 Thread Dave Love
William Hay writes: >> In the case of OpenMP jobs using the #$ -v and #$ -V prefix would be >> superfluous as gridengine will start only one process. So the jsv method >> appears the only way? > Not sure what you mean about -v and -V. They control an environment > variable which controls the

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-09 Thread Dave Love
William Hay writes: > On 6 July 2012 10:48, Dave Love wrote: > I'm currently testing a starter_method for our cluster. So far I haven't > found any unintended differences from how it behaves without one. Can you, > or anyone else, suggest anything I should look out for? You essentially need t

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-09 Thread Dave Love
Reuti writes: > Oh, right. The shell is missing, so it's the executable and not the > shell builtin. Interesting effect. I wonder whether this is done in the > kernel or qrsh_starter. I think it's the builtin_starter. You'd need to read the code for the messy details. > I mean: shell scripts in Linux

Re: [gridengine users] integrate BLCR and SGE

2012-07-09 Thread mahbube rustaee
Yes, I mean that the state is "Rr" forever. The job is not completed from SGE's point of view! On Mon, Jul 9, 2012 at 1:48 PM, Reuti wrote: > Am 09.07.2012 um 06:28 schrieb mahbube rustaee: > > > When the job restarts and is rescheduled, it remains in the "Rr" state after the job > has completed! > > The top command shows some processes

[gridengine users] execd load sensors timing

2012-07-09 Thread William Hay
When execd starts is it safe to assume that the load sensors will be run and reported back to the qmaster/scheduler before the node is declared contactable/eligible for scheduling again? I have a load sensor that reports when the node was last booted and would like to be sure that the time used fo
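For readers unfamiliar with load sensors, the kind of sensor described above might look roughly like the sketch below. It follows the usual Grid Engine load sensor convention: the sensor waits for a line on standard input, answers each request with a `begin` / `host:name:value` / `end` block, and exits when it reads `quit`. The complex name `boot_time` and the helper names are hypothetical, and the sketch does not answer the timing question raised in the message.

```python
import io


def format_report(host, name, value):
    """Build one load report block in begin/host:name:value/end form."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def run_sensor(inp, out, host, read_value):
    """Answer each request line from inp with one report written to out.

    inp, out:   file-like objects (sys.stdin and sys.stdout in real use).
    host:       the host name to report for.
    read_value: callable returning the current value, for example the
                boot time derived from /proc/uptime (an assumption).
    """
    for line in inp:
        if line.strip() == "quit":
            # execd asks the sensor to terminate.
            break
        out.write(format_report(host, "boot_time", read_value()))
        out.flush()  # the report must not sit in a buffer


# Small demonstration with in-memory streams instead of stdin/stdout.
requests = io.StringIO("\nquit\n")
replies = io.StringIO()
run_sensor(requests, replies, "node1", lambda: 1341820800)
print(replies.getvalue(), end="")
```

In real use the function would be called with `sys.stdin` and `sys.stdout`, and the complex would have to be defined in the cluster configuration before its values show up in `qhost -F`.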

Re: [gridengine users] integrate BLCR and SGE

2012-07-09 Thread Reuti
Am 09.07.2012 um 06:28 schrieb mahbube rustaee: > When the job restarts and is rescheduled, it remains in the "Rr" state after the job has > completed! > The top command shows some processes are still running too. > I checked ckpt.log for the clean_method script and it shows clean_method doesn't > run. > with qdel clean_me

Re: [gridengine users] export of environment variables from start_proc_args

2012-07-09 Thread William Hay
On 6 July 2012 20:10, Mark Dixon wrote: > On Fri, 6 Jul 2012, William Hay wrote: > ... >> That looks like an OpenMPI bug to me. Possible alternative ugly solution: >> make " orted" a link to orted (or a script that execs orted). I tend to >> think of starter_method as being a bit of a Swiss-army