Re: [gridengine users] Automatic CPU core binding - JSV script

2012-01-12 Thread Daniel Gruber
While core binding itself should work with such a topology (I never tried it) in 6.2u5, the reporting of the topology string will be wrong. As you might have noticed, string-based load values are only reported up to a length of 1024 bytes, which means that with 1000 nodes not the full topology strin

Re: [gridengine users] core binding problem

2012-04-26 Thread Daniel Gruber
It is not needed for Linux hosts with UGE 8.0.0 and above, but it is still needed for Solaris hosts since they use processor sets, which could be misused. Cheers Daniel On 26.04.2012 at 15:49, Rayson Ho wrote: > On Thu, Apr 26, 2012 at 7:40 AM, Pablo Escobar wrote: >> ¿maybe I am missing something in the

Re: [gridengine users] final maxvmem of a job

2012-05-19 Thread Daniel Gruber
On 19.05.2012 at 19:16, Farkas, Illes wrote: > Hello, > > Is there a command (or an argument/switch of qsub) that tells the queue > manager to write into a file the maximum amount of memory used by one of the > jobs during its entire life time? To the best of my knowledge, after a job > fini

Re: [gridengine users] Reservations and parallel environments

2012-05-25 Thread Daniel Gruber
The expected behavior would be that, if there is never a host with 12 slots free, your cluster will be filled up with 4-slot jobs, even when they have lower priorities. The reservation you gave the 12-slot jobs will be attached at the end of that year. Hence I would assume that your 12 slo

Re: [gridengine users] Reservations and parallel environments

2012-05-25 Thread Daniel Gruber
> On 25 May 2012 10:31, Daniel Gruber wrote: >> The expected behavior would be that when there is >> never an host with 12 slots free, that your cluster >> will be filled up with 4 slot jobs, even when they have >> lower priorities. The reservation you gave the 12 slot >

Re: [gridengine users] Reservations and parallel environments

2012-05-25 Thread Daniel Gruber
On 25.05.2012 at 12:35, Richard Ems wrote: > On 05/25/2012 12:27 PM, Daniel Gruber wrote: >> Exactly, looks like your runtime estimation for your slot4 jobs >> is smaller than for your slot12 jobs. Backfilling must be active >> here. Did you submit both jobs in exac

Re: [gridengine users] Understanding Parallel Enviroment ( whole nodes )

2012-06-08 Thread Daniel Gruber
With the allocation rule fillup, the scheduler tries to maximize the number of slots that can be collected on any host. The host selection order usually does *not* depend on the number of free slots (although this can be configured). It looks like you either already have some smaller jobs running on

Re: [gridengine users] qacct wildcards for parallel environments

2012-07-24 Thread Daniel Gruber
Try "qacct -b 120101 -pe" without anything after it. Daniel On 24.07.2012 at 13:52, Nick Holway wrote: > Dear all, > > I'm trying to get some aggregate stats for all our parallel > environments using qacct. I'm using "qacct -b 120101 -pe \*" and I > also tried it with the * in double quotes. Th

Re: [gridengine users] GPU node with pe and complex

2012-08-23 Thread Daniel Gruber
What you could do is create a queue for each GPU you have on a host and assign each of them a queue-exclusive GPU complex. The number of GPU queues then limits the number of GPU jobs. The total number of CPU cores must then be limited separately by an RQS on a per-host basis. Daniel On 23.08.2

Re: [gridengine users] Do not suspend job, kill instead

2012-08-23 Thread Daniel Gruber
You can set arbitrary signals to be sent when suspension is triggered (like SIGKILL). See: man queue_conf section "suspend_method" Daniel On 24.08.2012 at 03:13, Joseph Farran wrote: > Howdy. > > Is there a flag one can set on a job so that it will be killed instead of > being suspended for

Re: [gridengine users] Do not suspend job, kill instead

2012-08-24 Thread Daniel Gruber
On 24.08.2012 at 08:52, Daniel Gruber wrote: > You can set arbitrary signals to be sent when suspension > is triggered (like SIGKILL). See: man queue_conf > section "suspend_method" > > Daniel > > On 24.08.2012 at 03:13, Joseph Farran wrote: > >> H

Re: [gridengine users] in tandem qsub running

2012-09-26 Thread Daniel Gruber
The easiest way would be to give a job a name with qsub -N job1 (or use -terse to get the job id) and then use -hold_jid for the second job. You will find more details in the qsub man page. Of course you can also use DRMAA or, more unusually, an array job with task throttling (-tc 1). Dani
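The name-based dependency from this reply can be sketched as follows. This is only an illustration: first.sh and second.sh are hypothetical job scripts, and the helper prints the two qsub command lines instead of submitting anything, so it can be tried without a Grid Engine installation.

```shell
#!/bin/sh
# Dry run: print the two submissions instead of executing them.
# first.sh and second.sh are hypothetical job scripts.
chain_cmds() {
    name=$1; first=$2; second=$3
    # Submit the first job under a fixed name.
    echo "qsub -N $name $first"
    # The second job stays on hold until the job named $name has finished.
    echo "qsub -hold_jid $name $second"
}

chain_cmds job1 first.sh second.sh
```

On a real cluster the echo would be dropped. The -terse variant mentioned in the reply would capture the job id instead of relying on a name, for example jid=$(qsub -terse first.sh) followed by qsub -hold_jid "$jid" second.sh.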

Re: [gridengine users] How to Tell the running Wall Clock of a Job?

2012-10-25 Thread Daniel Gruber
On 26.10.2012 at 07:58, Joseph Farran wrote: > Howdy. > > One of my queues has a wall time hard limit of 4 days ( 96 hours ): > # qconf -sq queue | grep h_rt > h_rt 96:00:00 > > There is a job which has been running much longer than 4 days and I am not > sure how to get the

Re: [gridengine users] Distributed Job

2013-02-17 Thread Daniel Gruber
You need to configure a fixed allocation rule of 8 in your parallel environment and then request that PE on the command line. It is common to have multiple parallel environments for the same job type with different allocation rules. qconf -mp yourpe_08 ... allocation_rule 8 With a wildcard
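To illustrate what a fixed allocation rule implies for the slot request: every host contributes exactly that many slots, so the requested total should be a multiple of the rule. The helper below is a hypothetical sketch (hosts_for is not a Grid Engine command) that computes how many hosts a request would span.

```shell
#!/bin/sh
# With a fixed allocation rule, every host contributes exactly that many slots.
# hosts_for is a hypothetical helper, not part of Grid Engine.
hosts_for() {
    slots=$1; rule=$2
    # A request that does not divide evenly cannot be placed with this rule.
    if [ $((slots % rule)) -ne 0 ]; then
        echo "error: $slots slots is not a multiple of $rule" >&2
        return 1
    fi
    echo $((slots / rule))
}

# 16 slots with allocation_rule 8 span two hosts with 8 slots each.
hosts_for 16 8
```

Assuming the PE name from the reply, the matching submission would look like qsub -pe yourpe_08 16 job.sh, where job.sh is a placeholder script name.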

Re: [gridengine users] jsv and MPI core bind questions

2013-03-26 Thread Daniel Gruber
On 26.03.2013 at 17:10, Reuti wrote: > Hi, > > On 26.03.2013 at 12:17, Arnau Bria wrote: > >> I'm migrating a bash jsv script to perl and adding some >> modifications, but I have some doubts: >> >> 1) jsv_correct vs jsv_accept. From man: >> >> If the result_type is ACCEPTED the job will be

Re: [gridengine users] How specify the queue name with the DRMAA api

2013-05-29 Thread Daniel Gruber
Since queue requests are not a part of DRMAA1 you should use "DRMAA_NATIVE_SPECIFICATION", which allows you to set (almost) any qsub command line parameter available. You can also use job categories, but then you have to configure them in the qtasks file. DRMAA version 2 specifies "queueName" in the

Re: [gridengine users] Qlogin and Core binding

2014-06-27 Thread Daniel Gruber
Hi, Please note the difference between "set linear:1:0,0" and "set linear:1". The first one means: give me one core starting at socket 0 core 0 (which obviously means here that you are requesting core 0 on socket 0). The second means that you want one core on the host and the execution daemon takes
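The difference between the two request forms can be made explicit by splitting the string into its fields. The helper below is a hypothetical sketch (describe_binding is not a Grid Engine command) and only handles the linear:<amount>[:<socket>,<core>] form quoted in the reply.

```shell
#!/bin/sh
# Describe a linear binding request of the form linear:<amount>[:<socket>,<core>].
# describe_binding is a hypothetical helper, not part of Grid Engine.
describe_binding() {
    # Split the request on ":" into strategy, amount and optional start position.
    IFS=: read -r strategy amount start <<EOF
$1
EOF
    if [ "$strategy" != "linear" ]; then
        echo "error: only linear requests are handled here" >&2
        return 1
    fi
    if [ -n "$start" ]; then
        # Explicit start position given as <socket>,<core>.
        echo "$amount core(s), starting at socket ${start%,*} core ${start#*,}"
    else
        # No start position: the choice is left to the execution daemon.
        echo "$amount core(s), placement left to the execution daemon"
    fi
}

describe_binding linear:1:0,0
describe_binding linear:1
```

The first call reports an explicit start at socket 0 core 0, the second reports that no start position was requested.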

Re: [gridengine users] Enforce users to use specific amount of memory/slot

2014-06-30 Thread Daniel Gruber
Unfortunately there is no way in SGE to limit main memory. h_rss / s_rss does not work with the rlimit call in Linux kernel versions above 2.4. Hence in Univa Grid Engine we introduced multiple ways of limiting main memory. If you have cgroups support turned on then the cgroup takes car

Re: [gridengine users] Intel MIC integration with scheduler

2014-07-24 Thread Daniel Gruber
Hi Atul, We included Intel Xeon Phi support in Univa Grid Engine 8.1.3 in 2012. That required adding some new functionality which was missing in Sun Grid Engine. So basically what we did was: - Create a new resource type which allows you to do a mapping

Re: [gridengine users] Qsub flag for changing user

2015-09-14 Thread Daniel Gruber
Hi Joe, Univa Grid Engine 8.3 added such functionality to its APIs (WebService API) so that you can submit on behalf of another user. The intention is to simplify building web portals. But this is restricted to users listed in the new sudoers Grid Engine ACL. We can chat privately about that i

Re: [gridengine users] Trying to code C program to process SGE job-status email

2015-10-20 Thread Daniel Gruber
Hi Bill, You changed the global configuration (qconf -mconf or qconf -mconf global). This is most likely overridden by the host local configuration. Try changing it in the host local configuration (qconf -mconf ). You are right that it takes a few seconds for the changes to propagate, but it is

Re: [gridengine users] Core binding strange behaviour UGE 8.1

2016-02-29 Thread Daniel Gruber
Hi Mikhail, That is indeed strange and the support request is handled properly in the support portal. Things I can imagine: You are using host resources which request cores implicitly when requested (having cores attached with topology masks) or you are running into a rare strtok() issue

Re: [gridengine users] Looking for a solution to integrate SGE or similar with Jenkins/buildbot/similar and Vagrant/Docker/similar

2016-04-24 Thread Daniel Gruber
If you are referring to SGE_EXECD_PORT and SGE_QMASTER_PORT, for example, they are not really Univa specific. They are installation specific. If you installed UGE with self-set ports then they are required (set by the settings.sh file). If you install it taking the ports from the services f

Re: [gridengine users] bsub -w "started(aJob)"

2016-06-13 Thread Daniel Gruber
There is no direct support for that in SGE. That a job is released from hold (for example when another one starts) does not mean it is executed. Hence you would not have any guarantee that both are running at the same point in time. You could submit the successor before the other one and give it the job id of t

Re: [gridengine users] m_mem_free and cgroups

2020-08-12 Thread Daniel Gruber
Just to add to what Ondrej said - there are two different settings in the initial cgroup integration that was implemented. One allows over-committing memory as long as there is no memory pressure in the kernel. But the actual behavior depends on the Linux kernel. For debugging what Grid Engine set you can

Re: [gridengine users] Maximum memory for running process?

2011-08-06 Thread Daniel Gruber
On 03.08.2011 at 10:28, William Hay wrote: > On 2 August 2011 17:58, Rayson Ho wrote: >> It's a bug introduced by another bug fix in SGE 6.2u5, and Oracle was >> first who fixed the bug in Oracle Grid Engine. Then we added a >> workaround in SGE 6.2u5p1 in Open Grid Scheduler, and Son of Grid >>

Re: [gridengine users] Maximum memory for running process?

2011-08-08 Thread Daniel Gruber
On 08.08.2011 at 18:41, William Deegan wrote: > On 8/6/2011 12:59 AM, Daniel Gruber wrote: >> On 03.08.2011 at 10:28, William Hay wrote: >> >>> On 2 August 2011 17:58, Rayson Ho wrote: >>>> It's a bug introduced by another bug fix in SGE 6.2u5, and O