Hi,

On 21.02.2012 at 20:20, Txema Heredia Genestar wrote:

> Hello all,
> 
> I am having some problems running threaded jobs in SGE 6.1u4. In our cluster,
> h_vmem is defined as a consumable attribute on all nodes. It is mandatory:
> all jobs must request it, with a default value of 6 GB. This constraint makes
> any "parallel" job sent to the cluster try to reserve a lot of memory
> (h_vmem * slots). This is fine for most parallel processes (MPI and the like).
> But sometimes we need to run "threaded" jobs, where all threads share a chunk
> of memory (everything on a single node). This leads to situations where I
> need to submit an 8-threaded job that requires, say, 10 GB of memory, but it
> cannot be scheduled because no node can handle an 80 GB request. When a memory
> request cannot be fulfilled, the typical message "cannot run in PE "smp"
> because it only offers N slots" appears in qstat (where N is the maximum
> number of slots I would be able to use given the requested h_vmem size).
> 
> This is the parallel environment I am trying to use:
> 
> # qconf -sp smp
> pe_name           smp
> slots             9999
> user_lists        test_users
> xuser_lists       NONE
> start_proc_args   /bin/true
> stop_proc_args    /bin/true
> allocation_rule   $fill_up

For SMP mode you will need $pe_slots here, unless you additionally request
exactly one node in the submission command.

I assume that earlier the job was simply spread across more than one node.
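To illustrate why the job can end up on several nodes with $fill_up but not with $pe_slots, here is a simplified model of the two allocation rules. It assumes h_vmem is a per-slot consumable (type YES) and ignores everything else the scheduler weighs; the host names, free capacities and function names are hypothetical, and hosts are visited in plain dict order rather than the scheduler's own sort order.

```python
from typing import Dict, Optional, Tuple

# host name -> (free slots, free h_vmem in GB); values are illustrative only
Hosts = Dict[str, Tuple[int, float]]


def slots_offered(free_slots: int, free_mem_gb: float, h_vmem_gb: float) -> int:
    """Slots a host can offer when each slot consumes h_vmem_gb (consumable YES)."""
    return min(free_slots, int(free_mem_gb // h_vmem_gb))


def allocate(hosts: Hosts, slots: int, h_vmem_gb: float, rule: str) -> Optional[Dict[str, int]]:
    """Simplified model of two PE allocation rules; returns host -> slots, or None."""
    if rule == "$pe_slots":
        # Every slot must come from one single host.
        for name, (free_slots, free_mem) in hosts.items():
            if slots_offered(free_slots, free_mem, h_vmem_gb) >= slots:
                return {name: slots}
        return None
    if rule == "$fill_up":
        # Fill each host in turn (here: dict order) until enough slots are found.
        grant: Dict[str, int] = {}
        remaining = slots
        for name, (free_slots, free_mem) in hosts.items():
            take = min(remaining, slots_offered(free_slots, free_mem, h_vmem_gb))
            if take > 0:
                grant[name] = take
                remaining -= take
            if remaining == 0:
                return grant
        return None
    raise ValueError(f"unsupported allocation rule: {rule}")


if __name__ == "__main__":
    two_idle_nodes = {"node01": (8, 48.0), "node02": (8, 48.0)}
    print(allocate(two_idle_nodes, 6, 10.0, "$pe_slots"))  # None: 48 // 10 = 4 < 6
    print(allocate(two_idle_nodes, 6, 10.0, "$fill_up"))   # {'node01': 4, 'node02': 2}
```

In this model a 6-slot job at 10 GB per slot never fits on one 48 GB node, but $fill_up can still place it across two nodes. If the second node is busy (say only 5 GB free), $fill_up fails as well, which would be consistent with a job that runs in the morning but not in the afternoon.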

==

Bob's suggestion to change the consumable type of the complex h_vmem to JOB
would help for this type of job, but not if you also have MPI jobs in the
cluster. I filed an RFE to introduce this at the PE level:

https://arc.liv.ac.uk/trac/SGE/ticket/197

To cite from the issue: "Therefore I wrote, that an entry in the PE would still
be advantageous: h_vmem can only be JOBS or YES"
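The difference between the two consumable types can be put in numbers for the 8-slot, 10 GB job from the original question. This is a simplified model of the scheduler's bookkeeping only (it ignores which hosts get charged); the function name is illustrative.

```python
def reserved_gb(h_vmem_gb: float, slots: int, consumable: str) -> float:
    """Memory debited from the h_vmem consumable for one PE job (simplified)."""
    if consumable == "YES":
        # Request is per slot: debited once for every granted slot.
        return h_vmem_gb * slots
    if consumable == "JOB":
        # Request is per job: debited once, regardless of the slot count.
        return h_vmem_gb
    raise ValueError(f"unknown consumable type: {consumable}")


if __name__ == "__main__":
    print(reserved_gb(10, 8, "YES"))  # 80
    print(reserved_gb(10, 8, "JOB"))  # 10
```

Because the type is a property of the complex rather than of the PE, switching it to JOB changes the accounting for every job, including MPI jobs whose processes each need their own share of memory.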

==

For now, you could adjust the memory request in a JSV depending on the
requested PE, but this requires 6.2 IIRC.
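The decision such a JSV would make can be sketched as plain logic. This does not speak the actual JSV protocol (a real JSV would be a script using the helpers shipped with 6.2); it only models the arithmetic, assuming users of a shared-memory PE request the job's total memory, which is then divided over the slots. The PE set and the function name are hypothetical.

```python
import math
from typing import Optional

# Hypothetical set of PEs whose jobs run all threads on one host.
SHARED_MEMORY_PES = {"smp"}


def adjusted_h_vmem_mb(pe_name: Optional[str], slots: int, requested_mb: int) -> int:
    """Return the per-slot h_vmem (MB) a JSV might set.

    Assumes users of a shared-memory PE request the job's *total* memory,
    which is spread over the slots so that h_vmem * slots covers that total.
    """
    if pe_name in SHARED_MEMORY_PES and slots > 1:
        # Round up so the per-slot values never add up to less than requested.
        return math.ceil(requested_mb / slots)
    return requested_mb  # serial and MPI jobs: leave the per-slot request alone


if __name__ == "__main__":
    print(adjusted_h_vmem_mb("smp", 8, 10240))  # 1280
    print(adjusted_h_vmem_mb("mpi", 8, 6144))   # 6144
```

Rounding up means the reservation can slightly exceed the total (6 slots of 1707 MB is 10242 MB for a 10240 MB request), which seems safer than under-reserving.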

-- Reuti


> control_slaves    FALSE
> job_is_first_task FALSE
> urgency_slots     min
> 
> The most annoying part of all this is that the behaviour is not consistent:
> this morning I was able to run a 6-threaded job requesting 10 GB of memory
> on a 48 GB node. But in the afternoon, the same job, submitted with the very
> same command to the same node, could not be run.
> 
> Does anyone have any suggestions on how to deal with this?
> 
> Thanks in advance,
> 
> Txema
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
