On 21 February 2012 19:20, Txema Heredia Genestar <[email protected]> wrote:
> Hello all,
>
> I am having some problems running threaded jobs in SGE 6.1u4. In our
> cluster, h_vmem is defined as a consumable attribute in all nodes. It is
> mandatory, all jobs must request it, with a default value of 6Gb. That
> constraint leads any "parallel" job sent to the cluster to try to
> reserve a lot of memory (h_vmem * slots). This is ok for most parallel
> processes (MPI and the like). But, sometimes, we need to run "threaded"
> jobs, where all threads share a chunk of memory (everything on a single
> node). This leads to situations where I need to send an 8-threaded job
> that requires, say, 10 Gb of memory, but it cannot be scheduled because
> no node can handle an 80Gb request. When a memory request cannot be
> fulfilled, the typical message of "cannot run in PE "smp" because it
> only offers N slots" appears in qstat (where N is the maximum number of
> slots I would be able to use given the requested h_vmem size).
The trick we use here is that, rather than setting up a PE, we just add
a per-host consumable called threads. We use a JSV to ensure everyone
requests at least one thread per slot. I'm not sure if a JSV is
available in 6.1u4, so you might have to trust your users. In versions
later than the 6.2u3 we're using, Grid Engine tries to allocate jobs to
specific cores, which might create a few issues since it doesn't know
about our consumable.

I believe that for h_vmem the resource consumption is aggregated over
all slots on a node before being applied, so you could just get your
users to divide their memory requirement by the number of slots when
submitting (or, if you have a JSV, get it to do that for them when
using the SMP PE).
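
A minimal sketch of that division, assuming memory is expressed in
megabytes and that the scheduler multiplies h_vmem by the slot count as
described in the original question. The function names and job.sh are
hypothetical, made up for illustration; only the qsub options -pe and -l
come from Grid Engine itself.

```python
def per_slot_h_vmem_mb(total_mb: int, slots: int) -> int:
    """Per-slot h_vmem request in MB, rounded up so that
    slots * result is never less than the job's total need."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return -(-total_mb // slots)  # ceiling division on integers


def max_slots_offered(node_mb: int, per_slot_mb: int) -> int:
    """How many slots fit on a node when each slot reserves per_slot_mb."""
    return node_mb // per_slot_mb


def qsub_args(total_mb: int, slots: int, script: str, pe: str = "smp") -> list:
    """Build a qsub argument list that requests total_mb split across slots.
    The script name and the PE name default are assumptions."""
    per_slot = per_slot_h_vmem_mb(total_mb, slots)
    return ["qsub", "-pe", pe, str(slots), "-l", f"h_vmem={per_slot}M", script]


if __name__ == "__main__":
    # An 8-thread job that needs 10 GB in total, not 10 GB per slot.
    print(" ".join(qsub_args(10 * 1024, 8, "job.sh")))
```

With the 8-thread, 10 GB case from the question this asks for 1280M per
slot, so the aggregated reservation is 10 GB rather than 80 GB.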


William
>
> This is the parallel environment I am trying to use:
>
> # qconf -sp smp
> pe_name           smp
> slots             9999
> user_lists        test_users
> xuser_lists       NONE
> start_proc_args   /bin/true
> stop_proc_args    /bin/true
> allocation_rule   $fill_up
> control_slaves    FALSE
> job_is_first_task FALSE
> urgency_slots     min
>
> The most annoying part of all this is that this behaviour is not
> consistent: This morning I've been able to run a 6-threaded job
> requesting 10Gb of memory in a 48Gb node. But, in the afternoon, the
> same job using the very same command in the same node could not be run.
>
> Does anyone have any suggestion on how to deal with this?
>
> Thanks in advance,
>
> Txema
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
>
