Hello all,
I am having some problems running threaded jobs in SGE 6.1u4. In our
cluster, h_vmem is defined as a consumable attribute on all nodes. It is
mandatory: every job must request it, and the default value is 6 GB.
Because of this, any "parallel" job sent to the cluster tries to
reserve a lot of memory (h_vmem * slots). This is fine for most parallel
processes (MPI and the like). Sometimes, however, we need to run
"threaded" jobs, where all threads share a single chunk of memory
(everything on a single node). This leads to situations where I need to
submit an 8-threaded job that requires, say, 10 GB of memory in total,
but it cannot be scheduled because no node can handle an 80 GB request.
When a memory request cannot be fulfilled, the usual message 'cannot run
in PE "smp" because it only offers N slots' appears in qstat (where N is
the maximum number of slots I would be able to use given the requested
h_vmem size).
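
To make the arithmetic concrete, here is a minimal sketch of how I
understand the reservation to be computed. The function names are
hypothetical and exist only for this illustration, and the per-slot
multiplication is my assumption about how the scheduler treats a
consumable h_vmem, not something taken from the SGE sources.

```python
# Illustration only: these helpers are hypothetical and model my
# understanding of a per-slot consumable, not actual SGE code.

def total_reservation_gb(h_vmem_gb: float, slots: int) -> float:
    """Memory the scheduler tries to reserve: h_vmem times slots."""
    return h_vmem_gb * slots


def max_slots(node_free_gb: float, h_vmem_gb: float) -> int:
    """Largest slot count a node could offer for a given h_vmem (the N in the qstat message)."""
    return int(node_free_gb // h_vmem_gb)


def per_slot_request_gb(job_total_gb: float, slots: int) -> float:
    """h_vmem value whose per-slot multiplication adds up to the job's real total."""
    return job_total_gb / slots


if __name__ == "__main__":
    # An 8-threaded job submitted with h_vmem=10G is treated as an 80 GB request.
    print(total_reservation_gb(10, 8))
    # Slots a node with 48 GB still free could offer at h_vmem=10G.
    print(max_slots(48, 10))
    # Per-slot value that would add up to 10 GB across 8 slots.
    print(per_slot_request_gb(10, 8))
```

If this model is right, requesting the job's total memory divided by the
number of slots would keep the reservation at the real total, at the
cost of a per-process limit smaller than what the job actually needs.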
This is the parallel environment I am trying to use:
# qconf -sp smp
pe_name smp
slots 9999
user_lists test_users
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
The most annoying part of all this is that the behaviour is not
consistent: this morning I was able to run a 6-threaded job requesting
10 GB of memory on a 48 GB node. In the afternoon, however, the same
job, submitted with the very same command to the same node, could not
be run.
Does anyone have any suggestions on how to deal with this?
Thanks in advance,
Txema