Sorry, but that is not the problem; every node is configured with 64 slots:

hostname              node045.cm.cluster
load_scaling          NONE
complex_values        slotsfree=8,ngpus=0
load_values           arch=lx26-amd64,num_proc=64,mem_total=258468.914062M, \
                      swap_total=31250.992188M,virtual_total=289719.906250M, \
                      load_avg=0.030000,load_short=0.000000, \
                      load_medium=0.030000,load_long=0.000000, \
                      mem_free=255753.085938M,swap_free=31250.992188M, \
                      virtual_free=287004.078125M,mem_used=2715.828125M, \
                      swap_used=0.000000M,virtual_used=2715.828125M, \
                      cpu=0.000000, \
                      m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
                      m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
                      m_socket=4,m_core=32,np_load_avg=0.000469, \
                      np_load_short=0.000000,np_load_medium=0.000469, \
                      np_load_long=0.000000
processors            64


hostname              node046.cm.cluster
load_scaling          NONE
complex_values        slotsfree=8,ngpus=0
load_values           arch=lx26-amd64,num_proc=64,mem_total=258472.425781M, \
                      swap_total=15623.996094M,virtual_total=274096.421875M, \
                      load_avg=0.010000,load_short=0.000000, \
                      load_medium=0.010000,load_long=0.050000, \
                      mem_free=255065.765625M,swap_free=15623.996094M, \
                      virtual_free=270689.761719M,mem_used=3406.660156M, \
                      swap_used=0.000000M,virtual_used=3406.660156M, \
                      cpu=0.000000, \
                      m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
                      m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
                      m_socket=4,m_core=32,np_load_avg=0.000156, \
                      np_load_short=0.000000,np_load_medium=0.000156, \
                      np_load_long=0.000781
processors            64

qname                 conmat
hostlist              node045.cm.cluster
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               mpich mpich2_smpd mvapich openmpi
rerun                 FALSE
slots                 64
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            admins conmat
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default


qname                 test
hostlist              node046.cm.cluster node047.cm.cluster
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               mpich mpich2_smpd mvapich openmpi
rerun                 FALSE
slots                 64
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                /cm/shared/apps/sge/current/cm/epilog
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            admins
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default



It doesn't matter to which queue I submit the script.

I would use the built-in slots complex, but when I use it, it gives me this error:

qsub -q conmat -l slots=5 submit.sh
Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly.
Exiting.
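
(For reference, the PE-based form of that request would be something like the
following; this is only a sketch, reusing the openmpi PE that is already in the
queue's pe_list and the same submit.sh:

    qsub -q conmat -pe openmpi 5 submit.sh

With a PE request the scheduler allocates the 5 slots itself, so no explicit
-l slots= is needed.)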


Regards

On 15/05/12 16:12, William Hay wrote:
Ok, that makes more sense.  The queue instance on node045 is called
conmat, not test.  If test only exists with a single slot on each of
node046 and node047, then when you request -q test you are restricting
the job to those two slots, which isn't enough for a 4-slot job.
We would really need the full output of qstat -f to be sure, though.


William
On 15 May 2012 14:42, Arturo <[email protected]> wrote:
More info:

output of qstat -f

---------------------------------------------------------------------------------
[email protected]      BIP   0/0/64         0.00     lx26-amd64
---------------------------------------------------------------------------------

############################################################################
  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
   74550 0.60500 test     arturo       qw    05/15/2012 15:26:50     4

qconf -sq test |grep slot

     slots                 64


qconf -sp openmpi |grep slots

slots              99999
urgency_slots      min

Regards

On 15/05/12 15:39, Arturo wrote:

Hi William,

you were right, it was running on several nodes:

   74545 0.60500 test     arturo       r     05/15/2012 15:17:46  [email protected]  MASTER
                                                                  [email protected]  SLAVE
   74545 0.60500 test     arturo       r     05/15/2012 15:17:46  [email protected]  SLAVE
   74545 0.60500 test     arturo       r     05/15/2012 15:17:46  [email protected]  SLAVE

Well, looking more deeply, the problem is that I created a consumable,
requestable complex value "slotsfree" and assigned it to node045 with the
value slotsfree=8 (for example).
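
(In case it helps, this is roughly how the complex is defined with qconf -mc
and attached to the host with qconf -me; the exact definition line below is
approximate, and the shortcut name "sf" is an assumption:

    # added via "qconf -mc":
    #name       shortcut  type  relop  requestable  consumable  default  urgency
    slotsfree   sf        INT   <=     YES          YES         0        0

    # attached via "qconf -me node045.cm.cluster":
    complex_values        slotsfree=8
)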

If I submit a job to this node using a parallel environment but without this
complex_value configured, it works perfectly. When I submit a job to this node
without a PE but with this complex_value configured, it also works. But when I
submit the same job using both the PE and the complex_value, it doesn't work,
and the output only says this:

cannot run in PE "openmpi" because it only offers 2 slots


Is it clearer now? Why doesn't it work if the PE is configured without a slot
limitation, the node has 64 slots, and the slotsfree value is greater than 4?
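
(In case it helps the diagnosis, the definition of the complex and the
per-queue-instance availability the scheduler sees can be checked with
something like the following; the exact filter strings are only a guess:

    qconf -sc | grep slotsfree
    qstat -F slotsfree -q conmat
)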

Thanks for your help.

Regards
Arturo


On 15/05/12 14:33, William Hay wrote:

On 15 May 2012 13:05, Arturo <[email protected]> wrote:

Hi,

I have a very strange behaviour when I try to use a parallel environment
with hard_queue_list option.

In my script I have a parallel configuration:

      #$ -pe openmpi 4

and if submit the script in the following way it works and runs in node
test@node045

      qsub script.sh

But If I submit the script using the hard_queue_list it doesn't run:

      qsub -q test script.sh

With this error:

      cannot run in PE "openmpi" because it only offers 2 slots

Obviously, the node is always empty. What may be wrong?

It's hard to diagnose what's going on without knowing more about your
configuration.
Are you certain the entire job is running in the queue instance
test@node045 when you submit without a queue list?
One possibility is that queue test@node045 has only two slots.  The
master slot of the job plus one slave run in test@node045 while the
remaining slots run elsewhere.

When the job is running what output do you get from qstat -g t?

William





--
Arturo Giner Gracia
HPC research group System Administrator
Instituto de Biocomputación y Física de Sistemas Complejos (BIFI)
Universidad de Zaragoza
e-mail: [email protected]
phone: (+34) 976762992


