Sorry, but that is not the problem; every node is configured with 64 slots:
hostname node045.cm.cluster
load_scaling NONE
complex_values slotsfree=8,ngpus=0
load_values
arch=lx26-amd64,num_proc=64,mem_total=258468.914062M, \
swap_total=31250.992188M,virtual_total=289719.906250M, \
load_avg=0.030000,load_short=0.000000, \
load_medium=0.030000,load_long=0.000000, \
mem_free=255753.085938M,swap_free=31250.992188M, \
virtual_free=287004.078125M,mem_used=2715.828125M, \
swap_used=0.000000M,virtual_used=2715.828125M, \
cpu=0.000000, \
m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
m_socket=4,m_core=32,np_load_avg=0.000469, \
np_load_short=0.000000,np_load_medium=0.000469, \
np_load_long=0.000000
processors 64
hostname node046.cm.cluster
load_scaling NONE
complex_values slotsfree=8,ngpus=0
load_values
arch=lx26-amd64,num_proc=64,mem_total=258472.425781M, \
swap_total=15623.996094M,virtual_total=274096.421875M, \
load_avg=0.010000,load_short=0.000000, \
load_medium=0.010000,load_long=0.050000, \
mem_free=255065.765625M,swap_free=15623.996094M, \
virtual_free=270689.761719M,mem_used=3406.660156M, \
swap_used=0.000000M,virtual_used=3406.660156M, \
cpu=0.000000, \
m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
m_socket=4,m_core=32,np_load_avg=0.000156, \
np_load_short=0.000000,np_load_medium=0.000156, \
np_load_long=0.000781
processors 64
qname conmat
hostlist node045.cm.cluster
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list mpich mpich2_smpd mvapich openmpi
rerun FALSE
slots 64
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists admins conmat
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
qname test
hostlist node046.cm.cluster node047.cm.cluster
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list mpich mpich2_smpd mvapich openmpi
rerun FALSE
slots 64
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog /cm/shared/apps/sge/current/cm/epilog
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists admins
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
It doesn't matter which queue I submit the script to.
I would use the built-in slots complex, but when I use it, I get this error:
qsub -q conmat -l slots=5 submit.sh
Unable to run job: "job" denied: use parallel environments instead of
requesting slots explicitly.
Exiting.
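As the error suggests, the slot count apparently has to be expressed through a PE request rather than -l slots; a minimal sketch, assuming the openmpi PE from the queue's pe_list is the intended one:

# request 5 slots through the openmpi parallel environment instead of "-l slots=5"
qsub -q conmat -pe openmpi 5 submit.sh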
Regards
On 15/05/12 16:12, William Hay wrote:
OK, that makes more sense. The queue instance on node045 is called
conmat, not test. If test only exists as a single slot on each of
node046 and node047,
then when you request -q test you are restricting the job to those two
slots, which isn't enough for a 4-slot job.
We would really need the full output of qstat -f to be sure, though.
William
On 15 May 2012 14:42, Arturo<[email protected]> wrote:
More info:
output of qstat -f
---------------------------------------------------------------------------------
[email protected] BIP 0/0/64 0.00 lx26-amd64
---------------------------------------------------------------------------------
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
74550 0.60500 test arturo qw 05/15/2012 15:26:50 4
qconf -sq test |grep slot
slots 64
qconf -sp openmpi |grep slots
slots 99999
urgency_slots min
Regards
On 15/05/12 15:39, Arturo wrote:
Hi William,
you were right, it was running on several nodes:
74545 0.60500 test arturo r 05/15/2012 15:17:46
[email protected] MASTER
[email protected] SLAVE
74545 0.60500 test arturo r 05/15/2012 15:17:46
[email protected] SLAVE
74545 0.60500 test arturo r 05/15/2012 15:17:46
[email protected] SLAVE
Well, looking more deeply, the problem is that I created a consumable, requestable
complex "slotsfree" and assigned it to node045 with the value:
slotsfree=8 (for example).
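For reference, the consumable was added to the complex configuration along these lines (a sketch of the qconf -sc entry; the shortcut, default and urgency columns are assumptions):

#name       shortcut   type   relop   requestable   consumable   default   urgency
slotsfree   sf         INT    <=      YES           YES          0         0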
If I submit a job to this node using a parallel environment but without requesting
this complex value, it works perfectly.
And when I submit a job to this node without a PE, but requesting this
complex value, it also works.
But when I submit the same job using both the PE and the complex value, it doesn't
work, and the scheduling output only says:
cannot run in PE "openmpi" because it only offers 2 slots
Is it clearer now? Why does it not work, when the PE is configured without a slot
limitation, the node has 64 slots, and the slotsfree value is greater than 4?
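To spell the three cases out as commands (a sketch; the script name and the requested slotsfree amount are assumptions):

# 1) PE request only, no slotsfree -> works
qsub -pe openmpi 4 script.sh
# 2) slotsfree request only, no PE -> works
qsub -l slotsfree=4 script.sh
# 3) PE request plus slotsfree request -> fails with "only offers 2 slots"
qsub -pe openmpi 4 -l slotsfree=4 script.sh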
Thanks for your help.
Regards
Arturo
On 15/05/12 14:33, William Hay wrote:
On 15 May 2012 13:05, Arturo<[email protected]> wrote:
Hi,
I see some very strange behaviour when I try to use a parallel environment
together with the hard_queue_list option.
In my script I have a parallel environment request:
#$ -pe openmpi 4
and if I submit the script in the following way, it works and runs in
test@node045:
qsub script.sh
But if I submit the script using the hard_queue_list, it doesn't run:
qsub -q test script.sh
It fails with this error:
cannot run in PE "openmpi" because it only offers 2 slots
Obviously, the node is always empty. What may be wrong?
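For context, the submit script header is essentially just this (a minimal sketch; the shebang and the program invocation are assumptions standing in for the real workload):

#!/bin/bash
#$ -pe openmpi 4                     # request 4 slots from the openmpi PE
#$ -cwd                              # (assumption) run in the submission directory
mpirun -np $NSLOTS ./my_program      # (assumption) hypothetical MPI program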
It's hard to diagnose what's going on without knowing more about your
configuration.
Are you certain the entire job is running in the queue instance
test@node045 when you submit without a queue list?
One possibility is that queue test@node045 has only two slots. The
master slot of the job plus one slave run
in test@node045 while the remaining slots run elsewhere.
When the job is running, what output do you get from qstat -g t?
William
--
Arturo Giner Gracia
HPC research group System Administrator
Instituto de Biocomputación y Física de Sistemas Complejos (BIFI)
Universidad de Zaragoza
e-mail: [email protected]
phone: (+34) 976762992