On 15 May 2012 15:23, Arturo <[email protected]> wrote:
> Sorry, but that is not the problem; every node is configured with 64 slots:
>
> hostname              node045.cm.cluster
> load_scaling          NONE
> complex_values        slotsfree=8,ngpus=0
> load_values           arch=lx26-amd64,num_proc=64,mem_total=258468.914062M, \
>                       swap_total=31250.992188M,virtual_total=289719.906250M, \
>                       load_avg=0.030000,load_short=0.000000, \
>                       load_medium=0.030000,load_long=0.000000, \
>                       mem_free=255753.085938M,swap_free=31250.992188M, \
>                       virtual_free=287004.078125M,mem_used=2715.828125M, \
>                       swap_used=0.000000M,virtual_used=2715.828125M, \
>                       cpu=0.000000, \
>                       m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
>                       m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
>                       m_socket=4,m_core=32,np_load_avg=0.000469, \
>                       np_load_short=0.000000,np_load_medium=0.000469, \
>                       np_load_long=0.000000
> processors            64
>
>
> hostname              node046.cm.cluster
> load_scaling          NONE
> complex_values        slotsfree=8,ngpus=0
> load_values           arch=lx26-amd64,num_proc=64,mem_total=258472.425781M, \
>                       swap_total=15623.996094M,virtual_total=274096.421875M, \
>                       load_avg=0.010000,load_short=0.000000, \
>                       load_medium=0.010000,load_long=0.050000, \
>                       mem_free=255065.765625M,swap_free=15623.996094M, \
>                       virtual_free=270689.761719M,mem_used=3406.660156M, \
>                       swap_used=0.000000M,virtual_used=3406.660156M, \
>                       cpu=0.000000, \
>                       m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
>                       m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
>                       m_socket=4,m_core=32,np_load_avg=0.000156, \
>                       np_load_short=0.000000,np_load_medium=0.000156, \
>                       np_load_long=0.000781
> processors            64
>
> qname                 conmat
> hostlist              node045.cm.cluster
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               mpich mpich2_smpd mvapich openmpi
> rerun                 FALSE
> slots                 64
> tmpdir                /tmp
> shell                 /bin/csh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            admins conmat
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
>
>
> qname                 test
> hostlist              node046.cm.cluster node047.cm.cluster
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               mpich mpich2_smpd mvapich openmpi
> rerun                 FALSE
> slots                 64
> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                /cm/shared/apps/sge/current/cm/epilog
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            admins
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
>
>
>
> It doesn't matter to which queue I submit the script.
>
> I would use the built-in slots complex, but when I use it, it gives me this
> error:
>
> qsub -q conmat -l slots=5 submit.sh
> Unable to run job: "job" denied: use parallel environments instead of
> requesting slots explicitly.
> Exiting.
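That error is the scheduler telling you that slot counts should come from a
parallel environment rather than from the built-in slots complex.  A minimal
sketch of the PE-style equivalent, assuming submit.sh and the openmpi PE
already listed in your pe_list:

    qsub -q conmat -pe openmpi 5 submit.sh

With -pe the scheduler allocates the five slots itself, so nothing needs to
request slots directly.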

What exactly is "slotsfree" supposed to represent, though?  It looks
like the reason most of your jobs don't schedule is that you don't
have enough "slotsfree": the request is consumed once per slot, so
your "slotsfree" consumption is multiplied by the number of slots the
job is granted on each host.
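A complex declared with consumable=YES is charged per granted slot, which is
where that multiplication comes from.  A rough worked example, assuming the
failing job requested something like -l slotsfree=4 alongside the PE (the
exact request isn't shown in this thread):

    # node045 offers slotsfree=8 (from its complex_values above)
    qsub -q conmat -pe openmpi 4 -l slotsfree=4 submit.sh
    # per-slot accounting: 4 slots x 4 slotsfree = 16 needed, but only 8
    # are on offer, so at most 8/4 = 2 slots fit on that host -- which
    # matches "cannot run in PE "openmpi" because it only offers 2 slots"

If slotsfree is meant to be counted once per job rather than once per slot,
declaring it with consumable=JOB in qconf -mc (where your Grid Engine
version supports it) may behave more like you expect.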




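One hedged way to double-check how the complex was declared (the column
values below are only an assumption, since your qconf -sc line isn't shown):

    qconf -sc | grep slotsfree
    #  slotsfree   sf   INT   <=   YES   YES   0   0
    # the sixth column is "consumable"; if it is YES, the request is
    # multiplied by the number of slots the PE job receives on the node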
>
>
> Regards
>
> On 15/05/12 16:12, William Hay wrote:
>
> OK, that makes more sense.  The queue instance on node045 is called
> conmat, not test.  If test only exists with a single slot on each of
> node046 and node047, then when you request -q test you are restricting
> the job to those two slots, which isn't enough for a 4-slot job.
> We would really need the full output of qstat -f to be sure, though.
>
>
> William
> On 15 May 2012 14:42, Arturo <[email protected]> wrote:
>
> More info:
>
> output of qstat -f
>
> ---------------------------------------------------------------------------------
> [email protected]      BIP   0/0/64         0.00     lx26-amd64
> ---------------------------------------------------------------------------------
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>   74550 0.60500 test     arturo       qw    05/15/2012 15:26:50     4
>
> qconf -sq test |grep slot
>
>     slots                 64
>
>
> qconf -sp openmpi |grep slots
>
> slots              99999
> urgency_slots      min
>
> Regards
>
> On 15/05/12 15:39, Arturo wrote:
>
> Hi William,
>
> you were right, it was running on several nodes:
>
>   74545 0.60500 test     arturo       r     05/15/2012 15:17:46 [email protected]      MASTER
>                                                                 [email protected]      SLAVE
>   74545 0.60500 test     arturo       r     05/15/2012 15:17:46 [email protected]        SLAVE
>   74545 0.60500 test     arturo       r     05/15/2012 15:17:46 [email protected]        SLAVE
>
> Well, looking more deeply, the problem is that I created a complex
> "slotsfree", consumable and requestable, and assigned it to node045 with
> the value slotsfree=8 (for example).
>
> If I submit a job using a parallel environment to this node without
> configuring this complex_value, it works perfectly.
> And when I submit a job without using a PE to this node, but with this
> complex_value configured, it also works,
> but when I submit the same job using a PE and the complex_value, it doesn't
> work, and the output only says this:
>
> cannot run in PE "openmpi" because it only offers 2 slots
>
>
> Is it clearer now? Why doesn't it work if the PE is configured without a
> slot limitation, the node has 64 slots, and the slotsfree value is greater
> than 4?
>
> Thanks for your help.
>
> Regards
> Arturo
>
>
> On 15/05/12 14:33, William Hay wrote:
>
> On 15 May 2012 13:05, Arturo<[email protected]>  wrote:
>
> Hi,
>
> I have a very strange behaviour when I try to use a parallel environment
> with hard_queue_list option.
>
> In my script I have a parallel configuration:
>
>      #$ -pe openmpi 4
>
> and if I submit the script in the following way it works and runs in the
> queue instance test@node045
>
>      qsub script.sh
>
> But if I submit the script using the hard_queue_list, it doesn't run:
>
>      qsub -q test script.sh
>
> With this error:
>
>      cannot run in PE "openmpi" because it only offers 2 slots
>
> Obviously, the node is always empty. What may be wrong?
>
> It's hard to diagnose what's going on without knowing more about your
> configuration.
> Are you certain the entire job is running in the queue instance
> test@node045 when you submit without a queue list?
> One possibility is that queue test@node045 has only two slots.  The
> master slot of the job plus one slave runs
> in test@node045 while the remaining slots run elsewhere.
>
> When the job is running what output do you get from qstat -g t?
>
> William
>
>
>
>
>
> --
> Arturo Giner Gracia
> HPC research group System Administrator
> Instituto de Biocomputación y Física de Sistemas Complejos (BIFI)
> Universidad de Zaragoza
> e-mail: [email protected]
> phone: (+34) 976762992
>

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
