Hi, I'm finally getting back to this post. I've looked at the links and suggestions in the two replies to my original post from a few months ago, but they haven't helped.
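For reference, my understanding of the submit-time verification suggestion from those replies is roughly the following (the script name here is just a placeholder, and I may well be misreading the intent of -w):

# submit with scheduler verification turned on:
[mgstauff@chead ~]$ qsub -w p -binding linear:2 -pe unihost 2 myscript.sh

# or ask about a job that's already queued (job ID taken from the stuck-job example further down):
[mgstauff@chead ~]$ qalter -w p 3714924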
Here's my original post:

I'm getting some queued jobs with scheduling info that includes this line at the end:

cannot run in PE "unihost" because it only offers 0 slots

'unihost' is the only PE I use. When users request multiple slots, they use 'unihost':

qsub ... -binding linear:2 -pe unihost 2 ...

What happens is that these jobs aren't running when it otherwise seems like they should be, or they sit waiting in the queue for a long time even when the user has plenty of quota available within the queue they've requested, there are enough resources available on the queue's nodes per qhost (slots and vmem are consumables), and qquota isn't showing that any RQS limits have been reached. Below I've dumped the relevant configurations.

Today I created a new PE called "int_test" to test the "integer" allocation rule. I set it to 16 (16 cores per node), and have also tried 8. It's been added as a PE to the queues we use. When I try to run a job in this new PE, however, it *always* fails with the same "PE ... offers 0 slots" error, even when I can run the same multi-slot job in the "unihost" PE at the same time. I'm not sure whether this helps with debugging or not.

Another thought - this behavior started happening more or less when I tried implementing fairshare some time ago. I never seemed to get fairshare working right. We haven't been able to confirm it, but for some users this "PE 0 slots" issue seems to pop up only after they've been running other jobs for a little while. So I'm wondering if I've screwed up fairshare in some way that's causing this odd behavior. (At the very bottom, after the quoted reply, I've also added a couple of extra checks I've been poking at on the scheduler and the fairshare setup.)

The default queue from the global config file is all.q. Here are the various config dumps. Is there anything else that might be helpful? Thanks for any help! This has been plaguing me.

[root@chead ~]# qconf -sp unihost
pe_name            unihost
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

[root@chead ~]# qconf -sp int_test
pe_name            int_test
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

[root@chead ~]# qconf -ssconf
algorithm                         default
schedule_interval                 0:0:5
maxujobs                          200
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          1
usage_weight_list                 cpu=0.700000,mem=0.200000,io=0.100000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         1000
weight_tickets_share              100000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   2000
report_pjob_tickets               TRUE
max_pending_tasks_per_job         100
halflife_decay_list               none
policy_hierarchy                  OS
weight_ticket                     0.000000
weight_waiting_time               1.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  INFINITY

[root@chead ~]# qconf -sconf
#global:
execd_spool_dir              /opt/sge/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,bash,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           02:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 ENABLE_BINDING=true
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:15 joblog=true sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               /opt/sge/bin/cfn-qlogin.sh
qlogin_daemon                /usr/sbin/sshd -i
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   4000
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             100
auto_user_default_project    none
auto_user_delete_time        0
delegated_file_staging       false
reprioritize                 0
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

[root@chead ~]# qconf -sq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               make mpich mpi orte unihost serial int_test unihost2
rerun                 FALSE
slots                 1,[compute-0-0.local=4],[compute-0-1.local=15], \
                      [compute-0-2.local=15],[compute-0-3.local=15], \
                      [compute-0-4.local=15],[compute-0-5.local=15], \
                      [compute-0-6.local=16],[compute-0-7.local=16], \
                      [compute-0-9.local=16],[compute-0-10.local=16], \
                      [compute-0-11.local=16],[compute-0-12.local=16], \
                      [compute-0-13.local=16],[compute-0-14.local=16], \
                      [compute-0-15.local=16],[compute-0-16.local=16], \
                      [compute-0-17.local=16],[compute-0-18.local=16], \
                      [compute-0-8.local=16],[compute-0-19.local=14], \
                      [compute-0-20.local=4],[compute-gpu-0.local=4]
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                40G,[compute-0-20.local=3.2G], \
                      [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
h_vmem                40G,[compute-0-20.local=3.2G], \
                      [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]

qstat -j on a stuck job, as an example:

[mgstauff@chead ~]$ qstat -j 3714924
==============================================================
job_number:                 3714924
exec_file:                  job_scripts/3714924
submission_time:            Fri Aug 11 12:48:47 2017
owner:                      mgstauff
uid:                        2198
group:                      mgstauff
gid:                        2198
sge_o_home:                 /home/mgstauff
sge_o_log_name:             mgstauff
sge_o_path:                 /share/apps/mricron/ver_2015_06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:/share/apps/c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/5.3.0/bin:/share/apps/freesurfer/5.3.0/fsfast/bin:/share/apps/freesurfer/5.3.0/tktools:/share/apps/fsl/5.0.8/bin:/share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5.0.8/bin:/share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/opt/sge/bin:/opt/sge/bin/lx-amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin:/opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/home/mgstauff/bin:/share/apps/R/R-3.1.1/bin:/share/apps/rstudio/rstudio-0.98.1091/bin/:/share/apps/ANTs/2014-06-23/build/bin/:/share/apps/matlab/R2014b/bin/:/share/apps/BrainVISA/brainvisa-Mandriva-2008.0-x86_64-4.4.0-2013_11_18:/share/apps/MIPAV/7.1.0_release:/share/apps/itksnap/itksnap-most-recent/bin/:/share/apps/MRtrix3/2016-04-25/mrtrix3/release/bin/:/share/apps/VoxBo/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/mgstauff
sge_o_host:                 chead
account:                    sge
hard resource_list:         h_stack=128m
mail_list:                  mgstauff@chead.local
notify:                     FALSE
job_name:                   myjobparam
jobshare:                   0
hard_queue_list:            all.q
env_list:                   TERM=NONE
job_args:                   5
script_file:                workshop-files/myjobparam
parallel environment:       int_test range: 2
binding:                    set linear:2
job_type:                   NONE
scheduling info:
    queue instance "gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
    queue instance "qlogin.gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-18.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-17.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-16.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-13.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-15.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-14.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-12.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-11.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-10.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-9.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-5.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-6.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-7.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-8.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-4.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-2.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-1.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-0.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-20.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-19.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-3.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-gpu-0.local" dropped because it is temporarily not available
    queue instance "qlogin.long.q@compute-0-20.local" dropped because it is full
    queue instance "qlogin.long.q@compute-0-19.local" dropped because it is full
    queue instance "qlogin.long.q@compute-gpu-0.local" dropped because it is full
    queue instance "basic.q@compute-1-2.local" dropped because it is full
    queue instance "himem.q@compute-0-13.local" dropped because it is full
    queue instance "himem.q@compute-0-4.local" dropped because it is full
    queue instance "himem.q@compute-0-2.local" dropped because it is full
    queue instance "himem.q@compute-0-12.local" dropped because it is full
    queue instance "himem.q@compute-0-17.local" dropped because it is full
    queue instance "himem.q@compute-0-3.local" dropped because it is full
    queue instance "himem.q@compute-0-8.local" dropped because it is full
    queue instance "himem.q@compute-0-5.local" dropped because it is full
    queue instance "himem.q@compute-0-11.local" dropped because it is full
    queue instance "himem.q@compute-0-15.local" dropped because it is full
    queue instance "himem.q@compute-0-7.local" dropped because it is full
    queue instance "himem.q@compute-0-14.local" dropped because it is full
    queue instance "himem.q@compute-0-18.local" dropped because it is full
    queue instance "himem.q@compute-0-10.local" dropped because it is full
    queue instance "himem.q@compute-0-6.local" dropped because it is full
    queue instance "himem.q@compute-gpu-0.local" dropped because it is full
    queue instance "himem.q@compute-0-16.local" dropped because it is full
    queue instance "himem.q@compute-0-9.local" dropped because it is full
    queue instance "himem.q@compute-0-0.local" dropped because it is full
    queue instance "himem.q@compute-0-1.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-13.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-4.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-2.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-12.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-17.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-3.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-8.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-5.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-11.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-15.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-7.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-14.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-18.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-10.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-6.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-gpu-0.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-16.local" dropped because it is full
"qlogin.himem.q@compute-0-9.local" dropped because it is full queue instance "qlogin.himem.q@compute-0-0.local" dropped because it is full queue instance "qlogin.himem.q@compute-0-1.local" dropped because it is full queue instance "qlogin.q@compute-0-20.local" dropped because it is full queue instance "qlogin.q@compute-0-19.local" dropped because it is full queue instance "qlogin.q@compute-gpu-0.local" dropped because it is full queue instance "qlogin.q@compute-0-7.local" dropped because it is full queue instance "all.q@compute-0-0.local" dropped because it is full cannot run in PE "int_test" because it only offers 0 slots [mgstauff@chead ~]$ qquota -u mgstauff resource quota rule limit filter -------------------------------------------------------------------------------- [mgstauff@chead ~]$ qconf -srqs limit_user_slots { name limit_user_slots description Limit the users' batch slots enabled TRUE limit users {pcook,mgstauff} queues {allalt.q} to slots=32 limit users {*} queues {allalt.q} to slots=0 limit users {*} queues {himem.q} to slots=6 limit users {*} queues {all.q,himem.q} to slots=32 limit users {*} queues {basic.q} to slots=40 } There are plenty of consumables available: [root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q a queuename qtype resv/used/tot. load_avg arch states --------------------------------------------------------------------------------- all.q@compute-0-0.local BP 0/4/4 5.24 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G qc:slots=0 --------------------------------------------------------------------------------- all.q@compute-0-1.local BP 0/10/15 9.58 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G qc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-10.local BP 0/9/16 9.80 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=7 --------------------------------------------------------------------------------- all.q@compute-0-11.local BP 0/11/16 9.18 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-12.local BP 0/11/16 9.72 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-13.local BP 0/10/16 9.14 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=6 --------------------------------------------------------------------------------- all.q@compute-0-14.local BP 0/10/16 9.66 lx-amd64 hc:h_vmem=28.890G hc:s_vmem=30.990G hc:slots=6 --------------------------------------------------------------------------------- all.q@compute-0-15.local BP 0/10/16 9.54 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=6 --------------------------------------------------------------------------------- all.q@compute-0-16.local BP 0/10/16 10.01 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=6 --------------------------------------------------------------------------------- all.q@compute-0-17.local BP 0/11/16 9.75 lx-amd64 hc:h_vmem=29.963G hc:s_vmem=32.960G hc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-18.local BP 0/11/16 10.29 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-19.local BP 0/9/14 9.01 lx-amd64 qf:h_vmem=5.000G qf:s_vmem=5.000G qc:slots=5 --------------------------------------------------------------------------------- 
all.q@compute-0-2.local        BP    0/10/15        9.24     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   qc:slots=5
---------------------------------------------------------------------------------
all.q@compute-0-20.local       BP    0/0/4          0.00     lx-amd64
   qf:h_vmem=3.200G
   qf:s_vmem=3.200G
   qc:slots=4
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BP    0/11/15        9.62     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   qc:slots=4
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BP    0/12/15        9.85     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   qc:slots=3
---------------------------------------------------------------------------------
all.q@compute-0-5.local        BP    0/12/15        10.18    lx-amd64
   hc:h_vmem=36.490G
   hc:s_vmem=39.390G
   qc:slots=3
---------------------------------------------------------------------------------
all.q@compute-0-6.local        BP    0/12/16        9.95     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   hc:slots=4
---------------------------------------------------------------------------------
all.q@compute-0-7.local        BP    0/10/16        9.59     lx-amd64
   hc:h_vmem=36.935G
   qf:s_vmem=40.000G
   hc:slots=5
---------------------------------------------------------------------------------
all.q@compute-0-8.local        BP    0/10/16        9.37     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   hc:slots=6
---------------------------------------------------------------------------------
all.q@compute-0-9.local        BP    0/10/16        9.38     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   hc:slots=6
---------------------------------------------------------------------------------
all.q@compute-gpu-0.local      BP    0/0/4          0.05     lx-amd64
   qf:h_vmem=3.200G
   qf:s_vmem=3.200G
   qc:slots=4


On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <becke...@mail.nih.gov> wrote:

> On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote:
>
>> SoGE 8.1.8
>>
>> Hi,
>>
>> I'm getting some queued jobs with scheduling info that includes this line
>> at the end:
>>
>> cannot run in PE "unihost" because it only offers 0 slots
>>
>> 'unihost' is the only PE I use. When users request multiple slots, they
>> use 'unihost':
>>
>> ... -binding linear:2 -pe unihost 2 ...
>>
>> What happens is that these jobs aren't running when it otherwise seems
>> like they should be, or they sit waiting in the queue for a long time
>> even when the user has plenty of quota available within the queue they've
>> requested, and there are enough resources available on the queue's nodes
>> (slots and vram are consumables).
>>
>> Any suggestions about how I might further understand this?
>
> This *exact* problem has bitten me in the past. It seems to crop up
> about every 3 years--long enough to remember it was a problem, and long
> enough to forget just what the [censored] I did to fix it.
>
> As I recall, it has little to do with actual PEs, but everything to do
> with complexes and resource requests.
>
> You might glean a bit more information by running "qsub -w p" (or "-w e").
>
> Take a look at these previous discussions:
>
> http://gridengine.org/pipermail/users/2011-November/001932.html
> http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700
>
>
> --
> Jesse Becker (Contractor)
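As mentioned above, here are the extra checks I've been poking at. To get more detail than the one-line "offers 0 slots" message, I've been triggering a scheduler monitoring run and reading the dispatch log it writes. I believe this is the standard way to do it, and the path assumes the default cell under /opt/sge, but please correct me if I'm using it wrong:

[root@chead ~]# qconf -tsm
[root@chead ~]# less /opt/sge/default/common/schedd_runlog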
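And since I suspect my fairshare attempt is involved, this is how I've been looking at the policy side of things. I'm not certain these are even the right places to check, so corrections are welcome (the username is just mine as an example):

[root@chead ~]# qconf -sstree              # share tree, if one is actually defined
[root@chead ~]# qconf -suser mgstauff      # the auto-created user's fshare/oticket values
[mgstauff@chead ~]$ qstat -ext -u mgstauff # ticket/priority columns for pending jobs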