I have some more information. We have two sets of exec hosts on the cluster: one set is in the host group/hostlist "@allhosts", which is assigned to the queue all.q; the other is in the host group "@basichosts", which is assigned to a queue called basic.q.
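For reference, this is how the groups and queues are wired up; the commands below are just the quick way I check it (output omitted):

  qconf -shgrp @allhosts              # hosts behind all.q
  qconf -shgrp @basichosts            # hosts behind basic.q
  qconf -sq all.q | grep hostlist     # -> @allhosts
  qconf -sq basic.q | grep hostlist   # -> @basichosts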
When we're having the trouble with multi-slot/core jobs not running for a user on all.q, the same jobs can be resubmitted (or moved via qalter) to basic.q, and they run immediately. I also made a duplicate of all.q called allalt.q, and the same problem happens there: jobs get stuck in the queue. When I change the hostlist in allalt.q from @allhosts to @basichosts, and nothing else, the stuck jobs run immediately. (Again, this is happening when there are plenty of resources reported available on the all.q hosts, and the user's quotas are either unused or not maxed out.) Rough commands for this test are sketched after the host definitions below.

Here are the definitions of a host from each of the groups.

A host from all.q's group, @allhosts, where jobs get stuck:

[root@chead ~]# qconf -se compute-0-1
hostname              compute-0-1.local
load_scaling          NONE
complex_values        h_vmem=125.49G,s_vmem=125.49G,slots=16.000000
load_values           arch=lx-amd64,num_proc=16,mem_total=64508.523438M, \
                      swap_total=31999.996094M,virtual_total=96508.519531M, \
                      m_topology=SCCCCCCCCSCCCCCCCC,m_socket=2,m_core=16, \
                      m_thread=16,load_avg=7.590000,load_short=7.660000, \
                      load_medium=7.590000,load_long=7.300000, \
                      mem_free=53815.035156M,swap_free=31834.675781M, \
                      virtual_free=85649.710938M,mem_used=10693.488281M, \
                      swap_used=165.320312M,virtual_used=10858.808594M, \
                      cpu=42.800000,m_topology_inuse=SccccCCCCSccCccCCC, \
                      np_load_avg=0.474375,np_load_short=0.478750, \
                      np_load_medium=0.474375,np_load_long=0.456250
processors            16
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE

And a host from basic.q's group, @basichosts, where jobs run immediately:

[root@chead ~]# qconf -se compute-1-0
hostname              compute-1-0.local
load_scaling          NONE
complex_values        h_vmem=19.02G,s_vmem=19.02G,slots=8.000000
load_values           arch=lx-amd64,num_proc=8,mem_total=16077.441406M, \
                      swap_total=3999.996094M,virtual_total=20077.437500M, \
                      m_topology=SCCCCSCCCC,m_socket=2,m_core=8,m_thread=8, \
                      load_avg=1.680000,load_short=2.420000, \
                      load_medium=1.680000,load_long=1.790000, \
                      mem_free=13408.687500M,swap_free=3973.464844M, \
                      virtual_free=17382.152344M,mem_used=2668.753906M, \
                      swap_used=26.531250M,virtual_used=2695.285156M, \
                      cpu=16.400000,m_topology_inuse=SccCCScCCC, \
                      np_load_avg=0.210000,np_load_short=0.302500, \
                      np_load_medium=0.210000,np_load_long=0.223750
processors            8
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
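To be concrete, the qalter / hostlist swap described above looks roughly like this (the job ID is just an example of a stuck job, and I believe qconf -mattr is equivalent to editing the hostlist via qconf -mq allalt.q, which is what I actually did):

  qstat -j 3714924                                   # stuck multi-slot job, "...offers 0 slots"
  qalter -q basic.q 3714924                          # re-point the pending job at basic.q -> it starts immediately
  qconf -mattr queue hostlist @basichosts allalt.q   # swap only the hostlist on the test queue -> stuck jobs start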
Here's the full complex config. 'slots' is listed as "YES" under consumable, whereas s_vmem and h_vmem are listed as "JOB". It seems like this should be OK, but maybe not? Also, 'slots' has an urgency of 1000, whereas the others have 0.

[root@chead ~]# qconf -sc
#name               shortcut   type        relop requestable consumable default  urgency
#----------------------------------------------------------------------------------------
arch                a          RESTRING    ==    YES         NO         NONE     0
calendar            c          RESTRING    ==    YES         NO         NONE     0
cpu                 cpu        DOUBLE      >=    YES         NO         0        0
display_win_gui     dwg        BOOL        ==    YES         NO         0        0
h_core              h_core     MEMORY      <=    YES         NO         0        0
h_cpu               h_cpu      TIME        <=    YES         NO         0:0:0    0
h_data              h_data     MEMORY      <=    YES         NO         0        0
h_fsize             h_fsize    MEMORY      <=    YES         NO         0        0
h_rss               h_rss      MEMORY      <=    YES         NO         0        0
h_rt                h_rt       TIME        <=    YES         NO         0:0:0    0
h_stack             h_stack    MEMORY      <=    YES         NO         0        0
h_vmem              h_vmem     MEMORY      <=    YES         JOB        3100M    0
hostname            h          HOST        ==    YES         NO         NONE     0
load_avg            la         DOUBLE      >=    NO          NO         0        0
load_long           ll         DOUBLE      >=    NO          NO         0        0
load_medium         lm         DOUBLE      >=    NO          NO         0        0
load_short          ls         DOUBLE      >=    NO          NO         0        0
m_core              core       INT         <=    YES         NO         0        0
m_socket            socket     INT         <=    YES         NO         0        0
m_thread            thread     INT         <=    YES         NO         0        0
m_topology          topo       RESTRING    ==    YES         NO         NONE     0
m_topology_inuse    utopo      RESTRING    ==    YES         NO         NONE     0
mem_free            mf         MEMORY      <=    YES         NO         0        0
mem_total           mt         MEMORY      <=    YES         NO         0        0
mem_used            mu         MEMORY      >=    YES         NO         0        0
min_cpu_interval    mci        TIME        <=    NO          NO         0:0:0    0
np_load_avg         nla        DOUBLE      >=    NO          NO         0        0
np_load_long        nll        DOUBLE      >=    NO          NO         0        0
np_load_medium      nlm        DOUBLE      >=    NO          NO         0        0
np_load_short       nls        DOUBLE      >=    NO          NO         0        0
num_proc            p          INT         ==    YES         NO         0        0
qname               q          RESTRING    ==    YES         NO         NONE     0
rerun               re         BOOL        ==    NO          NO         0        0
s_core              s_core     MEMORY      <=    YES         NO         0        0
s_cpu               s_cpu      TIME        <=    YES         NO         0:0:0    0
s_data              s_data     MEMORY      <=    YES         NO         0        0
s_fsize             s_fsize    MEMORY      <=    YES         NO         0        0
s_rss               s_rss      MEMORY      <=    YES         NO         0        0
s_rt                s_rt       TIME        <=    YES         NO         0:0:0    0
s_stack             s_stack    MEMORY      <=    YES         NO         0        0
s_vmem              s_vmem     MEMORY      <=    YES         JOB        3000M    0
seq_no              seq        INT         ==    NO          NO         0        0
slots               s          INT         <=    YES         YES        1        1000
swap_free           sf         MEMORY      <=    YES         NO         0        0
swap_rate           sr         MEMORY      >=    YES         NO         0        0
swap_rsvd           srsv       MEMORY      >=    YES         NO         0        0
swap_total          st         MEMORY      <=    YES         NO         0        0
swap_used           su         MEMORY      >=    YES         NO         0        0
tmpdir              tmp        RESTRING    ==    NO          NO         NONE     0
virtual_free        vf         MEMORY      <=    YES         NO         0        0
virtual_total       vt         MEMORY      <=    YES         NO         0        0
virtual_used        vu         MEMORY      >=    YES         NO         0        0

Does this info help at all in diagnosing? Is there any other config info that would help in understanding this?

-M
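P.S. In case it's useful, here's roughly how I've been checking the scheduler's view when one of these jobs is stuck. The job ID is just an example of a stuck job, and I believe qconf -tsm needs to be run as an SGE manager and writes the next scheduling pass to $SGE_ROOT/$SGE_CELL/common/schedd_runlog:

  qstat -j 3714924                         # scheduling info for the stuck job (schedd_job_info is true)
  qquota -u mgstauff                       # check whether any RQS limit has been reached
  qstat -F h_vmem,s_vmem,slots -q all.q    # per-instance consumable availability on all.q
  qconf -tsm                               # trigger one monitored scheduling run and dump its reasoning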
On Sun, Aug 13, 2017 at 12:11 PM, Michael Stauffer <mgsta...@gmail.com> wrote:

> Thanks for the reply Reuti, see below
>
> On Fri, Aug 11, 2017 at 7:18 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>
>> What I notice below: defining h_vmem/s_vmem on a queue level means per
>> job. Defining it on an exechost level means across all jobs. What is
>> different between:
>>
>> > ---------------------------------------------------------------------------------
>> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
>> >     qf:h_vmem=40.000G
>> >     qf:s_vmem=40.000G
>> >     hc:slots=6
>> > ---------------------------------------------------------------------------------
>> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
>> >     hc:h_vmem=28.890G
>> >     hc:s_vmem=30.990G
>> >     hc:slots=6
>>
>> qf = queue fixed
>> hc = host consumable
>>
>> What is the definition of h_vmem/s_vmem in `qconf -sc` and their default
>> consumptions?
>
> I thought this means that when it's showing qf, it's the per-job queue
> limit, i.e. the queue has a h_vmem and s_vmem limits for the job of 40G
> (which it does). And then hc is shown when the host resources are less
> than the per-job queue limit.
>
> [root@chead ~]# qconf -sc | grep vmem
> h_vmem              h_vmem     MEMORY      <=    YES         JOB        3100M    0
> s_vmem              s_vmem     MEMORY      <=    YES         JOB        3000M    0
>
>> > 'unihost' is the only PE I use. When users request multiple slots, they
>> > use 'unihost':
>> >
>> > qsub ... -binding linear:2 -pe unihost 2 ...
>> >
>> > What happens is that these jobs aren't running when it otherwise seems
>> > like they should be, or they sit waiting in the queue for a long time
>> > even when the user has plenty of quota available within the queue they've
>> > requested, and there are enough resources available on the queue's nodes
>> > per qhost (slots and vmem are consumables), and qquota isn't showing any
>> > rqs limits have been reached.
>> >
>> > Below I've dumped relevant configurations.
>> >
>> > Today I created a new PE called "int_test" to test the "integer"
>> > allocation rule. I set it to 16 (16 cores per node), and have also tried
>> > 8. It's been added as a PE to the queues we use. When I try to run to
>> > this new PE however, it *always* fails with the same "PE ...offers 0
>> > slots" error, even if I can run the same multi-slot job using "unihost"
>> > PE at the same time. I'm not sure if this helps debug or not.
>> >
>> > Another thought - this behavior started happening some time ago more or
>> > less when I tried implementing fairshare behavior. I never seemed to get
>> > fairshare working right. We haven't been able to confirm, but for some
>> > users it seems this "PE 0 slots" issue pops up only after they've been
>> > running other jobs for a little while. So I'm wondering if I've screwed
>> > up fairshare in some way that's causing this odd behavior.
>> >
>> > The default queue from global config file is all.q.
>>
>> There is no default queue in SGE. One specifies resource requests and SGE
>> will select an appropriate one. What do you refer to by this?
>>
>> Do you have any sge_request or private .sge_request?
>
> Yes, the global sge_request has '-q all.q'. I can't remember why this was
> done when I first set things up years ago - I think the cluster I was
> migrating from was set up that way and I just copied it.
>
> Given my qconf '-ssconf' and '-sconf' output below, does something look
> off with my fairshare setup (and subsequent attempt to disable it)? As I
> mentioned, I'm wondering if something went wrong with how I set it up
> because this intermittent behavior may have started at the same time.
>
> -M
>
>> > Here are various config dumps. Is there anything else that might be
>> > helpful?
>> >
>> > Thanks for any help! This has been plaguing me.
>> > >> > >> > [root@chead ~]# qconf -sp unihost >> > pe_name unihost >> > slots 9999 >> > user_lists NONE >> > xuser_lists NONE >> > start_proc_args /bin/true >> > stop_proc_args /bin/true >> > allocation_rule $pe_slots >> > control_slaves FALSE >> > job_is_first_task TRUE >> > urgency_slots min >> > accounting_summary FALSE >> > qsort_args NONE >> > >> > [root@chead ~]# qconf -sp int_test >> > pe_name int_test >> > slots 9999 >> > user_lists NONE >> > xuser_lists NONE >> > start_proc_args /bin/true >> > stop_proc_args /bin/true >> > allocation_rule 8 >> > control_slaves FALSE >> > job_is_first_task TRUE >> > urgency_slots min >> > accounting_summary FALSE >> > qsort_args NONE >> > >> > [root@chead ~]# qconf -ssconf >> > algorithm default >> > schedule_interval 0:0:5 >> > maxujobs 200 >> > queue_sort_method load >> > job_load_adjustments np_load_avg=0.50 >> > load_adjustment_decay_time 0:7:30 >> > load_formula np_load_avg >> > schedd_job_info true >> > flush_submit_sec 0 >> > flush_finish_sec 0 >> > params none >> > reprioritize_interval 0:0:0 >> > halftime 1 >> > usage_weight_list cpu=0.700000,mem=0.200000,io=0.100000 >> > compensation_factor 5.000000 >> > weight_user 0.250000 >> > weight_project 0.250000 >> > weight_department 0.250000 >> > weight_job 0.250000 >> > weight_tickets_functional 1000 >> > weight_tickets_share 100000 >> > share_override_tickets TRUE >> > share_functional_shares TRUE >> > max_functional_jobs_to_schedule 2000 >> > report_pjob_tickets TRUE >> > max_pending_tasks_per_job 100 >> > halflife_decay_list none >> > policy_hierarchy OS >> > weight_ticket 0.000000 >> > weight_waiting_time 1.000000 >> > weight_deadline 3600000.000000 >> > weight_urgency 0.100000 >> > weight_priority 1.000000 >> > max_reservation 0 >> > default_duration INFINITY >> > >> > [root@chead ~]# qconf -sconf >> > #global: >> > execd_spool_dir /opt/sge/default/spool >> > mailer /bin/mail >> > xterm /usr/bin/X11/xterm >> > load_sensor none >> > prolog none >> > epilog none >> > shell_start_mode posix_compliant >> > login_shells sh,bash,ksh,csh,tcsh >> > min_uid 0 >> > min_gid 0 >> > user_lists none >> > xuser_lists none >> > projects none >> > xprojects none >> > enforce_project false >> > enforce_user auto >> > load_report_time 00:00:40 >> > max_unheard 00:05:00 >> > reschedule_unknown 02:00:00 >> > loglevel log_warning >> > administrator_mail none >> > set_token_cmd none >> > pag_cmd none >> > token_extend_time none >> > shepherd_cmd none >> > qmaster_params none >> > execd_params ENABLE_BINDING=true >> > reporting_params accounting=true reporting=true \ >> > flush_time=00:00:15 joblog=true >> sharelog=00:00:00 >> > finished_jobs 100 >> > gid_range 20000-20100 >> > qlogin_command /opt/sge/bin/cfn-qlogin.sh >> > qlogin_daemon /usr/sbin/sshd -i >> > rlogin_command builtin >> > rlogin_daemon builtin >> > rsh_command builtin >> > rsh_daemon builtin >> > max_aj_instances 2000 >> > max_aj_tasks 75000 >> > max_u_jobs 4000 >> > max_jobs 0 >> > max_advance_reservations 0 >> > auto_user_oticket 0 >> > auto_user_fshare 100 >> > auto_user_default_project none >> > auto_user_delete_time 0 >> > delegated_file_staging false >> > reprioritize 0 >> > jsv_url none >> > jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w >> > >> > [root@chead ~]# qconf -sq all.q >> > qname all.q >> > hostlist @allhosts >> > seq_no 0 >> > load_thresholds np_load_avg=1.75 >> > suspend_thresholds NONE >> > nsuspend 1 >> > suspend_interval 00:05:00 >> > priority 0 >> > min_cpu_interval 00:05:00 >> > processors UNDEFINED >> > qtype BATCH >> > ckpt_list 
NONE >> > pe_list make mpich mpi orte unihost serial int_test >> unihost2 >> > rerun FALSE >> > slots 1,[compute-0-0.local=4],[compute-0-1.local=15], \ >> > [compute-0-2.local=15],[compute-0-3.local=15], \ >> > [compute-0-4.local=15],[compute-0-5.local=15], \ >> > [compute-0-6.local=16],[compute-0-7.local=16], \ >> > [compute-0-9.local=16],[compute-0-10.local=16], \ >> > [compute-0-11.local=16],[compute-0-12.local=16], >> \ >> > [compute-0-13.local=16],[compute-0-14.local=16], >> \ >> > [compute-0-15.local=16],[compute-0-16.local=16], >> \ >> > [compute-0-17.local=16],[compute-0-18.local=16], >> \ >> > [compute-0-8.local=16],[compute-0-19.local=14], \ >> > [compute-0-20.local=4],[compute-gpu-0.local=4] >> > tmpdir /tmp >> > shell /bin/bash >> > prolog NONE >> > epilog NONE >> > shell_start_mode posix_compliant >> > starter_method NONE >> > suspend_method NONE >> > resume_method NONE >> > terminate_method NONE >> > notify 00:00:60 >> > owner_list NONE >> > user_lists NONE >> > xuser_lists NONE >> > subordinate_list NONE >> > complex_values NONE >> > projects NONE >> > xprojects NONE >> > calendar NONE >> > initial_state default >> > s_rt INFINITY >> > h_rt INFINITY >> > s_cpu INFINITY >> > h_cpu INFINITY >> > s_fsize INFINITY >> > h_fsize INFINITY >> > s_data INFINITY >> > h_data INFINITY >> > s_stack INFINITY >> > h_stack INFINITY >> > s_core INFINITY >> > h_core INFINITY >> > s_rss INFINITY >> > h_rss INFINITY >> > s_vmem 40G,[compute-0-20.local=3.2G], \ >> > [compute-gpu-0.local=3.2G],[c >> ompute-0-19.local=5G] >> > h_vmem 40G,[compute-0-20.local=3.2G], \ >> > [compute-gpu-0.local=3.2G],[c >> ompute-0-19.local=5G] >> > >> > qstat -j on a stuck job as an example: >> > >> > [mgstauff@chead ~]$ qstat -j 3714924 >> > ============================================================== >> > job_number: 3714924 >> > exec_file: job_scripts/3714924 >> > submission_time: Fri Aug 11 12:48:47 2017 >> > owner: mgstauff >> > uid: 2198 >> > group: mgstauff >> > gid: 2198 >> > sge_o_home: /home/mgstauff >> > sge_o_log_name: mgstauff >> > sge_o_path: /share/apps/mricron/ver_2015_ >> 06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:/share/apps >> /c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/5. 
>> 3.0/bin:/share/apps/freesurfer/5.3.0/fsfast/bin:/share/apps/ >> freesurfer/5.3.0/tktools:/share/apps/fsl/5.0.8/bin:/ >> share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5.0.8/ >> bin:/share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/openmpi/ >> bin:/usr/lib64/qt-3.3/bin:/opt/sge/bin:/opt/sge/bin/lx- >> amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin:/ >> opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin:/usr/ >> local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/ >> opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/ >> bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/ >> hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fast >> a:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/ >> gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/ >> autodocksuite/bin:/opt/bio/wgs/bin:/opt/ganglia/bin:/opt/ >> ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/ >> bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/ >> home/mgstauff/bin:/share/apps/R/R-3.1.1/bin:/share/apps/ >> rstudio/rstudio-0.98.1091/bin/:/share/apps/ANTs/2014-06-23/ >> build/bin/:/share/apps/matlab/R2014b/bin/:/share/apps/ >> BrainVISA/brainvisa-Mandriva-2008.0-x86_64-4.4.0-2013_11_ >> 18:/share/apps/MIPAV/7.1.0_release:/share/apps/itksnap/ >> itksnap-most-recent/bin/:/share/apps/MRtrix3/2016-04-25/ >> mrtrix3/release/bin/:/share/apps/VoxBo/bin >> > sge_o_shell: /bin/bash >> > sge_o_workdir: /home/mgstauff >> > sge_o_host: chead >> > account: sge >> > hard resource_list: h_stack=128m >> > mail_list: mgstauff@chead.local >> > notify: FALSE >> > job_name: myjobparam >> > jobshare: 0 >> > hard_queue_list: all.q >> > env_list: TERM=NONE >> > job_args: 5 >> > script_file: workshop-files/myjobparam >> > parallel environment: int_test range: 2 >> > binding: set linear:2 >> > job_type: NONE >> > scheduling info: queue instance "gpu.q@compute-gpu-0.local" >> dropped because it is temporarily not available >> > queue instance >> "qlogin.gpu.q@compute-gpu-0.local" dropped because it is temporarily not >> available >> > queue instance "reboot.q@compute-0-18.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-17.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-16.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-13.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-15.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-14.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-12.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-11.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-10.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-9.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-5.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-6.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-7.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-8.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-4.local" >> dropped because it is temporarily 
not available >> > queue instance "reboot.q@compute-0-2.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-1.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-0.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-20.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-19.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-3.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-gpu-0.local" >> dropped because it is temporarily not available >> > queue instance >> "qlogin.long.q@compute-0-20.local" dropped because it is full >> > queue instance >> "qlogin.long.q@compute-0-19.local" dropped because it is full >> > queue instance >> "qlogin.long.q@compute-gpu-0.local" dropped because it is full >> > queue instance "basic.q@compute-1-2.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-13.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-4.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-2.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-12.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-17.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-3.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-8.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-5.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-11.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-15.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-7.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-14.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-18.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-10.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-6.local" >> dropped because it is full >> > queue instance "himem.q@compute-gpu-0.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-16.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-9.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-0.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-1.local" >> dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-13.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-4.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-2.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-12.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-17.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-3.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-8.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-5.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-11.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-15.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-7.local" dropped because it 
is full >> > queue instance >> "qlogin.himem.q@compute-0-14.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-18.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-10.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-6.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-gpu-0.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-16.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-9.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-0.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-1.local" dropped because it is full >> > queue instance "qlogin.q@compute-0-20.local" >> dropped because it is full >> > queue instance "qlogin.q@compute-0-19.local" >> dropped because it is full >> > queue instance "qlogin.q@compute-gpu-0.local" >> dropped because it is full >> > queue instance "qlogin.q@compute-0-7.local" >> dropped because it is full >> > queue instance "all.q@compute-0-0.local" >> dropped because it is full >> > cannot run in PE "int_test" because it only >> offers 0 slots >> > >> > [mgstauff@chead ~]$ qquota -u mgstauff >> > resource quota rule limit filter >> > ------------------------------------------------------------ >> -------------------- >> > >> > [mgstauff@chead ~]$ qconf -srqs limit_user_slots >> > { >> > name limit_user_slots >> > description Limit the users' batch slots >> > enabled TRUE >> > limit users {pcook,mgstauff} queues {allalt.q} to slots=32 >> > limit users {*} queues {allalt.q} to slots=0 >> > limit users {*} queues {himem.q} to slots=6 >> > limit users {*} queues {all.q,himem.q} to slots=32 >> > limit users {*} queues {basic.q} to slots=40 >> > } >> > >> > There are plenty of consumables available: >> > >> > [root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q a >> > queuename qtype resv/used/tot. 
load_avg arch >> states >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-0.local BP 0/4/4 5.24 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=0 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-1.local BP 0/10/15 9.58 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-10.local BP 0/9/16 9.80 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=7 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-11.local BP 0/11/16 9.18 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-12.local BP 0/11/16 9.72 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-13.local BP 0/10/16 9.14 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-14.local BP 0/10/16 9.66 lx-amd64 >> > hc:h_vmem=28.890G >> > hc:s_vmem=30.990G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-15.local BP 0/10/16 9.54 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-16.local BP 0/10/16 10.01 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-17.local BP 0/11/16 9.75 lx-amd64 >> > hc:h_vmem=29.963G >> > hc:s_vmem=32.960G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-18.local BP 0/11/16 10.29 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-19.local BP 0/9/14 9.01 lx-amd64 >> > qf:h_vmem=5.000G >> > qf:s_vmem=5.000G >> > qc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-2.local BP 0/10/15 9.24 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-20.local BP 0/0/4 0.00 lx-amd64 >> > qf:h_vmem=3.200G >> > qf:s_vmem=3.200G >> > qc:slots=4 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-3.local BP 0/11/15 9.62 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=4 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-4.local BP 0/12/15 9.85 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=3 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-5.local BP 0/12/15 10.18 lx-amd64 >> > hc:h_vmem=36.490G >> > hc:s_vmem=39.390G >> > qc:slots=3 >> > 
------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-6.local BP 0/12/16 9.95 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=4 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-7.local BP 0/10/16 9.59 lx-amd64 >> > hc:h_vmem=36.935G >> > qf:s_vmem=40.000G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-8.local BP 0/10/16 9.37 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-9.local BP 0/10/16 9.38 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-gpu-0.local BP 0/0/4 0.05 lx-amd64 >> > qf:h_vmem=3.200G >> > qf:s_vmem=3.200G >> > qc:slots=4 >> > >> > >> > On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <becke...@mail.nih.gov> >> wrote: >> > On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote: >> > SoGE 8.1.8 >> > >> > Hi, >> > >> > I'm getting some queued jobs with scheduling info that includes this >> line >> > at the end: >> > >> > cannot run in PE "unihost" because it only offers 0 slots >> > >> > 'unihost' is the only PE I use. When users request multiple slots, they >> use >> > 'unihost': >> > >> > ... -binding linear:2 -pe unihost 2 ... >> > >> > What happens is that these jobs aren't running when it otherwise seems >> like >> > they should be, or they sit waiting in the queue for a long time even >> when >> > the user has plenty of quota available within the queue they've >> requested, >> > and there are enough resources available on the queue's nodes (slots and >> > vram are consumables). >> > >> > Any suggestions about how I might further understand this? >> > >> > This *exact* problem has bitten me in the past. It seems to crop up >> > about every 3 years--long enough to remember it was a problem, and long >> > enough to forget just what the [censored] I did to fix it. >> > >> > As I recall, it has little to do with actual PEs, but everything to do >> > with complexes and resource requests. >> > >> > You might glean a bit more information by running "qsub -w p" (or "-w >> e"). >> > >> > Take a look at these previous discussions: >> > >> > http://gridengine.org/pipermail/users/2011-November/001932.html >> > http://comments.gmane.org/gmane.comp.clustering.opengridengi >> ne.user/1700 >> > >> > >> > -- >> > Jesse Becker (Contractor) >> > >> > _______________________________________________ >> > users mailing list >> > users@gridengine.org >> > https://gridengine.org/mailman/listinfo/users >> >> >