Thanks for the reply, Reuti. See below.

On Fri, Aug 11, 2017 at 7:18 PM, Reuti <re...@staff.uni-marburg.de> wrote:

>
> What I notice below: defining h_vmem/s_vmem on a queue level means per
> job. Defining it on an exechost level means across all jobs. What is
> different between:
>
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
> >         hc:h_vmem=28.890G
> >         hc:s_vmem=30.990G
> >         hc:slots=6
>
>
> qf = queue fixed
> hc = host consumable
>
> What is the definition of h_vmem/s_vmem in `qconf -sc` and their default
> consumptions?
>

I thought that when qstat shows qf, it's reporting the per-job queue limit, i.e. the
queue sets h_vmem and s_vmem limits of 40G per job (which it does), and that hc is
shown instead when the host's remaining resources are less than that per-job queue
limit.

[root@chead ~]# qconf -sc | grep vmem
h_vmem              h_vmem     MEMORY      <=    YES         JOB        3100M    0
s_vmem              s_vmem     MEMORY      <=    YES         JOB        3000M    0
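
My understanding (possibly wrong) is that the hc: values show up on hosts where
h_vmem/s_vmem are also set in complex_values on the exec host itself, on top of the
queue limit. This is how I've been checking that, using compute-0-14 from the qstat
output as an example:

[root@chead ~]# qconf -se compute-0-14.local    # look for complex_values h_vmem=...,s_vmem=...
[root@chead ~]# qconf -me compute-0-14.local    # edit the host-level values if they need changing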

> > 'unihost' is the only PE I use. When users request multiple slots, they
> > use 'unihost':
> >
> > qsub ... -binding linear:2 -pe unihost 2 ...
> >
> > What happens is that these jobs aren't running when it otherwise seems
> like they should be, or they sit waiting in the queue for a long time even
> when the user has plenty of quota available within the queue they've
> requested, and there are enough resources available on the queue's nodes
> per qhost(slots and vmem are consumables), and qquota isn't showing any rqs
> limits have been reached.
> >
> > Below I've dumped relevant configurations.
> >
> > Today I created a new PE called "int_test" to test the "integer"
> allocation rule. I set it to 16 (16 cores per node), and have also tried 8.
> It's been added as a PE to the queues we use. When I try to run to this new
> PE however, it *always* fails with the same "PE ...offers 0 slots" error,
> even if I can run the same multi-slot job using "unihost" PE at the same
> time. I'm not sure if this helps debug or not.
> >
> > Another thought - this behavior started happening some time ago more or
> less when I tried implementing fairshare behavior. I never seemed to get
> fairshare working right. We haven't been able to confirm, but for some
> users it seems this "PE 0 slots" issue pops up only after they've been
> running other jobs for a little while. So I'm wondering if I've screwed up
> fairshare in some way that's causing this odd behavior.
> >
> > The default queue from global config file is all.q.
>
> There is no default queue in SGE. One specifies resource requests and SGE
> will select an appropriate one. What do you refer to by this?
>
> Do you have any sge_request or private .sge_request?
>

Yes, the global sge_request has '-q all.q'. I can't remember why this was
done when I first set things up years ago - I think the cluster I was
migrating from was set up that way and I just copied it.

Given my qconf '-ssconf' and '-sconf' output below, does something look off with my
fairshare setup (and my subsequent attempt to disable it)? As I mentioned, I'm
wondering if I got that setup wrong, because this intermittent behavior may have
started around the same time.
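
In case it's useful, these are the knobs I've been looking at for the fairshare
setup; I'm not sure this is the right way to back it out completely:

[root@chead ~]# qconf -sstree                   # show the share tree, if one exists
[root@chead ~]# qconf -ssconf | grep weight     # ticket/priority weights (full dump below)

To fully disable it I think I'd set weight_tickets_share and weight_tickets_functional
to 0 via 'qconf -msconf' and remove the share tree with 'qconf -dstree', but I'd
appreciate confirmation that that's the right approach.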

-M

>
> > Here are various config dumps. Is there anything else that might be
> helpful?
> >
> > Thanks for any help! This has been plaguing me.
> >
> >
> > [root@chead ~]# qconf -sp unihost
> > pe_name            unihost
> > slots              9999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $pe_slots
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -sp int_test
> > pe_name            int_test
> > slots              9999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    8
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -ssconf
> > algorithm                         default
> > schedule_interval                 0:0:5
> > maxujobs                          200
> > queue_sort_method                 load
> > job_load_adjustments              np_load_avg=0.50
> > load_adjustment_decay_time        0:7:30
> > load_formula                      np_load_avg
> > schedd_job_info                   true
> > flush_submit_sec                  0
> > flush_finish_sec                  0
> > params                            none
> > reprioritize_interval             0:0:0
> > halftime                          1
> > usage_weight_list                 cpu=0.700000,mem=0.200000,io=0.100000
> > compensation_factor               5.000000
> > weight_user                       0.250000
> > weight_project                    0.250000
> > weight_department                 0.250000
> > weight_job                        0.250000
> > weight_tickets_functional         1000
> > weight_tickets_share              100000
> > share_override_tickets            TRUE
> > share_functional_shares           TRUE
> > max_functional_jobs_to_schedule   2000
> > report_pjob_tickets               TRUE
> > max_pending_tasks_per_job         100
> > halflife_decay_list               none
> > policy_hierarchy                  OS
> > weight_ticket                     0.000000
> > weight_waiting_time               1.000000
> > weight_deadline                   3600000.000000
> > weight_urgency                    0.100000
> > weight_priority                   1.000000
> > max_reservation                   0
> > default_duration                  INFINITY
> >
> > [root@chead ~]# qconf -sconf
> > #global:
> > execd_spool_dir              /opt/sge/default/spool
> > mailer                       /bin/mail
> > xterm                        /usr/bin/X11/xterm
> > load_sensor                  none
> > prolog                       none
> > epilog                       none
> > shell_start_mode             posix_compliant
> > login_shells                 sh,bash,ksh,csh,tcsh
> > min_uid                      0
> > min_gid                      0
> > user_lists                   none
> > xuser_lists                  none
> > projects                     none
> > xprojects                    none
> > enforce_project              false
> > enforce_user                 auto
> > load_report_time             00:00:40
> > max_unheard                  00:05:00
> > reschedule_unknown           02:00:00
> > loglevel                     log_warning
> > administrator_mail           none
> > set_token_cmd                none
> > pag_cmd                      none
> > token_extend_time            none
> > shepherd_cmd                 none
> > qmaster_params               none
> > execd_params                 ENABLE_BINDING=true
> > reporting_params             accounting=true reporting=true \
> >                              flush_time=00:00:15 joblog=true
> sharelog=00:00:00
> > finished_jobs                100
> > gid_range                    20000-20100
> > qlogin_command               /opt/sge/bin/cfn-qlogin.sh
> > qlogin_daemon                /usr/sbin/sshd -i
> > rlogin_command               builtin
> > rlogin_daemon                builtin
> > rsh_command                  builtin
> > rsh_daemon                   builtin
> > max_aj_instances             2000
> > max_aj_tasks                 75000
> > max_u_jobs                   4000
> > max_jobs                     0
> > max_advance_reservations     0
> > auto_user_oticket            0
> > auto_user_fshare             100
> > auto_user_default_project    none
> > auto_user_delete_time        0
> > delegated_file_staging       false
> > reprioritize                 0
> > jsv_url                      none
> > jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
> >
> > [root@chead ~]# qconf -sq all.q
> > qname                 all.q
> > hostlist              @allhosts
> > seq_no                0
> > load_thresholds       np_load_avg=1.75
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH
> > ckpt_list             NONE
> > pe_list               make mpich mpi orte unihost serial int_test
> unihost2
> > rerun                 FALSE
> > slots                 1,[compute-0-0.local=4],[compute-0-1.local=15], \
> >                       [compute-0-2.local=15],[compute-0-3.local=15], \
> >                       [compute-0-4.local=15],[compute-0-5.local=15], \
> >                       [compute-0-6.local=16],[compute-0-7.local=16], \
> >                       [compute-0-9.local=16],[compute-0-10.local=16], \
> >                       [compute-0-11.local=16],[compute-0-12.local=16], \
> >                       [compute-0-13.local=16],[compute-0-14.local=16], \
> >                       [compute-0-15.local=16],[compute-0-16.local=16], \
> >                       [compute-0-17.local=16],[compute-0-18.local=16], \
> >                       [compute-0-8.local=16],[compute-0-19.local=14], \
> >                       [compute-0-20.local=4],[compute-gpu-0.local=4]
> > tmpdir                /tmp
> > shell                 /bin/bash
> > prolog                NONE
> > epilog                NONE
> > shell_start_mode      posix_compliant
> > starter_method        NONE
> > suspend_method        NONE
> > resume_method         NONE
> > terminate_method      NONE
> > notify                00:00:60
> > owner_list            NONE
> > user_lists            NONE
> > xuser_lists           NONE
> > subordinate_list      NONE
> > complex_values        NONE
> > projects              NONE
> > xprojects             NONE
> > calendar              NONE
> > initial_state         default
> > s_rt                  INFINITY
> > h_rt                  INFINITY
> > s_cpu                 INFINITY
> > h_cpu                 INFINITY
> > s_fsize               INFINITY
> > h_fsize               INFINITY
> > s_data                INFINITY
> > h_data                INFINITY
> > s_stack               INFINITY
> > h_stack               INFINITY
> > s_core                INFINITY
> > h_core                INFINITY
> > s_rss                 INFINITY
> > h_rss                 INFINITY
> > s_vmem                40G,[compute-0-20.local=3.2G], \
> >                       [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
> > h_vmem                40G,[compute-0-20.local=3.2G], \
> >                       [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
> >
> > qstat -j on a stuck job as an example:
> >
> > [mgstauff@chead ~]$ qstat -j 3714924
> > ==============================================================
> > job_number:                 3714924
> > exec_file:                  job_scripts/3714924
> > submission_time:            Fri Aug 11 12:48:47 2017
> > owner:                      mgstauff
> > uid:                        2198
> > group:                      mgstauff
> > gid:                        2198
> > sge_o_home:                 /home/mgstauff
> > sge_o_log_name:             mgstauff
> > sge_o_path:                 /share/apps/mricron/ver_2015_
> 06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:/share/
> apps/c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/
> 5.3.0/bin:/share/apps/freesurfer/5.3.0/fsfast/bin:/
> share/apps/freesurfer/5.3.0/tktools:/share/apps/fsl/5.0.8/
> bin:/share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5.
> 0.8/bin:/share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/
> openmpi/bin:/usr/lib64/qt-3.3/bin:/opt/sge/bin:/opt/sge/bin/
> lx-amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin:
> /opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin:
> /usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/
> sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/
> EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/
> bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/
> fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/
> bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/
> bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/ganglia/bin:/
> opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/
> opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/
> srvadmin/bin:/home/mgstauff/bin:/share/apps/R/R-3.1.1/bin:
> /share/apps/rstudio/rstudio-0.98.1091/bin/:/share/apps/ANTs/
> 2014-06-23/build/bin/:/share/apps/matlab/R2014b/bin/:/
> share/apps/BrainVISA/brainvisa-Mandriva-2008.0-x86_
> 64-4.4.0-2013_11_18:/share/apps/MIPAV/7.1.0_release:/
> share/apps/itksnap/itksnap-most-recent/bin/:/share/apps/
> MRtrix3/2016-04-25/mrtrix3/release/bin/:/share/apps/VoxBo/bin
> > sge_o_shell:                /bin/bash
> > sge_o_workdir:              /home/mgstauff
> > sge_o_host:                 chead
> > account:                    sge
> > hard resource_list:         h_stack=128m
> > mail_list:                  mgstauff@chead.local
> > notify:                     FALSE
> > job_name:                   myjobparam
> > jobshare:                   0
> > hard_queue_list:            all.q
> > env_list:                   TERM=NONE
> > job_args:                   5
> > script_file:                workshop-files/myjobparam
> > parallel environment:  int_test range: 2
> > binding:                    set linear:2
> > job_type:                   NONE
> > scheduling info:            queue instance "gpu.q@compute-gpu-0.local"
> dropped because it is temporarily not available
> >                             queue instance 
> > "qlogin.gpu.q@compute-gpu-0.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-18.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-17.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-16.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-13.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-15.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-14.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-12.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-11.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-10.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-9.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-5.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-6.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-7.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-8.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-4.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-2.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-1.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-0.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-20.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-19.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-0-3.local"
> dropped because it is temporarily not available
> >                             queue instance "reboot.q@compute-gpu-0.local"
> dropped because it is temporarily not available
> >                             queue instance 
> > "qlogin.long.q@compute-0-20.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.long.q@compute-0-19.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.long.q@compute-gpu-0.local"
> dropped because it is full
> >                             queue instance "basic.q@compute-1-2.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-13.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-4.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-2.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-12.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-17.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-3.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-8.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-5.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-11.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-15.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-7.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-14.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-18.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-10.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-6.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-gpu-0.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-16.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-9.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-0.local"
> dropped because it is full
> >                             queue instance "himem.q@compute-0-1.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-13.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-4.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-2.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-12.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-17.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-3.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-8.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-5.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-11.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-15.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-7.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-14.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-18.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-10.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-6.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-gpu-0.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-16.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-9.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-0.local"
> dropped because it is full
> >                             queue instance 
> > "qlogin.himem.q@compute-0-1.local"
> dropped because it is full
> >                             queue instance "qlogin.q@compute-0-20.local"
> dropped because it is full
> >                             queue instance "qlogin.q@compute-0-19.local"
> dropped because it is full
> >                             queue instance "qlogin.q@compute-gpu-0.local"
> dropped because it is full
> >                             queue instance "qlogin.q@compute-0-7.local"
> dropped because it is full
> >                             queue instance "all.q@compute-0-0.local"
> dropped because it is full
> >                             cannot run in PE "int_test" because it only
> offers 0 slots
> >
> > [mgstauff@chead ~]$ qquota -u mgstauff
> > resource quota rule limit                filter
> > ------------------------------------------------------------
> --------------------
> >
> > [mgstauff@chead ~]$ qconf -srqs limit_user_slots
> > {
> >    name         limit_user_slots
> >    description  Limit the users' batch slots
> >    enabled      TRUE
> >    limit        users {pcook,mgstauff} queues {allalt.q} to slots=32
> >    limit        users {*} queues {allalt.q} to slots=0
> >    limit        users {*} queues {himem.q} to slots=6
> >    limit        users {*} queues {all.q,himem.q} to slots=32
> >    limit        users {*} queues {basic.q} to slots=40
> > }
> >
> > There are plenty of consumables available:
> >
> > [root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q a
> > queuename                      qtype resv/used/tot. load_avg arch
>   states
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-0.local        BP    0/4/4          5.24     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=0
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-1.local        BP    0/10/15        9.58     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=5
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-10.local       BP    0/9/16         9.80     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=7
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-11.local       BP    0/11/16        9.18     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=5
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-12.local       BP    0/11/16        9.72     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=5
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
> >         hc:h_vmem=28.890G
> >         hc:s_vmem=30.990G
> >         hc:slots=6
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-15.local       BP    0/10/16        9.54     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-16.local       BP    0/10/16        10.01    lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-17.local       BP    0/11/16        9.75     lx-amd64
> >         hc:h_vmem=29.963G
> >         hc:s_vmem=32.960G
> >         hc:slots=5
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-18.local       BP    0/11/16        10.29    lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=5
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-19.local       BP    0/9/14         9.01     lx-amd64
> >         qf:h_vmem=5.000G
> >         qf:s_vmem=5.000G
> >         qc:slots=5
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-2.local        BP    0/10/15        9.24     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=5
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-20.local       BP    0/0/4          0.00     lx-amd64
> >         qf:h_vmem=3.200G
> >         qf:s_vmem=3.200G
> >         qc:slots=4
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-3.local        BP    0/11/15        9.62     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=4
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-4.local        BP    0/12/15        9.85     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=3
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-5.local        BP    0/12/15        10.18    lx-amd64
> >         hc:h_vmem=36.490G
> >         hc:s_vmem=39.390G
> >         qc:slots=3
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-6.local        BP    0/12/16        9.95     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=4
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-7.local        BP    0/10/16        9.59     lx-amd64
> >         hc:h_vmem=36.935G
> >         qf:s_vmem=40.000G
> >         hc:slots=5
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-8.local        BP    0/10/16        9.37     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-0-9.local        BP    0/10/16        9.38     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ------------------------------------------------------------
> ---------------------
> > all.q@compute-gpu-0.local      BP    0/0/4          0.05     lx-amd64
> >         qf:h_vmem=3.200G
> >         qf:s_vmem=3.200G
> >         qc:slots=4
> >
> >
> > On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <becke...@mail.nih.gov>
> wrote:
> > On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote:
> > SoGE 8.1.8
> >
> > Hi,
> >
> > I'm getting some queued jobs with scheduling info that includes this line
> > at the end:
> >
> > cannot run in PE "unihost" because it only offers 0 slots
> >
> > 'unihost' is the only PE I use. When users request multiple slots, they
> use
> > 'unihost':
> >
> > ... -binding linear:2 -pe unihost 2 ...
> >
> > What happens is that these jobs aren't running when it otherwise seems
> like
> > they should be, or they sit waiting in the queue for a long time even
> when
> > the user has plenty of quota available within the queue they've
> requested,
> > and there are enough resources available on the queue's nodes (slots and
> > vmem are consumables).
> >
> > Any suggestions about how I might further understand this?
> >
> > This *exact* problem has bitten me in the past.  It seems to crop up
> > about every 3 years--long enough to remember it was a problem, and long
> > enough to forget just what the [censored] I did to fix it.
> >
> > As I recall, it has little to do with actual PEs, but everything to do
> > with complexes and resource requests.
> >
> > You might glean a bit more information by running "qsub -w p" (or "-w
> e").
> >
> > Take a look at these previous discussions:
> >
> > http://gridengine.org/pipermail/users/2011-November/001932.html
> > http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700
> >
> >
> > --
> > Jesse Becker (Contractor)
> >
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
