> On 13.08.2017 at 18:11, Michael Stauffer <mgsta...@gmail.com> wrote:
> 
> Thanks for the reply Reuti, see below
> 
> On Fri, Aug 11, 2017 at 7:18 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> 
> > What I notice below: defining h_vmem/s_vmem on a queue level means per job.
> > Defining it on an exechost level means across all jobs. What is different
> > between:
> > 
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
> >         hc:h_vmem=28.890G
> >         hc:s_vmem=30.990G
> >         hc:slots=6
> > 
> > qf = queue fixed
> > hc = host consumable
> > 
> > What is the definition of h_vmem/s_vmem in `qconf -sc` and their default
> > consumptions?
> 
> I thought this means that when it's showing qf, it's the per-job queue limit,
> i.e. the queue has h_vmem and s_vmem limits of 40G for the job (which it
> does). And then hc is shown when the host resources are less than the
> per-job queue limit.
Yes, the lower limit should be shown. So it's defined on both sides: exechost and queue?

-- Reuti

> [root@chead ~]# qconf -sc | grep vmem
> h_vmem              h_vmem     MEMORY    <=    YES         JOB        3100M    0
> s_vmem              s_vmem     MEMORY    <=    YES         JOB        3000M    0
> 
> > 'unihost' is the only PE I use. When users request multiple slots, they use
> > 'unihost':
> > 
> > qsub ... -binding linear:2 -pe unihost 2 ...
> > 
> > What happens is that these jobs aren't running when it otherwise seems like
> > they should be, or they sit waiting in the queue for a long time even when
> > the user has plenty of quota available within the queue they've requested,
> > there are enough resources available on the queue's nodes per qhost (slots
> > and vmem are consumables), and qquota isn't showing that any RQS limits
> > have been reached.
> > 
> > Below I've dumped relevant configurations.
> > 
> > Today I created a new PE called "int_test" to test the "integer" allocation
> > rule. I set it to 16 (16 cores per node), and have also tried 8. It's been
> > added as a PE to the queues we use. When I try to run in this new PE,
> > however, it *always* fails with the same "PE ... offers 0 slots" error,
> > even if I can run the same multi-slot job using the "unihost" PE at the
> > same time. I'm not sure if this helps debug or not.
> > 
> > Another thought - this behavior started some time ago, more or less when I
> > tried implementing fairshare. I never seemed to get fairshare working
> > right. We haven't been able to confirm it, but for some users this
> > "PE 0 slots" issue seems to pop up only after they've been running other
> > jobs for a little while. So I'm wondering if I've screwed up fairshare in
> > some way that's causing this odd behavior.
> > 
> > The default queue from global config file is all.q.
> 
> > There is no default queue in SGE. One specifies resource requests and SGE
> > will select an appropriate one. What do you refer to by this?
> > 
> > Do you have any sge_request or private .sge_request?
> 
> Yes, the global sge_request has '-q all.q'. I can't remember why this was
> done when I first set things up years ago - I think the cluster I was
> migrating from was set up that way and I just copied it.
> 
> Given my qconf '-ssconf' and '-sconf' output below, does something look off
> with my fairshare setup (and my subsequent attempt to disable it)? As I
> mentioned, I'm wondering if something went wrong with how I set it up,
> because this intermittent behavior may have started at the same time.
> 
> -M
> 
> > Here are various config dumps. Is there anything else that might be helpful?
> > 
> > Thanks for any help! This has been plaguing me.
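A quick way to see on which side each limit is attached (a sketch, using the queue and host names from the output above): the queue-level value is a per-job limit, the exechost-level complex_values entry is a consumable shared by every job on that host, and since the complex is defined with consumable=JOB the requested amount is debited once per job rather than once per slot.

    # per-job limit attached to the queue (shows up as "qf:" in qstat -F)
    qconf -sq all.q | grep -E '^[hs]_vmem'

    # per-host consumable attached to the exec host (shows up as "hc:" in qstat -F),
    # e.g. for one of the hosts reporting "hc:" values above
    qconf -se compute-0-14 | grep complex_values

    # the complex definition itself: consumable=JOB, defaults 3100M/3000M
    qconf -sc | grep vmem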
> > [root@chead ~]# qconf -sp unihost
> > pe_name            unihost
> > slots              9999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $pe_slots
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> > 
> > [root@chead ~]# qconf -sp int_test
> > pe_name            int_test
> > slots              9999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    8
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
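One hedged observation on the int_test definition above (I haven't tried it against this cluster): with a fixed allocation_rule of 8 (or 16), SGE places exactly that many slots on every host it uses for a job, so the total slot count requested has to be a multiple of the rule. The stuck job further down requests "-pe int_test 2", which can never be satisfied under that rule and is reported as "offers 0 slots". "$pe_slots" (as in unihost) has no such restriction; it only requires that all slots fit on one host. A sketch of requests that fit each rule ("myjob.sh" is a placeholder script name):

    # $pe_slots (unihost): any slot count that fits on a single host
    qsub -binding linear:2 -pe unihost 2 myjob.sh

    # allocation_rule 8 (int_test): exactly 8 slots per host, so request a
    # multiple of 8 -- 8 stays on one host, 16 would span two hosts, and so on
    qsub -binding linear:8 -pe int_test 8 myjob.sh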
> > [root@chead ~]# qconf -ssconf
> > algorithm                         default
> > schedule_interval                 0:0:5
> > maxujobs                          200
> > queue_sort_method                 load
> > job_load_adjustments              np_load_avg=0.50
> > load_adjustment_decay_time        0:7:30
> > load_formula                      np_load_avg
> > schedd_job_info                   true
> > flush_submit_sec                  0
> > flush_finish_sec                  0
> > params                            none
> > reprioritize_interval             0:0:0
> > halftime                          1
> > usage_weight_list                 cpu=0.700000,mem=0.200000,io=0.100000
> > compensation_factor               5.000000
> > weight_user                       0.250000
> > weight_project                    0.250000
> > weight_department                 0.250000
> > weight_job                        0.250000
> > weight_tickets_functional         1000
> > weight_tickets_share              100000
> > share_override_tickets            TRUE
> > share_functional_shares           TRUE
> > max_functional_jobs_to_schedule   2000
> > report_pjob_tickets               TRUE
> > max_pending_tasks_per_job         100
> > halflife_decay_list               none
> > policy_hierarchy                  OS
> > weight_ticket                     0.000000
> > weight_waiting_time               1.000000
> > weight_deadline                   3600000.000000
> > weight_urgency                    0.100000
> > weight_priority                   1.000000
> > max_reservation                   0
> > default_duration                  INFINITY
> > 
> > [root@chead ~]# qconf -sconf
> > #global:
> > execd_spool_dir              /opt/sge/default/spool
> > mailer                       /bin/mail
> > xterm                        /usr/bin/X11/xterm
> > load_sensor                  none
> > prolog                       none
> > epilog                       none
> > shell_start_mode             posix_compliant
> > login_shells                 sh,bash,ksh,csh,tcsh
> > min_uid                      0
> > min_gid                      0
> > user_lists                   none
> > xuser_lists                  none
> > projects                     none
> > xprojects                    none
> > enforce_project              false
> > enforce_user                 auto
> > load_report_time             00:00:40
> > max_unheard                  00:05:00
> > reschedule_unknown           02:00:00
> > loglevel                     log_warning
> > administrator_mail           none
> > set_token_cmd                none
> > pag_cmd                      none
> > token_extend_time            none
> > shepherd_cmd                 none
> > qmaster_params               none
> > execd_params                 ENABLE_BINDING=true
> > reporting_params             accounting=true reporting=true \
> >                              flush_time=00:00:15 joblog=true sharelog=00:00:00
> > finished_jobs                100
> > gid_range                    20000-20100
> > qlogin_command               /opt/sge/bin/cfn-qlogin.sh
> > qlogin_daemon                /usr/sbin/sshd -i
> > rlogin_command               builtin
> > rlogin_daemon                builtin
> > rsh_command                  builtin
> > rsh_daemon                   builtin
> > max_aj_instances             2000
> > max_aj_tasks                 75000
> > max_u_jobs                   4000
> > max_jobs                     0
> > max_advance_reservations     0
> > auto_user_oticket            0
> > auto_user_fshare             100
> > auto_user_default_project    none
> > auto_user_delete_time        0
> > delegated_file_staging       false
> > reprioritize                 0
> > jsv_url                      none
> > jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
> > 
> > [root@chead ~]# qconf -sq all.q
> > qname                 all.q
> > hostlist              @allhosts
> > seq_no                0
> > load_thresholds       np_load_avg=1.75
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH
> > ckpt_list             NONE
> > pe_list               make mpich mpi orte unihost serial int_test unihost2
> > rerun                 FALSE
> > slots                 1,[compute-0-0.local=4],[compute-0-1.local=15], \
> >                       [compute-0-2.local=15],[compute-0-3.local=15], \
> >                       [compute-0-4.local=15],[compute-0-5.local=15], \
> >                       [compute-0-6.local=16],[compute-0-7.local=16], \
> >                       [compute-0-9.local=16],[compute-0-10.local=16], \
> >                       [compute-0-11.local=16],[compute-0-12.local=16], \
> >                       [compute-0-13.local=16],[compute-0-14.local=16], \
> >                       [compute-0-15.local=16],[compute-0-16.local=16], \
> >                       [compute-0-17.local=16],[compute-0-18.local=16], \
> >                       [compute-0-8.local=16],[compute-0-19.local=14], \
> >                       [compute-0-20.local=4],[compute-gpu-0.local=4]
> > tmpdir                /tmp
> > shell                 /bin/bash
> > prolog                NONE
> > epilog                NONE
> > shell_start_mode      posix_compliant
> > starter_method        NONE
> > suspend_method        NONE
> > resume_method         NONE
> > terminate_method      NONE
> > notify                00:00:60
> > owner_list            NONE
> > user_lists            NONE
> > xuser_lists           NONE
> > subordinate_list      NONE
> > complex_values        NONE
> > projects              NONE
> > xprojects             NONE
> > calendar              NONE
> > initial_state         default
> > s_rt                  INFINITY
> > h_rt                  INFINITY
> > s_cpu                 INFINITY
> > h_cpu                 INFINITY
> > s_fsize               INFINITY
> > h_fsize               INFINITY
> > s_data                INFINITY
> > h_data                INFINITY
> > s_stack               INFINITY
> > h_stack               INFINITY
> > s_core                INFINITY
> > h_core                INFINITY
> > s_rss                 INFINITY
> > h_rss                 INFINITY
> > s_vmem                40G,[compute-0-20.local=3.2G], \
> >                       [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
> > h_vmem                40G,[compute-0-20.local=3.2G], \
> >                       [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
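On the fairshare question, reading only the -ssconf dump above (so take this as a sketch, not a verdict): weight_ticket is 0.000000, which means the share-tree and functional ticket pools (weight_tickets_share 100000, weight_tickets_functional 1000) currently contribute nothing to job priority, and priority comes only from the POSIX priority and urgency terms. As far as I can see, a half-disabled fairshare setup like this can reorder pending jobs but can't make a PE report 0 slots. If you want to re-enable or tune it later:

    # Scheduler policy knobs from the dump above; with weight_ticket at 0 the
    # ticket (fairshare) policies are effectively switched off:
    #   weight_ticket              0.000000
    #   weight_tickets_functional  1000
    #   weight_tickets_share       100000

    # edit interactively ...
    qconf -msconf

    # ... or dump, tweak and re-load the whole scheduler configuration
    # ("sched.conf" is a placeholder file name)
    qconf -ssconf > sched.conf
    $EDITOR sched.conf          # e.g. raise weight_ticket to give tickets weight again
    qconf -Msconf sched.conf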
> > qstat -j on a stuck job as an example:
> > 
> > [mgstauff@chead ~]$ qstat -j 3714924
> > ==============================================================
> > job_number:                 3714924
> > exec_file:                  job_scripts/3714924
> > submission_time:            Fri Aug 11 12:48:47 2017
> > owner:                      mgstauff
> > uid:                        2198
> > group:                      mgstauff
> > gid:                        2198
> > sge_o_home:                 /home/mgstauff
> > sge_o_log_name:             mgstauff
> > sge_o_path:                 /share/apps/mricron/ver_2015_06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:
> >     /share/apps/c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/5.3.0/bin:
> >     /share/apps/freesurfer/5.3.0/fsfast/bin:/share/apps/freesurfer/5.3.0/tktools:
> >     /share/apps/fsl/5.0.8/bin:/share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5.0.8/bin:
> >     /share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:
> >     /opt/sge/bin:/opt/sge/bin/lx-amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin:
> >     /opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin:/usr/local/bin:/bin:/usr/bin:
> >     /usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:
> >     /opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:
> >     /opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:
> >     /opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:
> >     /opt/bio/wgs/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:
> >     /opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/home/mgstauff/bin:
> >     /share/apps/R/R-3.1.1/bin:/share/apps/rstudio/rstudio-0.98.1091/bin/:
> >     /share/apps/ANTs/2014-06-23/build/bin/:/share/apps/matlab/R2014b/bin/:
> >     /share/apps/BrainVISA/brainvisa-Mandriva-2008.0-x86_64-4.4.0-2013_11_18:
> >     /share/apps/MIPAV/7.1.0_release:/share/apps/itksnap/itksnap-most-recent/bin/:
> >     /share/apps/MRtrix3/2016-04-25/mrtrix3/release/bin/:/share/apps/VoxBo/bin
> > sge_o_shell:                /bin/bash
> > sge_o_workdir:              /home/mgstauff
> > sge_o_host:                 chead
> > account:                    sge
> > hard resource_list:         h_stack=128m
> > mail_list:                  mgstauff@chead.local
> > notify:                     FALSE
> > job_name:                   myjobparam
> > jobshare:                   0
> > hard_queue_list:            all.q
> > env_list:                   TERM=NONE
> > job_args:                   5
> > script_file:                workshop-files/myjobparam
> > parallel environment:       int_test range: 2
> > binding:                    set linear:2
> > job_type:                   NONE
> > scheduling info:            queue instance "gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
> >             queue instance "qlogin.gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-18.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-17.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-16.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-13.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-15.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-14.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-12.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-11.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-10.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-9.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-5.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-6.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-7.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-8.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-4.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-2.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-1.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-0.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-20.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-19.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-0-3.local" dropped because it is temporarily not available
> >             queue instance "reboot.q@compute-gpu-0.local" dropped because it is temporarily not available
> >             queue instance "qlogin.long.q@compute-0-20.local" dropped because it is full
> >             queue instance "qlogin.long.q@compute-0-19.local" dropped because it is full
> >             queue instance "qlogin.long.q@compute-gpu-0.local" dropped because it is full
> >             queue instance "basic.q@compute-1-2.local" dropped because it is full
> >             queue instance "himem.q@compute-0-13.local" dropped because it is full
> >             queue instance "himem.q@compute-0-4.local" dropped because it is full
> >             queue instance "himem.q@compute-0-2.local" dropped because it is full
> >             queue instance "himem.q@compute-0-12.local" dropped because it is full
> >             queue instance "himem.q@compute-0-17.local" dropped because it is full
> >             queue instance "himem.q@compute-0-3.local" dropped because it is full
> >             queue instance "himem.q@compute-0-8.local" dropped because it is full
> >             queue instance "himem.q@compute-0-5.local" dropped because it is full
> >             queue instance "himem.q@compute-0-11.local" dropped because it is full
> >             queue instance "himem.q@compute-0-15.local" dropped because it is full
> >             queue instance "himem.q@compute-0-7.local" dropped because it is full
> >             queue instance "himem.q@compute-0-14.local" dropped because it is full
> >             queue instance "himem.q@compute-0-18.local" dropped because it is full
> >             queue instance "himem.q@compute-0-10.local" dropped because it is full
> >             queue instance "himem.q@compute-0-6.local" dropped because it is full
> >             queue instance "himem.q@compute-gpu-0.local" dropped because it is full
> >             queue instance "himem.q@compute-0-16.local" dropped because it is full
> >             queue instance "himem.q@compute-0-9.local" dropped because it is full
> >             queue instance "himem.q@compute-0-0.local" dropped because it is full
> >             queue instance "himem.q@compute-0-1.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-13.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-4.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-2.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-12.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-17.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-3.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-8.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-5.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-11.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-15.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-7.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-14.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-18.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-10.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-6.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-gpu-0.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-16.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-9.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-0.local" dropped because it is full
> >             queue instance "qlogin.himem.q@compute-0-1.local" dropped because it is full
> >             queue instance "qlogin.q@compute-0-20.local" dropped because it is full
> >             queue instance "qlogin.q@compute-0-19.local" dropped because it is full
> >             queue instance "qlogin.q@compute-gpu-0.local" dropped because it is full
> >             queue instance "qlogin.q@compute-0-7.local" dropped because it is full
> >             queue instance "all.q@compute-0-0.local" dropped because it is full
> >             cannot run in PE "int_test" because it only offers 0 slots
> > 
> > [mgstauff@chead ~]$ qquota -u mgstauff
> > resource quota rule limit                filter
> > --------------------------------------------------------------------------------
> > 
> > [mgstauff@chead ~]$ qconf -srqs limit_user_slots
> > {
> >    name         limit_user_slots
> >    description  Limit the users' batch slots
> >    enabled      TRUE
> >    limit        users {pcook,mgstauff} queues {allalt.q} to slots=32
> >    limit        users {*} queues {allalt.q} to slots=0
> >    limit        users {*} queues {himem.q} to slots=6
> >    limit        users {*} queues {all.q,himem.q} to slots=32
> >    limit        users {*} queues {basic.q} to slots=40
> > }
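Two hedged observations on the resource quota set, based only on the dump above: within a single rule set the first rule whose filter matches a job is the one that counts, so for all.q the slots=32 rule applies per user, and none of the rules is keyed on a parallel environment, which makes the RQS an unlikely source of a PE-specific "offers 0 slots" message. To rule it out entirely (queue, user and rule-set names taken from above):

    # show the quota rules currently charged against this user, limited to all.q
    qquota -u mgstauff -q all.q

    # temporarily disable the whole rule set for a test, then re-enable it
    qconf -mrqs limit_user_slots    # set "enabled FALSE", retest the stuck job, set it back to TRUE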
> > There are plenty of consumables available:
> > 
> > [root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q
> > queuename                      qtype resv/used/tot. load_avg arch          states
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-0.local        BP    0/4/4          5.24     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=0
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-1.local        BP    0/10/15        9.58     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=5
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-10.local       BP    0/9/16         9.80     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=7
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-11.local       BP    0/11/16        9.18     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=5
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-12.local       BP    0/11/16        9.72     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=5
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
> >         hc:h_vmem=28.890G
> >         hc:s_vmem=30.990G
> >         hc:slots=6
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-15.local       BP    0/10/16        9.54     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-16.local       BP    0/10/16        10.01    lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-17.local       BP    0/11/16        9.75     lx-amd64
> >         hc:h_vmem=29.963G
> >         hc:s_vmem=32.960G
> >         hc:slots=5
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-18.local       BP    0/11/16        10.29    lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=5
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-19.local       BP    0/9/14         9.01     lx-amd64
> >         qf:h_vmem=5.000G
> >         qf:s_vmem=5.000G
> >         qc:slots=5
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-2.local        BP    0/10/15        9.24     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=5
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-20.local       BP    0/0/4          0.00     lx-amd64
> >         qf:h_vmem=3.200G
> >         qf:s_vmem=3.200G
> >         qc:slots=4
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-3.local        BP    0/11/15        9.62     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=4
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-4.local        BP    0/12/15        9.85     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         qc:slots=3
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-5.local        BP    0/12/15        10.18    lx-amd64
> >         hc:h_vmem=36.490G
> >         hc:s_vmem=39.390G
> >         qc:slots=3
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-6.local        BP    0/12/16        9.95     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=4
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-7.local        BP    0/10/16        9.59     lx-amd64
> >         hc:h_vmem=36.935G
> >         qf:s_vmem=40.000G
> >         hc:slots=5
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-8.local        BP    0/10/16        9.37     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ---------------------------------------------------------------------------------
> > all.q@compute-0-9.local        BP    0/10/16        9.38     lx-amd64
> >         qf:h_vmem=40.000G
> >         qf:s_vmem=40.000G
> >         hc:slots=6
> > ---------------------------------------------------------------------------------
> > all.q@compute-gpu-0.local      BP    0/0/4          0.05     lx-amd64
> >         qf:h_vmem=3.200G
> >         qf:s_vmem=3.200G
> >         qc:slots=4
> > 
> > 
> > On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <becke...@mail.nih.gov> wrote:
> > 
> > On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote:
> > 
> > > SoGE 8.1.8
> > > 
> > > Hi,
> > > 
> > > I'm getting some queued jobs with scheduling info that includes this line
> > > at the end:
> > > 
> > > cannot run in PE "unihost" because it only offers 0 slots
> > > 
> > > 'unihost' is the only PE I use. When users request multiple slots, they
> > > use 'unihost':
> > > 
> > > ... -binding linear:2 -pe unihost 2 ...
> > > 
> > > What happens is that these jobs aren't running when it otherwise seems
> > > like they should be, or they sit waiting in the queue for a long time
> > > even when the user has plenty of quota available within the queue they've
> > > requested, and there are enough resources available on the queue's nodes
> > > (slots and vmem are consumables).
> > > 
> > > Any suggestions about how I might further understand this?
> > 
> > This *exact* problem has bitten me in the past. It seems to crop up
> > about every 3 years--long enough to remember it was a problem, and long
> > enough to forget just what the [censored] I did to fix it.
> > 
> > As I recall, it has little to do with actual PEs, but everything to do
> > with complexes and resource requests.
> > 
> > You might glean a bit more information by running "qsub -w p" (or "-w e").
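To make that suggestion concrete (hedged: option letters as I remember them from qsub(1)/qalter(1), and the schedd_runlog location is from memory): "-w p" asks the scheduler to validate the request against the current cluster state and report why it would not be dispatched, "-w v" validates against an empty cluster, and "qconf -tsm" dumps one scheduling run's per-job decisions.

    # validate a fresh submission without waiting for it to start
    # ("myjob.sh" is a placeholder)
    qsub -w p -binding linear:2 -pe unihost 2 myjob.sh

    # or poke the scheduler about a job that is already queued, e.g. the stuck one above
    qalter -w p 3714924

    # write one scheduling interval's decisions to
    # $SGE_ROOT/$SGE_CELL/common/schedd_runlog
    qconf -tsm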
> > Take a look at these previous discussions:
> > 
> > http://gridengine.org/pipermail/users/2011-November/001932.html
> > http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700
> > 
> > -- 
> > Jesse Becker (Contractor)
> > 
> > _______________________________________________
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users