Thanks for the reply Reuti, see below

On Fri, Aug 11, 2017 at 7:18 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> What I notice below: defining h_vmem/s_vmem on a queue level means per
> job. Defining it on an exechost level means across all jobs. What is
> different between:
>
> > --------------------------------------------------------------------------------
> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
> > qf:h_vmem=40.000G
> > qf:s_vmem=40.000G
> > hc:slots=6
> > --------------------------------------------------------------------------------
> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
> > hc:h_vmem=28.890G
> > hc:s_vmem=30.990G
> > hc:slots=6
>
> qf = queue fixed
> hc = host consumable
>
> What is the definition of h_vmem/s_vmem in `qconf -sc` and their default
> consumptions?

I thought this means that when it's showing qf, it's the per-job queue
limit, i.e. the queue has h_vmem and s_vmem limits of 40G per job (which it
does), and hc is shown when the host's remaining resources are less than
the per-job queue limit.

[root@chead ~]# qconf -sc | grep vmem
h_vmem              h_vmem     MEMORY    <=    YES    JOB    3100M    0
s_vmem              s_vmem     MEMORY    <=    YES    JOB    3000M    0

> > 'unihost' is the only PE I use. When users request multiple slots, they
> > use 'unihost':
> >
> > qsub ... -binding linear:2 -pe unihost 2 ...
> >
> > What happens is that these jobs aren't running when it otherwise seems
> > like they should be, or they sit waiting in the queue for a long time
> > even when the user has plenty of quota available within the queue
> > they've requested, there are enough resources available on the queue's
> > nodes per qhost (slots and vmem are consumables), and qquota isn't
> > showing that any rqs limits have been reached.
> >
> > Below I've dumped relevant configurations.
> >
> > Today I created a new PE called "int_test" to test the "integer"
> > allocation rule. I set it to 16 (16 cores per node), and have also tried
> > 8. It's been added as a PE to the queues we use. When I try to run on
> > this new PE, however, it *always* fails with the same "PE ...offers 0
> > slots" error, even if I can run the same multi-slot job using the
> > "unihost" PE at the same time. I'm not sure if this helps debug or not.
> >
> > Another thought - this behavior started happening some time ago, more or
> > less when I tried implementing fairshare behavior. I never seemed to get
> > fairshare working right. We haven't been able to confirm, but for some
> > users it seems this "PE 0 slots" issue pops up only after they've been
> > running other jobs for a little while. So I'm wondering if I've screwed
> > up fairshare in some way that's causing this odd behavior.
> >
> > The default queue from global config file is all.q.
>
> There is no default queue in SGE. One specifies resource requests and SGE
> will select an appropriate one. What do you refer to by this?
>
> Do you have any sge_request or private .sge_request?

Yes, the global sge_request has '-q all.q'. I can't remember why this was
done when I first set things up years ago - I think the cluster I was
migrating from was set up that way and I just copied it.

Given my qconf '-ssconf' and '-sconf' output below, does something look off
with my fairshare setup (and my subsequent attempt to disable it)? As I
mentioned, I'm wondering if something went wrong with how I set it up,
because this intermittent behavior may have started at around the same time.

-M

> > Here are various config dumps. Is there anything else that might be
> > helpful?
> >
> > Thanks for any help! This has been plaguing me.
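A couple of things I'm planning to check on my end before digging through
the dumps below. First, where that default '-q all.q' actually comes from -
a minimal sketch, assuming the stock $SGE_ROOT/$SGE_CELL layout (adjust the
paths if your cell directory lives elsewhere):

   # cluster-wide default submit options (this is where my '-q all.q' lives)
   cat $SGE_ROOT/$SGE_CELL/common/sge_request
   # per-user and per-directory defaults that can also inject requests
   cat ~/.sge_request ./.sge_request 2>/dev/null

Second, re: int_test - if I'm reading sge_pe(5) right, a fixed
allocation_rule of 8 only hands out slots in chunks of 8 per host, so a
2-slot request can never match it and would always report "offers 0 slots".
Something like the second form below is what I'd expect to work; the 8-slot
variant is just a guess I haven't verified yet:

   # always rejected: 2 is not a multiple of the fixed allocation_rule of 8
   qsub ... -binding linear:2 -pe int_test 2 ...
   # should be schedulable on a host with 8 free slots
   qsub ... -binding linear:8 -pe int_test 8 ...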
> >
> > [root@chead ~]# qconf -sp unihost
> > pe_name            unihost
> > slots              9999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $pe_slots
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -sp int_test
> > pe_name            int_test
> > slots              9999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    8
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> > qsort_args         NONE
> >
> > [root@chead ~]# qconf -ssconf
> > algorithm                         default
> > schedule_interval                 0:0:5
> > maxujobs                          200
> > queue_sort_method                 load
> > job_load_adjustments              np_load_avg=0.50
> > load_adjustment_decay_time        0:7:30
> > load_formula                      np_load_avg
> > schedd_job_info                   true
> > flush_submit_sec                  0
> > flush_finish_sec                  0
> > params                            none
> > reprioritize_interval             0:0:0
> > halftime                          1
> > usage_weight_list                 cpu=0.700000,mem=0.200000,io=0.100000
> > compensation_factor               5.000000
> > weight_user                       0.250000
> > weight_project                    0.250000
> > weight_department                 0.250000
> > weight_job                        0.250000
> > weight_tickets_functional         1000
> > weight_tickets_share              100000
> > share_override_tickets            TRUE
> > share_functional_shares           TRUE
> > max_functional_jobs_to_schedule   2000
> > report_pjob_tickets               TRUE
> > max_pending_tasks_per_job         100
> > halflife_decay_list               none
> > policy_hierarchy                  OS
> > weight_ticket                     0.000000
> > weight_waiting_time               1.000000
> > weight_deadline                   3600000.000000
> > weight_urgency                    0.100000
> > weight_priority                   1.000000
> > max_reservation                   0
> > default_duration                  INFINITY
> >
> > [root@chead ~]# qconf -sconf
> > #global:
> > execd_spool_dir              /opt/sge/default/spool
> > mailer                       /bin/mail
> > xterm                        /usr/bin/X11/xterm
> > load_sensor                  none
> > prolog                       none
> > epilog                       none
> > shell_start_mode             posix_compliant
> > login_shells                 sh,bash,ksh,csh,tcsh
> > min_uid                      0
> > min_gid                      0
> > user_lists                   none
> > xuser_lists                  none
> > projects                     none
> > xprojects                    none
> > enforce_project              false
> > enforce_user                 auto
> > load_report_time             00:00:40
> > max_unheard                  00:05:00
> > reschedule_unknown           02:00:00
> > loglevel                     log_warning
> > administrator_mail           none
> > set_token_cmd                none
> > pag_cmd                      none
> > token_extend_time            none
> > shepherd_cmd                 none
> > qmaster_params               none
> > execd_params                 ENABLE_BINDING=true
> > reporting_params             accounting=true reporting=true \
> >                              flush_time=00:00:15 joblog=true sharelog=00:00:00
> > finished_jobs                100
> > gid_range                    20000-20100
> > qlogin_command               /opt/sge/bin/cfn-qlogin.sh
> > qlogin_daemon                /usr/sbin/sshd -i
> > rlogin_command               builtin
> > rlogin_daemon                builtin
> > rsh_command                  builtin
> > rsh_daemon                   builtin
> > max_aj_instances             2000
> > max_aj_tasks                 75000
> > max_u_jobs                   4000
> > max_jobs                     0
> > max_advance_reservations     0
> > auto_user_oticket            0
> > auto_user_fshare             100
> > auto_user_default_project    none
> > auto_user_delete_time        0
> > delegated_file_staging       false
> > reprioritize                 0
> > jsv_url                      none
> > jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
> >
> > [root@chead ~]# qconf -sq all.q
> > qname                 all.q
> > hostlist              @allhosts
> > seq_no                0
> > load_thresholds       np_load_avg=1.75
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH
> > ckpt_list             NONE
> > pe_list               make mpich mpi orte unihost serial int_test unihost2
> > rerun                 FALSE
> > slots
1,[compute-0-0.local=4],[compute-0-1.local=15], \ > > [compute-0-2.local=15],[compute-0-3.local=15], \ > > [compute-0-4.local=15],[compute-0-5.local=15], \ > > [compute-0-6.local=16],[compute-0-7.local=16], \ > > [compute-0-9.local=16],[compute-0-10.local=16], \ > > [compute-0-11.local=16],[compute-0-12.local=16], \ > > [compute-0-13.local=16],[compute-0-14.local=16], \ > > [compute-0-15.local=16],[compute-0-16.local=16], \ > > [compute-0-17.local=16],[compute-0-18.local=16], \ > > [compute-0-8.local=16],[compute-0-19.local=14], \ > > [compute-0-20.local=4],[compute-gpu-0.local=4] > > tmpdir /tmp > > shell /bin/bash > > prolog NONE > > epilog NONE > > shell_start_mode posix_compliant > > starter_method NONE > > suspend_method NONE > > resume_method NONE > > terminate_method NONE > > notify 00:00:60 > > owner_list NONE > > user_lists NONE > > xuser_lists NONE > > subordinate_list NONE > > complex_values NONE > > projects NONE > > xprojects NONE > > calendar NONE > > initial_state default > > s_rt INFINITY > > h_rt INFINITY > > s_cpu INFINITY > > h_cpu INFINITY > > s_fsize INFINITY > > h_fsize INFINITY > > s_data INFINITY > > h_data INFINITY > > s_stack INFINITY > > h_stack INFINITY > > s_core INFINITY > > h_core INFINITY > > s_rss INFINITY > > h_rss INFINITY > > s_vmem 40G,[compute-0-20.local=3.2G], \ > > [compute-gpu-0.local=3.2G],[compute-0-19.local=5G] > > h_vmem 40G,[compute-0-20.local=3.2G], \ > > [compute-gpu-0.local=3.2G],[compute-0-19.local=5G] > > > > qstat -j on a stuck job as an example: > > > > [mgstauff@chead ~]$ qstat -j 3714924 > > ============================================================== > > job_number: 3714924 > > exec_file: job_scripts/3714924 > > submission_time: Fri Aug 11 12:48:47 2017 > > owner: mgstauff > > uid: 2198 > > group: mgstauff > > gid: 2198 > > sge_o_home: /home/mgstauff > > sge_o_log_name: mgstauff > > sge_o_path: /share/apps/mricron/ver_2015_ > 06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:/share/ > apps/c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/ > 5.3.0/bin:/share/apps/freesurfer/5.3.0/fsfast/bin:/ > share/apps/freesurfer/5.3.0/tktools:/share/apps/fsl/5.0.8/ > bin:/share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5. 
> 0.8/bin:/share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/ > openmpi/bin:/usr/lib64/qt-3.3/bin:/opt/sge/bin:/opt/sge/bin/ > lx-amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin: > /opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin: > /usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/ > sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/ > EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/ > bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/ > fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/ > bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/ > bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/ganglia/bin:/ > opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/ > opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/ > srvadmin/bin:/home/mgstauff/bin:/share/apps/R/R-3.1.1/bin: > /share/apps/rstudio/rstudio-0.98.1091/bin/:/share/apps/ANTs/ > 2014-06-23/build/bin/:/share/apps/matlab/R2014b/bin/:/ > share/apps/BrainVISA/brainvisa-Mandriva-2008.0-x86_ > 64-4.4.0-2013_11_18:/share/apps/MIPAV/7.1.0_release:/ > share/apps/itksnap/itksnap-most-recent/bin/:/share/apps/ > MRtrix3/2016-04-25/mrtrix3/release/bin/:/share/apps/VoxBo/bin > > sge_o_shell: /bin/bash > > sge_o_workdir: /home/mgstauff > > sge_o_host: chead > > account: sge > > hard resource_list: h_stack=128m > > mail_list: mgstauff@chead.local > > notify: FALSE > > job_name: myjobparam > > jobshare: 0 > > hard_queue_list: all.q > > env_list: TERM=NONE > > job_args: 5 > > script_file: workshop-files/myjobparam > > parallel environment: int_test range: 2 > > binding: set linear:2 > > job_type: NONE > > scheduling info: queue instance "gpu.q@compute-gpu-0.local" > dropped because it is temporarily not available > > queue instance > > "qlogin.gpu.q@compute-gpu-0.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-18.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-17.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-16.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-13.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-15.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-14.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-12.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-11.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-10.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-9.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-5.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-6.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-7.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-8.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-4.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-2.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-1.local" > dropped because it is temporarily not available > > queue instance 
"reboot.q@compute-0-0.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-20.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-19.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-0-3.local" > dropped because it is temporarily not available > > queue instance "reboot.q@compute-gpu-0.local" > dropped because it is temporarily not available > > queue instance > > "qlogin.long.q@compute-0-20.local" > dropped because it is full > > queue instance > > "qlogin.long.q@compute-0-19.local" > dropped because it is full > > queue instance > > "qlogin.long.q@compute-gpu-0.local" > dropped because it is full > > queue instance "basic.q@compute-1-2.local" > dropped because it is full > > queue instance "himem.q@compute-0-13.local" > dropped because it is full > > queue instance "himem.q@compute-0-4.local" > dropped because it is full > > queue instance "himem.q@compute-0-2.local" > dropped because it is full > > queue instance "himem.q@compute-0-12.local" > dropped because it is full > > queue instance "himem.q@compute-0-17.local" > dropped because it is full > > queue instance "himem.q@compute-0-3.local" > dropped because it is full > > queue instance "himem.q@compute-0-8.local" > dropped because it is full > > queue instance "himem.q@compute-0-5.local" > dropped because it is full > > queue instance "himem.q@compute-0-11.local" > dropped because it is full > > queue instance "himem.q@compute-0-15.local" > dropped because it is full > > queue instance "himem.q@compute-0-7.local" > dropped because it is full > > queue instance "himem.q@compute-0-14.local" > dropped because it is full > > queue instance "himem.q@compute-0-18.local" > dropped because it is full > > queue instance "himem.q@compute-0-10.local" > dropped because it is full > > queue instance "himem.q@compute-0-6.local" > dropped because it is full > > queue instance "himem.q@compute-gpu-0.local" > dropped because it is full > > queue instance "himem.q@compute-0-16.local" > dropped because it is full > > queue instance "himem.q@compute-0-9.local" > dropped because it is full > > queue instance "himem.q@compute-0-0.local" > dropped because it is full > > queue instance "himem.q@compute-0-1.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-13.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-4.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-2.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-12.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-17.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-3.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-8.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-5.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-11.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-15.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-7.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-14.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-18.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-10.local" > dropped 
because it is full > > queue instance > > "qlogin.himem.q@compute-0-6.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-gpu-0.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-16.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-9.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-0.local" > dropped because it is full > > queue instance > > "qlogin.himem.q@compute-0-1.local" > dropped because it is full > > queue instance "qlogin.q@compute-0-20.local" > dropped because it is full > > queue instance "qlogin.q@compute-0-19.local" > dropped because it is full > > queue instance "qlogin.q@compute-gpu-0.local" > dropped because it is full > > queue instance "qlogin.q@compute-0-7.local" > dropped because it is full > > queue instance "all.q@compute-0-0.local" > dropped because it is full > > cannot run in PE "int_test" because it only > offers 0 slots > > > > [mgstauff@chead ~]$ qquota -u mgstauff > > resource quota rule limit filter > > ------------------------------------------------------------ > -------------------- > > > > [mgstauff@chead ~]$ qconf -srqs limit_user_slots > > { > > name limit_user_slots > > description Limit the users' batch slots > > enabled TRUE > > limit users {pcook,mgstauff} queues {allalt.q} to slots=32 > > limit users {*} queues {allalt.q} to slots=0 > > limit users {*} queues {himem.q} to slots=6 > > limit users {*} queues {all.q,himem.q} to slots=32 > > limit users {*} queues {basic.q} to slots=40 > > } > > > > There are plenty of consumables available: > > > > [root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q a > > queuename qtype resv/used/tot. load_avg arch > states > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-0.local BP 0/4/4 5.24 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > qc:slots=0 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-1.local BP 0/10/15 9.58 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > qc:slots=5 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-10.local BP 0/9/16 9.80 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > hc:slots=7 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-11.local BP 0/11/16 9.18 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > hc:slots=5 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-12.local BP 0/11/16 9.72 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > hc:slots=5 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-13.local BP 0/10/16 9.14 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > hc:slots=6 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-14.local BP 0/10/16 9.66 lx-amd64 > > hc:h_vmem=28.890G > > hc:s_vmem=30.990G > > hc:slots=6 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-15.local BP 0/10/16 9.54 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > hc:slots=6 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-16.local BP 0/10/16 10.01 lx-amd64 > > qf:h_vmem=40.000G > > 
qf:s_vmem=40.000G > > hc:slots=6 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-17.local BP 0/11/16 9.75 lx-amd64 > > hc:h_vmem=29.963G > > hc:s_vmem=32.960G > > hc:slots=5 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-18.local BP 0/11/16 10.29 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > hc:slots=5 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-19.local BP 0/9/14 9.01 lx-amd64 > > qf:h_vmem=5.000G > > qf:s_vmem=5.000G > > qc:slots=5 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-2.local BP 0/10/15 9.24 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > qc:slots=5 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-20.local BP 0/0/4 0.00 lx-amd64 > > qf:h_vmem=3.200G > > qf:s_vmem=3.200G > > qc:slots=4 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-3.local BP 0/11/15 9.62 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > qc:slots=4 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-4.local BP 0/12/15 9.85 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > qc:slots=3 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-5.local BP 0/12/15 10.18 lx-amd64 > > hc:h_vmem=36.490G > > hc:s_vmem=39.390G > > qc:slots=3 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-6.local BP 0/12/16 9.95 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > hc:slots=4 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-7.local BP 0/10/16 9.59 lx-amd64 > > hc:h_vmem=36.935G > > qf:s_vmem=40.000G > > hc:slots=5 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-8.local BP 0/10/16 9.37 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > hc:slots=6 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-0-9.local BP 0/10/16 9.38 lx-amd64 > > qf:h_vmem=40.000G > > qf:s_vmem=40.000G > > hc:slots=6 > > ------------------------------------------------------------ > --------------------- > > all.q@compute-gpu-0.local BP 0/0/4 0.05 lx-amd64 > > qf:h_vmem=3.200G > > qf:s_vmem=3.200G > > qc:slots=4 > > > > > > On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <becke...@mail.nih.gov> > wrote: > > On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote: > > SoGE 8.1.8 > > > > Hi, > > > > I'm getting some queued jobs with scheduling info that includes this line > > at the end: > > > > cannot run in PE "unihost" because it only offers 0 slots > > > > 'unihost' is the only PE I use. When users request multiple slots, they > use > > 'unihost': > > > > ... -binding linear:2 -pe unihost 2 ... > > > > What happens is that these jobs aren't running when it otherwise seems > like > > they should be, or they sit waiting in the queue for a long time even > when > > the user has plenty of quota available within the queue they've > requested, > > and there are enough resources available on the queue's nodes (slots and > > vram are consumables). 
> >
> > Any suggestions about how I might further understand this?
> >
> > This *exact* problem has bitten me in the past. It seems to crop up
> > about every 3 years--long enough to remember it was a problem, and long
> > enough to forget just what the [censored] I did to fix it.
> >
> > As I recall, it has little to do with actual PEs, but everything to do
> > with complexes and resource requests.
> >
> > You might glean a bit more information by running "qsub -w p" (or "-w e").
> >
> > Take a look at these previous discussions:
> >
> > http://gridengine.org/pipermail/users/2011-November/001932.html
> > http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700
> >
> > --
> > Jesse Becker (Contractor)
> >
> > _______________________________________________
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> >
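Following up on Jesse's "-w" suggestion above, this is the sketch I intend
to try next. I'm assuming qalter accepts -w the same way qsub does, and
that -w p only prints a validation report against the current cluster state
without actually submitting - that's my reading of the man page, not
something I've confirmed:

   # validate a would-be submission against the cluster as it is right now
   qsub -w p ... -binding linear:2 -pe unihost 2 ...

   # or poke an already-pending job, e.g. the stuck one above
   qalter -w p 3714924

If either of those reports a complex/resource mismatch rather than a PE
problem, that would fit Jesse's point about complexes and resource requests.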
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users