Hi, I'm finally getting back to this post. I've looked at the links and suggestions in the two replies to my original post from a few months ago, but they haven't helped.
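For reference, my understanding of the submit-time verification suggestion from those replies is roughly the following (the script name here is just a placeholder, and I may well be misreading the intent of -w):

# submit with scheduler verification turned on:
[mgstauff@chead ~]$ qsub -w p -binding linear:2 -pe unihost 2 myscript.sh

# or ask about a job that's already queued (job ID taken from the stuck-job example further down):
[mgstauff@chead ~]$ qalter -w p 3714924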
Here's my original post:

I'm getting some queued jobs with scheduling info that includes this line at the end:

cannot run in PE "unihost" because it only offers 0 slots

'unihost' is the only PE I use. When users request multiple slots, they use 'unihost':

qsub ... -binding linear:2 -pe unihost 2 ...

What happens is that these jobs aren't running when it otherwise seems like they should be, or they sit waiting in the queue for a long time even when the user has plenty of quota available within the queue they've requested, there are enough resources available on the queue's nodes per qhost (slots and vmem are consumables), and qquota isn't showing that any RQS limits have been reached. Below I've dumped the relevant configurations.

Today I created a new PE called "int_test" to test the "integer" allocation rule. I set it to 16 (16 cores per node), and have also tried 8. It's been added as a PE to the queues we use. When I try to run a job in this new PE, however, it *always* fails with the same "PE ... offers 0 slots" error, even when I can run the same multi-slot job in the "unihost" PE at the same time. I'm not sure whether this helps with debugging or not.

Another thought - this behavior started happening more or less when I tried implementing fairshare some time ago. I never seemed to get fairshare working right. We haven't been able to confirm it, but for some users this "PE 0 slots" issue seems to pop up only after they've been running other jobs for a little while. So I'm wondering if I've screwed up fairshare in some way that's causing this odd behavior. (At the very bottom, after the quoted reply, I've also added a couple of extra checks I've been poking at on the scheduler and the fairshare setup.)

The default queue from the global config file is all.q. Here are the various config dumps. Is there anything else that might be helpful? Thanks for any help! This has been plaguing me.

[root@chead ~]# qconf -sp unihost
pe_name            unihost
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

[root@chead ~]# qconf -sp int_test
pe_name            int_test
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

[root@chead ~]# qconf -ssconf
algorithm                         default
schedule_interval                 0:0:5
maxujobs                          200
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          1
usage_weight_list                 cpu=0.700000,mem=0.200000,io=0.100000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         1000
weight_tickets_share              100000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   2000
report_pjob_tickets               TRUE
max_pending_tasks_per_job         100
halflife_decay_list               none
policy_hierarchy                  OS
weight_ticket                     0.000000
weight_waiting_time               1.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  INFINITY

[root@chead ~]# qconf -sconf
#global:
execd_spool_dir              /opt/sge/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,bash,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           02:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 ENABLE_BINDING=true
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:15 joblog=true sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               /opt/sge/bin/cfn-qlogin.sh
qlogin_daemon                /usr/sbin/sshd -i
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   4000
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             100
auto_user_default_project    none
auto_user_delete_time        0
delegated_file_staging       false
reprioritize                 0
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

[root@chead ~]# qconf -sq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               make mpich mpi orte unihost serial int_test unihost2
rerun                 FALSE
slots                 1,[compute-0-0.local=4],[compute-0-1.local=15], \
                      [compute-0-2.local=15],[compute-0-3.local=15], \
                      [compute-0-4.local=15],[compute-0-5.local=15], \
                      [compute-0-6.local=16],[compute-0-7.local=16], \
                      [compute-0-9.local=16],[compute-0-10.local=16], \
                      [compute-0-11.local=16],[compute-0-12.local=16], \
                      [compute-0-13.local=16],[compute-0-14.local=16], \
                      [compute-0-15.local=16],[compute-0-16.local=16], \
                      [compute-0-17.local=16],[compute-0-18.local=16], \
                      [compute-0-8.local=16],[compute-0-19.local=14], \
                      [compute-0-20.local=4],[compute-gpu-0.local=4]
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                40G,[compute-0-20.local=3.2G], \
                      [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
h_vmem                40G,[compute-0-20.local=3.2G], \
                      [compute-gpu-0.local=3.2G],[compute-0-19.local=5G]

qstat -j on a stuck job, as an example:

[mgstauff@chead ~]$ qstat -j 3714924
==============================================================
job_number:                 3714924
exec_file:                  job_scripts/3714924
submission_time:            Fri Aug 11 12:48:47 2017
owner:                      mgstauff
uid:                        2198
group:                      mgstauff
gid:                        2198
sge_o_home:                 /home/mgstauff
sge_o_log_name:             mgstauff
sge_o_path:                 /share/apps/mricron/ver_2015_06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:/share/apps/c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/5.3.0/bin:/share/apps/freesurfer/5.3.0/fsfast/bin:/share/apps/freesurfer/5.3.0/tktools:/share/apps/fsl/5.0.8/bin:/share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5.0.8/bin:/share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/opt/sge/bin:/opt/sge/bin/lx-amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin:/opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/home/mgstauff/bin:/share/apps/R/R-3.1.1/bin:/share/apps/rstudio/rstudio-0.98.1091/bin/:/share/apps/ANTs/2014-06-23/build/bin/:/share/apps/matlab/R2014b/bin/:/share/apps/BrainVISA/brainvisa-Mandriva-2008.0-x86_64-4.4.0-2013_11_18:/share/apps/MIPAV/7.1.0_release:/share/apps/itksnap/itksnap-most-recent/bin/:/share/apps/MRtrix3/2016-04-25/mrtrix3/release/bin/:/share/apps/VoxBo/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/mgstauff
sge_o_host:                 chead
account:                    sge
hard resource_list:         h_stack=128m
mail_list:                  mgstauff@chead.local
notify:                     FALSE
job_name:                   myjobparam
jobshare:                   0
hard_queue_list:            all.q
env_list:                   TERM=NONE
job_args:                   5
script_file:                workshop-files/myjobparam
parallel environment:       int_test range: 2
binding:                    set linear:2
job_type:                   NONE
scheduling info:
    queue instance "gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
    queue instance "qlogin.gpu.q@compute-gpu-0.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-18.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-17.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-16.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-13.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-15.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-14.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-12.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-11.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-10.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-9.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-5.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-6.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-7.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-8.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-4.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-2.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-1.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-0.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-20.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-19.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-0-3.local" dropped because it is temporarily not available
    queue instance "reboot.q@compute-gpu-0.local" dropped because it is temporarily not available
    queue instance "qlogin.long.q@compute-0-20.local" dropped because it is full
    queue instance "qlogin.long.q@compute-0-19.local" dropped because it is full
    queue instance "qlogin.long.q@compute-gpu-0.local" dropped because it is full
    queue instance "basic.q@compute-1-2.local" dropped because it is full
    queue instance "himem.q@compute-0-13.local" dropped because it is full
    queue instance "himem.q@compute-0-4.local" dropped because it is full
    queue instance "himem.q@compute-0-2.local" dropped because it is full
    queue instance "himem.q@compute-0-12.local" dropped because it is full
    queue instance "himem.q@compute-0-17.local" dropped because it is full
    queue instance "himem.q@compute-0-3.local" dropped because it is full
    queue instance "himem.q@compute-0-8.local" dropped because it is full
    queue instance "himem.q@compute-0-5.local" dropped because it is full
    queue instance "himem.q@compute-0-11.local" dropped because it is full
    queue instance "himem.q@compute-0-15.local" dropped because it is full
    queue instance "himem.q@compute-0-7.local" dropped because it is full
    queue instance "himem.q@compute-0-14.local" dropped because it is full
    queue instance "himem.q@compute-0-18.local" dropped because it is full
    queue instance "himem.q@compute-0-10.local" dropped because it is full
    queue instance "himem.q@compute-0-6.local" dropped because it is full
    queue instance "himem.q@compute-gpu-0.local" dropped because it is full
    queue instance "himem.q@compute-0-16.local" dropped because it is full
    queue instance "himem.q@compute-0-9.local" dropped because it is full
    queue instance "himem.q@compute-0-0.local" dropped because it is full
    queue instance "himem.q@compute-0-1.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-13.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-4.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-2.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-12.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-17.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-3.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-8.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-5.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-11.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-15.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-7.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-14.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-18.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-10.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-6.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-gpu-0.local" dropped because it is full
    queue instance "qlogin.himem.q@compute-0-16.local" dropped because it is full
"qlogin.himem.q@compute-0-9.local" dropped because it is full queue instance "qlogin.himem.q@compute-0-0.local" dropped because it is full queue instance "qlogin.himem.q@compute-0-1.local" dropped because it is full queue instance "qlogin.q@compute-0-20.local" dropped because it is full queue instance "qlogin.q@compute-0-19.local" dropped because it is full queue instance "qlogin.q@compute-gpu-0.local" dropped because it is full queue instance "qlogin.q@compute-0-7.local" dropped because it is full queue instance "all.q@compute-0-0.local" dropped because it is full cannot run in PE "int_test" because it only offers 0 slots [mgstauff@chead ~]$ qquota -u mgstauff resource quota rule limit filter -------------------------------------------------------------------------------- [mgstauff@chead ~]$ qconf -srqs limit_user_slots { name limit_user_slots description Limit the users' batch slots enabled TRUE limit users {pcook,mgstauff} queues {allalt.q} to slots=32 limit users {*} queues {allalt.q} to slots=0 limit users {*} queues {himem.q} to slots=6 limit users {*} queues {all.q,himem.q} to slots=32 limit users {*} queues {basic.q} to slots=40 } There are plenty of consumables available: [root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q a queuename qtype resv/used/tot. load_avg arch states --------------------------------------------------------------------------------- all.q@compute-0-0.local BP 0/4/4 5.24 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G qc:slots=0 --------------------------------------------------------------------------------- all.q@compute-0-1.local BP 0/10/15 9.58 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G qc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-10.local BP 0/9/16 9.80 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=7 --------------------------------------------------------------------------------- all.q@compute-0-11.local BP 0/11/16 9.18 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-12.local BP 0/11/16 9.72 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-13.local BP 0/10/16 9.14 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=6 --------------------------------------------------------------------------------- all.q@compute-0-14.local BP 0/10/16 9.66 lx-amd64 hc:h_vmem=28.890G hc:s_vmem=30.990G hc:slots=6 --------------------------------------------------------------------------------- all.q@compute-0-15.local BP 0/10/16 9.54 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=6 --------------------------------------------------------------------------------- all.q@compute-0-16.local BP 0/10/16 10.01 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=6 --------------------------------------------------------------------------------- all.q@compute-0-17.local BP 0/11/16 9.75 lx-amd64 hc:h_vmem=29.963G hc:s_vmem=32.960G hc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-18.local BP 0/11/16 10.29 lx-amd64 qf:h_vmem=40.000G qf:s_vmem=40.000G hc:slots=5 --------------------------------------------------------------------------------- all.q@compute-0-19.local BP 0/9/14 9.01 lx-amd64 qf:h_vmem=5.000G qf:s_vmem=5.000G qc:slots=5 --------------------------------------------------------------------------------- 
all.q@compute-0-2.local        BP    0/10/15        9.24     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   qc:slots=5
---------------------------------------------------------------------------------
all.q@compute-0-20.local       BP    0/0/4          0.00     lx-amd64
   qf:h_vmem=3.200G
   qf:s_vmem=3.200G
   qc:slots=4
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BP    0/11/15        9.62     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   qc:slots=4
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BP    0/12/15        9.85     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   qc:slots=3
---------------------------------------------------------------------------------
all.q@compute-0-5.local        BP    0/12/15        10.18    lx-amd64
   hc:h_vmem=36.490G
   hc:s_vmem=39.390G
   qc:slots=3
---------------------------------------------------------------------------------
all.q@compute-0-6.local        BP    0/12/16        9.95     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   hc:slots=4
---------------------------------------------------------------------------------
all.q@compute-0-7.local        BP    0/10/16        9.59     lx-amd64
   hc:h_vmem=36.935G
   qf:s_vmem=40.000G
   hc:slots=5
---------------------------------------------------------------------------------
all.q@compute-0-8.local        BP    0/10/16        9.37     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   hc:slots=6
---------------------------------------------------------------------------------
all.q@compute-0-9.local        BP    0/10/16        9.38     lx-amd64
   qf:h_vmem=40.000G
   qf:s_vmem=40.000G
   hc:slots=6
---------------------------------------------------------------------------------
all.q@compute-gpu-0.local      BP    0/0/4          0.05     lx-amd64
   qf:h_vmem=3.200G
   qf:s_vmem=3.200G
   qc:slots=4


On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <becke...@mail.nih.gov> wrote:

> On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote:
>
>> SoGE 8.1.8
>>
>> Hi,
>>
>> I'm getting some queued jobs with scheduling info that includes this line
>> at the end:
>>
>> cannot run in PE "unihost" because it only offers 0 slots
>>
>> 'unihost' is the only PE I use. When users request multiple slots, they
>> use 'unihost':
>>
>> ... -binding linear:2 -pe unihost 2 ...
>>
>> What happens is that these jobs aren't running when it otherwise seems
>> like they should be, or they sit waiting in the queue for a long time
>> even when the user has plenty of quota available within the queue they've
>> requested, and there are enough resources available on the queue's nodes
>> (slots and vram are consumables).
>>
>> Any suggestions about how I might further understand this?
>
> This *exact* problem has bitten me in the past. It seems to crop up
> about every 3 years--long enough to remember it was a problem, and long
> enough to forget just what the [censored] I did to fix it.
>
> As I recall, it has little to do with actual PEs, but everything to do
> with complexes and resource requests.
>
> You might glean a bit more information by running "qsub -w p" (or "-w e").
>
> Take a look at these previous discussions:
>
> http://gridengine.org/pipermail/users/2011-November/001932.html
> http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700
>
>
> --
> Jesse Becker (Contractor)
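As mentioned above, here are the extra checks I've been poking at. To get more detail than the one-line "offers 0 slots" message, I've been triggering a scheduler monitoring run and reading the dispatch log it writes. I believe this is the standard way to do it, and the path assumes the default cell under /opt/sge, but please correct me if I'm using it wrong:

[root@chead ~]# qconf -tsm
[root@chead ~]# less /opt/sge/default/common/schedd_runlog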
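And since I suspect my fairshare attempt is involved, this is how I've been looking at the policy side of things. I'm not certain these are even the right places to check, so corrections are welcome (the username is just mine as an example):

[root@chead ~]# qconf -sstree              # share tree, if one is actually defined
[root@chead ~]# qconf -suser mgstauff      # the auto-created user's fshare/oticket values
[mgstauff@chead ~]$ qstat -ext -u mgstauff # ticket/priority columns for pending jobs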