I have some more information. We have two sets of exec hosts on the cluster: one set is in the host group/hostlist "@allhosts", which is assigned to the queue all.q; the other is in the host group "@basichosts", which is assigned to a queue called basic.q.
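For reference, this is how the groups and queues are wired up; the commands below are just the quick way I check it (output omitted):

  qconf -shgrp @allhosts              # hosts behind all.q
  qconf -shgrp @basichosts            # hosts behind basic.q
  qconf -sq all.q | grep hostlist     # -> @allhosts
  qconf -sq basic.q | grep hostlist   # -> @basichosts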
When we're having the trouble with multi-slot/core jobs not running for a user on all.q, the same jobs can be resubmitted (or moved via qalter) to basic.q, and they run immediately. I also made a duplicate of all.q called allalt.q, and the same problem happens there: jobs get stuck in the queue. When I change the hostlist in allalt.q from @allhosts to @basichosts, and nothing else, the stuck jobs run immediately. (Again, this is happening when there are plenty of resources reported available on the all.q hosts, and the user's quotas are either unused or not maxed out.) Rough commands for this test are sketched after the host definitions below.

Here are the definitions of a host from each of the groups.

A host from all.q's group, @allhosts, where jobs get stuck:

[root@chead ~]# qconf -se compute-0-1
hostname              compute-0-1.local
load_scaling          NONE
complex_values        h_vmem=125.49G,s_vmem=125.49G,slots=16.000000
load_values           arch=lx-amd64,num_proc=16,mem_total=64508.523438M, \
                      swap_total=31999.996094M,virtual_total=96508.519531M, \
                      m_topology=SCCCCCCCCSCCCCCCCC,m_socket=2,m_core=16, \
                      m_thread=16,load_avg=7.590000,load_short=7.660000, \
                      load_medium=7.590000,load_long=7.300000, \
                      mem_free=53815.035156M,swap_free=31834.675781M, \
                      virtual_free=85649.710938M,mem_used=10693.488281M, \
                      swap_used=165.320312M,virtual_used=10858.808594M, \
                      cpu=42.800000,m_topology_inuse=SccccCCCCSccCccCCC, \
                      np_load_avg=0.474375,np_load_short=0.478750, \
                      np_load_medium=0.474375,np_load_long=0.456250
processors            16
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE

And a host from basic.q's group, @basichosts, where jobs run immediately:

[root@chead ~]# qconf -se compute-1-0
hostname              compute-1-0.local
load_scaling          NONE
complex_values        h_vmem=19.02G,s_vmem=19.02G,slots=8.000000
load_values           arch=lx-amd64,num_proc=8,mem_total=16077.441406M, \
                      swap_total=3999.996094M,virtual_total=20077.437500M, \
                      m_topology=SCCCCSCCCC,m_socket=2,m_core=8,m_thread=8, \
                      load_avg=1.680000,load_short=2.420000, \
                      load_medium=1.680000,load_long=1.790000, \
                      mem_free=13408.687500M,swap_free=3973.464844M, \
                      virtual_free=17382.152344M,mem_used=2668.753906M, \
                      swap_used=26.531250M,virtual_used=2695.285156M, \
                      cpu=16.400000,m_topology_inuse=SccCCScCCC, \
                      np_load_avg=0.210000,np_load_short=0.302500, \
                      np_load_medium=0.210000,np_load_long=0.223750
processors            8
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
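To be concrete, the qalter / hostlist swap described above looks roughly like this (the job ID is just an example of a stuck job, and I believe qconf -mattr is equivalent to editing the hostlist via qconf -mq allalt.q, which is what I actually did):

  qstat -j 3714924                                   # stuck multi-slot job, "...offers 0 slots"
  qalter -q basic.q 3714924                          # re-point the pending job at basic.q -> it starts immediately
  qconf -mattr queue hostlist @basichosts allalt.q   # swap only the hostlist on the test queue -> stuck jobs start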
Here's the full complex config. 'slots' is listed as "YES" under consumable, whereas s_vmem and h_vmem are listed as "JOB". It seems like this should be OK, but maybe not? Also, 'slots' has an urgency of 1000, whereas the others have 0.

[root@chead ~]# qconf -sc
#name               shortcut   type        relop requestable consumable default  urgency
#----------------------------------------------------------------------------------------
arch                a          RESTRING    ==    YES         NO         NONE     0
calendar            c          RESTRING    ==    YES         NO         NONE     0
cpu                 cpu        DOUBLE      >=    YES         NO         0        0
display_win_gui     dwg        BOOL        ==    YES         NO         0        0
h_core              h_core     MEMORY      <=    YES         NO         0        0
h_cpu               h_cpu      TIME        <=    YES         NO         0:0:0    0
h_data              h_data     MEMORY      <=    YES         NO         0        0
h_fsize             h_fsize    MEMORY      <=    YES         NO         0        0
h_rss               h_rss      MEMORY      <=    YES         NO         0        0
h_rt                h_rt       TIME        <=    YES         NO         0:0:0    0
h_stack             h_stack    MEMORY      <=    YES         NO         0        0
h_vmem              h_vmem     MEMORY      <=    YES         JOB        3100M    0
hostname            h          HOST        ==    YES         NO         NONE     0
load_avg            la         DOUBLE      >=    NO          NO         0        0
load_long           ll         DOUBLE      >=    NO          NO         0        0
load_medium         lm         DOUBLE      >=    NO          NO         0        0
load_short          ls         DOUBLE      >=    NO          NO         0        0
m_core              core       INT         <=    YES         NO         0        0
m_socket            socket     INT         <=    YES         NO         0        0
m_thread            thread     INT         <=    YES         NO         0        0
m_topology          topo       RESTRING    ==    YES         NO         NONE     0
m_topology_inuse    utopo      RESTRING    ==    YES         NO         NONE     0
mem_free            mf         MEMORY      <=    YES         NO         0        0
mem_total           mt         MEMORY      <=    YES         NO         0        0
mem_used            mu         MEMORY      >=    YES         NO         0        0
min_cpu_interval    mci        TIME        <=    NO          NO         0:0:0    0
np_load_avg         nla        DOUBLE      >=    NO          NO         0        0
np_load_long        nll        DOUBLE      >=    NO          NO         0        0
np_load_medium      nlm        DOUBLE      >=    NO          NO         0        0
np_load_short       nls        DOUBLE      >=    NO          NO         0        0
num_proc            p          INT         ==    YES         NO         0        0
qname               q          RESTRING    ==    YES         NO         NONE     0
rerun               re         BOOL        ==    NO          NO         0        0
s_core              s_core     MEMORY      <=    YES         NO         0        0
s_cpu               s_cpu      TIME        <=    YES         NO         0:0:0    0
s_data              s_data     MEMORY      <=    YES         NO         0        0
s_fsize             s_fsize    MEMORY      <=    YES         NO         0        0
s_rss               s_rss      MEMORY      <=    YES         NO         0        0
s_rt                s_rt       TIME        <=    YES         NO         0:0:0    0
s_stack             s_stack    MEMORY      <=    YES         NO         0        0
s_vmem              s_vmem     MEMORY      <=    YES         JOB        3000M    0
seq_no              seq        INT         ==    NO          NO         0        0
slots               s          INT         <=    YES         YES        1        1000
swap_free           sf         MEMORY      <=    YES         NO         0        0
swap_rate           sr         MEMORY      >=    YES         NO         0        0
swap_rsvd           srsv       MEMORY      >=    YES         NO         0        0
swap_total          st         MEMORY      <=    YES         NO         0        0
swap_used           su         MEMORY      >=    YES         NO         0        0
tmpdir              tmp        RESTRING    ==    NO          NO         NONE     0
virtual_free        vf         MEMORY      <=    YES         NO         0        0
virtual_total       vt         MEMORY      <=    YES         NO         0        0
virtual_used        vu         MEMORY      >=    YES         NO         0        0

Does this info help at all in diagnosing? Is there any other config info that would help in understanding this?

-M
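P.S. In case it's useful, here's roughly how I've been checking the scheduler's view when one of these jobs is stuck. The job ID is just an example of a stuck job, and I believe qconf -tsm needs to be run as an SGE manager and writes the next scheduling pass to $SGE_ROOT/$SGE_CELL/common/schedd_runlog:

  qstat -j 3714924                         # scheduling info for the stuck job (schedd_job_info is true)
  qquota -u mgstauff                       # check whether any RQS limit has been reached
  qstat -F h_vmem,s_vmem,slots -q all.q    # per-instance consumable availability on all.q
  qconf -tsm                               # trigger one monitored scheduling run and dump its reasoning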
On Sun, Aug 13, 2017 at 12:11 PM, Michael Stauffer <mgsta...@gmail.com> wrote:

> Thanks for the reply Reuti, see below
>
> On Fri, Aug 11, 2017 at 7:18 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>
>> What I notice below: defining h_vmem/s_vmem on a queue level means per
>> job. Defining it on an exechost level means across all jobs. What is
>> different between:
>>
>> > ---------------------------------------------------------------------------------
>> > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
>> >     qf:h_vmem=40.000G
>> >     qf:s_vmem=40.000G
>> >     hc:slots=6
>> > ---------------------------------------------------------------------------------
>> > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
>> >     hc:h_vmem=28.890G
>> >     hc:s_vmem=30.990G
>> >     hc:slots=6
>>
>> qf = queue fixed
>> hc = host consumable
>>
>> What is the definition of h_vmem/s_vmem in `qconf -sc` and their default
>> consumptions?
>
> I thought this means that when it's showing qf, it's the per-job queue
> limit, i.e. the queue has a h_vmem and s_vmem limits for the job of 40G
> (which it does). And then hc is shown when the host resources are less
> than the per-job queue limit.
>
> [root@chead ~]# qconf -sc | grep vmem
> h_vmem              h_vmem     MEMORY      <=    YES         JOB        3100M    0
> s_vmem              s_vmem     MEMORY      <=    YES         JOB        3000M    0
>
>> > 'unihost' is the only PE I use. When users request multiple slots, they
>> > use 'unihost':
>> >
>> > qsub ... -binding linear:2 -pe unihost 2 ...
>> >
>> > What happens is that these jobs aren't running when it otherwise seems
>> > like they should be, or they sit waiting in the queue for a long time
>> > even when the user has plenty of quota available within the queue they've
>> > requested, and there are enough resources available on the queue's nodes
>> > per qhost (slots and vmem are consumables), and qquota isn't showing any
>> > rqs limits have been reached.
>> >
>> > Below I've dumped relevant configurations.
>> >
>> > Today I created a new PE called "int_test" to test the "integer"
>> > allocation rule. I set it to 16 (16 cores per node), and have also tried
>> > 8. It's been added as a PE to the queues we use. When I try to run to
>> > this new PE however, it *always* fails with the same "PE ...offers 0
>> > slots" error, even if I can run the same multi-slot job using "unihost"
>> > PE at the same time. I'm not sure if this helps debug or not.
>> >
>> > Another thought - this behavior started happening some time ago more or
>> > less when I tried implementing fairshare behavior. I never seemed to get
>> > fairshare working right. We haven't been able to confirm, but for some
>> > users it seems this "PE 0 slots" issue pops up only after they've been
>> > running other jobs for a little while. So I'm wondering if I've screwed
>> > up fairshare in some way that's causing this odd behavior.
>> >
>> > The default queue from global config file is all.q.
>>
>> There is no default queue in SGE. One specifies resource requests and SGE
>> will select an appropriate one. What do you refer to by this?
>>
>> Do you have any sge_request or private .sge_request?
>
> Yes, the global sge_request has '-q all.q'. I can't remember why this was
> done when I first set things up years ago - I think the cluster I was
> migrating from was set up that way and I just copied it.
>
> Given my qconf '-ssconf' and '-sconf' output below, does something look
> off with my fairshare setup (and subsequent attempt to disable it)? As I
> mentioned, I'm wondering if something went wrong with how I set it up
> because this intermittent behavior may have started at the same time.
>
> -M
>
>> > Here are various config dumps. Is there anything else that might be
>> > helpful?
>> >
>> > Thanks for any help! This has been plaguing me.
>> > >> > >> > [root@chead ~]# qconf -sp unihost >> > pe_name unihost >> > slots 9999 >> > user_lists NONE >> > xuser_lists NONE >> > start_proc_args /bin/true >> > stop_proc_args /bin/true >> > allocation_rule $pe_slots >> > control_slaves FALSE >> > job_is_first_task TRUE >> > urgency_slots min >> > accounting_summary FALSE >> > qsort_args NONE >> > >> > [root@chead ~]# qconf -sp int_test >> > pe_name int_test >> > slots 9999 >> > user_lists NONE >> > xuser_lists NONE >> > start_proc_args /bin/true >> > stop_proc_args /bin/true >> > allocation_rule 8 >> > control_slaves FALSE >> > job_is_first_task TRUE >> > urgency_slots min >> > accounting_summary FALSE >> > qsort_args NONE >> > >> > [root@chead ~]# qconf -ssconf >> > algorithm default >> > schedule_interval 0:0:5 >> > maxujobs 200 >> > queue_sort_method load >> > job_load_adjustments np_load_avg=0.50 >> > load_adjustment_decay_time 0:7:30 >> > load_formula np_load_avg >> > schedd_job_info true >> > flush_submit_sec 0 >> > flush_finish_sec 0 >> > params none >> > reprioritize_interval 0:0:0 >> > halftime 1 >> > usage_weight_list cpu=0.700000,mem=0.200000,io=0.100000 >> > compensation_factor 5.000000 >> > weight_user 0.250000 >> > weight_project 0.250000 >> > weight_department 0.250000 >> > weight_job 0.250000 >> > weight_tickets_functional 1000 >> > weight_tickets_share 100000 >> > share_override_tickets TRUE >> > share_functional_shares TRUE >> > max_functional_jobs_to_schedule 2000 >> > report_pjob_tickets TRUE >> > max_pending_tasks_per_job 100 >> > halflife_decay_list none >> > policy_hierarchy OS >> > weight_ticket 0.000000 >> > weight_waiting_time 1.000000 >> > weight_deadline 3600000.000000 >> > weight_urgency 0.100000 >> > weight_priority 1.000000 >> > max_reservation 0 >> > default_duration INFINITY >> > >> > [root@chead ~]# qconf -sconf >> > #global: >> > execd_spool_dir /opt/sge/default/spool >> > mailer /bin/mail >> > xterm /usr/bin/X11/xterm >> > load_sensor none >> > prolog none >> > epilog none >> > shell_start_mode posix_compliant >> > login_shells sh,bash,ksh,csh,tcsh >> > min_uid 0 >> > min_gid 0 >> > user_lists none >> > xuser_lists none >> > projects none >> > xprojects none >> > enforce_project false >> > enforce_user auto >> > load_report_time 00:00:40 >> > max_unheard 00:05:00 >> > reschedule_unknown 02:00:00 >> > loglevel log_warning >> > administrator_mail none >> > set_token_cmd none >> > pag_cmd none >> > token_extend_time none >> > shepherd_cmd none >> > qmaster_params none >> > execd_params ENABLE_BINDING=true >> > reporting_params accounting=true reporting=true \ >> > flush_time=00:00:15 joblog=true >> sharelog=00:00:00 >> > finished_jobs 100 >> > gid_range 20000-20100 >> > qlogin_command /opt/sge/bin/cfn-qlogin.sh >> > qlogin_daemon /usr/sbin/sshd -i >> > rlogin_command builtin >> > rlogin_daemon builtin >> > rsh_command builtin >> > rsh_daemon builtin >> > max_aj_instances 2000 >> > max_aj_tasks 75000 >> > max_u_jobs 4000 >> > max_jobs 0 >> > max_advance_reservations 0 >> > auto_user_oticket 0 >> > auto_user_fshare 100 >> > auto_user_default_project none >> > auto_user_delete_time 0 >> > delegated_file_staging false >> > reprioritize 0 >> > jsv_url none >> > jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w >> > >> > [root@chead ~]# qconf -sq all.q >> > qname all.q >> > hostlist @allhosts >> > seq_no 0 >> > load_thresholds np_load_avg=1.75 >> > suspend_thresholds NONE >> > nsuspend 1 >> > suspend_interval 00:05:00 >> > priority 0 >> > min_cpu_interval 00:05:00 >> > processors UNDEFINED >> > qtype BATCH >> > ckpt_list 
NONE >> > pe_list make mpich mpi orte unihost serial int_test >> unihost2 >> > rerun FALSE >> > slots 1,[compute-0-0.local=4],[compute-0-1.local=15], \ >> > [compute-0-2.local=15],[compute-0-3.local=15], \ >> > [compute-0-4.local=15],[compute-0-5.local=15], \ >> > [compute-0-6.local=16],[compute-0-7.local=16], \ >> > [compute-0-9.local=16],[compute-0-10.local=16], \ >> > [compute-0-11.local=16],[compute-0-12.local=16], >> \ >> > [compute-0-13.local=16],[compute-0-14.local=16], >> \ >> > [compute-0-15.local=16],[compute-0-16.local=16], >> \ >> > [compute-0-17.local=16],[compute-0-18.local=16], >> \ >> > [compute-0-8.local=16],[compute-0-19.local=14], \ >> > [compute-0-20.local=4],[compute-gpu-0.local=4] >> > tmpdir /tmp >> > shell /bin/bash >> > prolog NONE >> > epilog NONE >> > shell_start_mode posix_compliant >> > starter_method NONE >> > suspend_method NONE >> > resume_method NONE >> > terminate_method NONE >> > notify 00:00:60 >> > owner_list NONE >> > user_lists NONE >> > xuser_lists NONE >> > subordinate_list NONE >> > complex_values NONE >> > projects NONE >> > xprojects NONE >> > calendar NONE >> > initial_state default >> > s_rt INFINITY >> > h_rt INFINITY >> > s_cpu INFINITY >> > h_cpu INFINITY >> > s_fsize INFINITY >> > h_fsize INFINITY >> > s_data INFINITY >> > h_data INFINITY >> > s_stack INFINITY >> > h_stack INFINITY >> > s_core INFINITY >> > h_core INFINITY >> > s_rss INFINITY >> > h_rss INFINITY >> > s_vmem 40G,[compute-0-20.local=3.2G], \ >> > [compute-gpu-0.local=3.2G],[c >> ompute-0-19.local=5G] >> > h_vmem 40G,[compute-0-20.local=3.2G], \ >> > [compute-gpu-0.local=3.2G],[c >> ompute-0-19.local=5G] >> > >> > qstat -j on a stuck job as an example: >> > >> > [mgstauff@chead ~]$ qstat -j 3714924 >> > ============================================================== >> > job_number: 3714924 >> > exec_file: job_scripts/3714924 >> > submission_time: Fri Aug 11 12:48:47 2017 >> > owner: mgstauff >> > uid: 2198 >> > group: mgstauff >> > gid: 2198 >> > sge_o_home: /home/mgstauff >> > sge_o_log_name: mgstauff >> > sge_o_path: /share/apps/mricron/ver_2015_ >> 06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:/share/apps >> /c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/5. 
>> 3.0/bin:/share/apps/freesurfer/5.3.0/fsfast/bin:/share/apps/ >> freesurfer/5.3.0/tktools:/share/apps/fsl/5.0.8/bin:/ >> share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5.0.8/ >> bin:/share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/openmpi/ >> bin:/usr/lib64/qt-3.3/bin:/opt/sge/bin:/opt/sge/bin/lx- >> amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin:/ >> opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin:/usr/ >> local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/ >> opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/ >> bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/ >> hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fast >> a:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/ >> gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/ >> autodocksuite/bin:/opt/bio/wgs/bin:/opt/ganglia/bin:/opt/ >> ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/ >> bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/ >> home/mgstauff/bin:/share/apps/R/R-3.1.1/bin:/share/apps/ >> rstudio/rstudio-0.98.1091/bin/:/share/apps/ANTs/2014-06-23/ >> build/bin/:/share/apps/matlab/R2014b/bin/:/share/apps/ >> BrainVISA/brainvisa-Mandriva-2008.0-x86_64-4.4.0-2013_11_ >> 18:/share/apps/MIPAV/7.1.0_release:/share/apps/itksnap/ >> itksnap-most-recent/bin/:/share/apps/MRtrix3/2016-04-25/ >> mrtrix3/release/bin/:/share/apps/VoxBo/bin >> > sge_o_shell: /bin/bash >> > sge_o_workdir: /home/mgstauff >> > sge_o_host: chead >> > account: sge >> > hard resource_list: h_stack=128m >> > mail_list: mgstauff@chead.local >> > notify: FALSE >> > job_name: myjobparam >> > jobshare: 0 >> > hard_queue_list: all.q >> > env_list: TERM=NONE >> > job_args: 5 >> > script_file: workshop-files/myjobparam >> > parallel environment: int_test range: 2 >> > binding: set linear:2 >> > job_type: NONE >> > scheduling info: queue instance "gpu.q@compute-gpu-0.local" >> dropped because it is temporarily not available >> > queue instance >> "qlogin.gpu.q@compute-gpu-0.local" dropped because it is temporarily not >> available >> > queue instance "reboot.q@compute-0-18.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-17.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-16.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-13.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-15.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-14.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-12.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-11.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-10.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-9.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-5.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-6.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-7.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-8.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-4.local" >> dropped because it is temporarily 
not available >> > queue instance "reboot.q@compute-0-2.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-1.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-0.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-20.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-19.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-0-3.local" >> dropped because it is temporarily not available >> > queue instance "reboot.q@compute-gpu-0.local" >> dropped because it is temporarily not available >> > queue instance >> "qlogin.long.q@compute-0-20.local" dropped because it is full >> > queue instance >> "qlogin.long.q@compute-0-19.local" dropped because it is full >> > queue instance >> "qlogin.long.q@compute-gpu-0.local" dropped because it is full >> > queue instance "basic.q@compute-1-2.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-13.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-4.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-2.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-12.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-17.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-3.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-8.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-5.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-11.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-15.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-7.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-14.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-18.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-10.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-6.local" >> dropped because it is full >> > queue instance "himem.q@compute-gpu-0.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-16.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-9.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-0.local" >> dropped because it is full >> > queue instance "himem.q@compute-0-1.local" >> dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-13.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-4.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-2.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-12.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-17.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-3.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-8.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-5.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-11.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-15.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-7.local" dropped because it 
is full >> > queue instance >> "qlogin.himem.q@compute-0-14.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-18.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-10.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-6.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-gpu-0.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-16.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-9.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-0.local" dropped because it is full >> > queue instance >> "qlogin.himem.q@compute-0-1.local" dropped because it is full >> > queue instance "qlogin.q@compute-0-20.local" >> dropped because it is full >> > queue instance "qlogin.q@compute-0-19.local" >> dropped because it is full >> > queue instance "qlogin.q@compute-gpu-0.local" >> dropped because it is full >> > queue instance "qlogin.q@compute-0-7.local" >> dropped because it is full >> > queue instance "all.q@compute-0-0.local" >> dropped because it is full >> > cannot run in PE "int_test" because it only >> offers 0 slots >> > >> > [mgstauff@chead ~]$ qquota -u mgstauff >> > resource quota rule limit filter >> > ------------------------------------------------------------ >> -------------------- >> > >> > [mgstauff@chead ~]$ qconf -srqs limit_user_slots >> > { >> > name limit_user_slots >> > description Limit the users' batch slots >> > enabled TRUE >> > limit users {pcook,mgstauff} queues {allalt.q} to slots=32 >> > limit users {*} queues {allalt.q} to slots=0 >> > limit users {*} queues {himem.q} to slots=6 >> > limit users {*} queues {all.q,himem.q} to slots=32 >> > limit users {*} queues {basic.q} to slots=40 >> > } >> > >> > There are plenty of consumables available: >> > >> > [root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q a >> > queuename qtype resv/used/tot. 
load_avg arch >> states >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-0.local BP 0/4/4 5.24 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=0 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-1.local BP 0/10/15 9.58 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-10.local BP 0/9/16 9.80 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=7 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-11.local BP 0/11/16 9.18 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-12.local BP 0/11/16 9.72 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-13.local BP 0/10/16 9.14 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-14.local BP 0/10/16 9.66 lx-amd64 >> > hc:h_vmem=28.890G >> > hc:s_vmem=30.990G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-15.local BP 0/10/16 9.54 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-16.local BP 0/10/16 10.01 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-17.local BP 0/11/16 9.75 lx-amd64 >> > hc:h_vmem=29.963G >> > hc:s_vmem=32.960G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-18.local BP 0/11/16 10.29 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-19.local BP 0/9/14 9.01 lx-amd64 >> > qf:h_vmem=5.000G >> > qf:s_vmem=5.000G >> > qc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-2.local BP 0/10/15 9.24 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-20.local BP 0/0/4 0.00 lx-amd64 >> > qf:h_vmem=3.200G >> > qf:s_vmem=3.200G >> > qc:slots=4 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-3.local BP 0/11/15 9.62 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=4 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-4.local BP 0/12/15 9.85 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > qc:slots=3 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-5.local BP 0/12/15 10.18 lx-amd64 >> > hc:h_vmem=36.490G >> > hc:s_vmem=39.390G >> > qc:slots=3 >> > 
------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-6.local BP 0/12/16 9.95 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=4 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-7.local BP 0/10/16 9.59 lx-amd64 >> > hc:h_vmem=36.935G >> > qf:s_vmem=40.000G >> > hc:slots=5 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-8.local BP 0/10/16 9.37 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-0-9.local BP 0/10/16 9.38 lx-amd64 >> > qf:h_vmem=40.000G >> > qf:s_vmem=40.000G >> > hc:slots=6 >> > ------------------------------------------------------------ >> --------------------- >> > all.q@compute-gpu-0.local BP 0/0/4 0.05 lx-amd64 >> > qf:h_vmem=3.200G >> > qf:s_vmem=3.200G >> > qc:slots=4 >> > >> > >> > On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <becke...@mail.nih.gov> >> wrote: >> > On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote: >> > SoGE 8.1.8 >> > >> > Hi, >> > >> > I'm getting some queued jobs with scheduling info that includes this >> line >> > at the end: >> > >> > cannot run in PE "unihost" because it only offers 0 slots >> > >> > 'unihost' is the only PE I use. When users request multiple slots, they >> use >> > 'unihost': >> > >> > ... -binding linear:2 -pe unihost 2 ... >> > >> > What happens is that these jobs aren't running when it otherwise seems >> like >> > they should be, or they sit waiting in the queue for a long time even >> when >> > the user has plenty of quota available within the queue they've >> requested, >> > and there are enough resources available on the queue's nodes (slots and >> > vram are consumables). >> > >> > Any suggestions about how I might further understand this? >> > >> > This *exact* problem has bitten me in the past. It seems to crop up >> > about every 3 years--long enough to remember it was a problem, and long >> > enough to forget just what the [censored] I did to fix it. >> > >> > As I recall, it has little to do with actual PEs, but everything to do >> > with complexes and resource requests. >> > >> > You might glean a bit more information by running "qsub -w p" (or "-w >> e"). >> > >> > Take a look at these previous discussions: >> > >> > http://gridengine.org/pipermail/users/2011-November/001932.html >> > http://comments.gmane.org/gmane.comp.clustering.opengridengi >> ne.user/1700 >> > >> > >> > -- >> > Jesse Becker (Contractor) >> > >> > _______________________________________________ >> > users mailing list >> > users@gridengine.org >> > https://gridengine.org/mailman/listinfo/users >> >> >