We have a variety of job types running on our system. Some are short single-core jobs and others are long multi-core jobs. To keep the large jobs from being starved we use job reservations. All jobs are submitted with appropriate h_rt values.
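For reference, a typical large submission here looks roughly like the sketch below (the slot count, limits and script name are illustrative placeholders, not an exact production command):

% qsub -R y -pe thread 24 -l h_rt=48:00:00 -l h_vmem=2G big-job.sh

The short jobs are single-slot submissions with correspondingly small h_rt values.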
In trying to diagnose some apparent scheduling issues I have turned job monitoring on, but I'm not seeing reservations being applied for many of the jobs.

% qconf -ssconf
algorithm                         default
schedule_interval                 0:00:45
maxujobs                          0
queue_sort_method                 seqno
job_load_adjustments              NONE
load_adjustment_decay_time        0:7:30
load_formula                      m_core-slots
schedd_job_info                   true
flush_submit_sec                  5
flush_finish_sec                  30
params                            MONITOR=1
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=0.500000,mem=0.500000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.000000
weight_tickets_functional         10000000
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OF
weight_ticket                     2.000000
weight_waiting_time               0.000050
weight_deadline                   3600000.000000
weight_urgency                    0.000000
weight_priority                   10.000000
max_reservation                   50
default_duration                  168:00:00
%

Submitted jobs:

% qsub -R y -p 500 -l hostname=bc130 do-sleep
Your job 7684269 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc131 do-sleep
Your job 7684270 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc132 do-sleep
Your job 7684271 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc133 do-sleep
Your job 7684272 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc134 -pe thread 24 do-sleep
Your job 7684280 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc135 -pe thread 24 do-sleep
Your job 7684281 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc136 -pe thread 24 do-sleep
Your job 7684282 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc137 -pe thread 24 do-sleep
Your job 7684283 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc138 -pe thread 12 do-sleep
Your job 7684286 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc139 -pe thread 12 do-sleep
Your job 7684287 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc140 -pe thread 8 do-sleep
Your job 7684289 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc141 -pe thread 4 do-sleep
Your job 7684290 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc142 -pe thread 2 do-sleep
Your job 7684292 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc143 -pe thread 1 do-sleep
Your job 7684293 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc144 do-sleep
Your job 7684294 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc145 -pe orte 1 do-sleep
Your job 7684439 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc144 do-sleep
Your job 7684589 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc144 -l exclusive do-sleep
Your job 7684590 ("SLEEPER") has been submitted
% qsub -R y -p 500 -l hostname=bc144 -l exclusive=0 do-sleep
Your job 7684591 ("SLEEPER") has been submitted
%
% qsub -R y -p 500 -l hostname=bc168 -pe thread 40 do-sleep
Your job 7684595 ("SLEEPER") has been submitted
%

These submissions all request reservations with "-R y" and bump up the priority to ensure they are at the top of the queue. They request specific hosts already running the large user jobs. Job 7684595 requests all cores on a different node type running many of the smaller jobs.
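As a sanity check that "-R y" and the bumped priority actually made it onto the pending jobs, something like the following should show the reservation flag, priority and PE request on one of the waiting jobs (a sketch; I'm going from memory on the exact qstat -j field labels, which may differ slightly in 8.1.8):

% qstat -j 7684281 | egrep 'reserve|priority|parallel environment|hard resource_list'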
The do-sleep script does a random sleep and has the following at the start of the script:

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -m n
#$ -N SLEEPER
#$ -o LOGS
#$ -p -500           # lowest priority
#$ -l h_rt=1:00:00   # time limit
#$ -l h_vmem=0.5G    # memory limit

Looking at /opt/sge_root/betsy/common/schedule I see:

% cat /opt/sge_root/betsy/common/schedule | grep ':RESERVING:' | sort | uniq -c
     93 7684269:1:RESERVING:1583062196:3660:H:bc130.fda.gov:h_vmem:536870912.000000
     93 7684269:1:RESERVING:1583062196:3660:H:bc130.fda.gov:ram:536870912.000000
     93 7684269:1:RESERVING:1583062196:3660:H:bc130.fda.gov:slots:1.000000
     93 7684269:1:RESERVING:1583062196:3660:Q:sh...@bc130.fda.gov:exclusive:1.000000
     93 7684269:1:RESERVING:1583062196:3660:Q:sh...@bc130.fda.gov:slots:1.000000
     93 7684270:1:RESERVING:1583018960:3660:H:bc131.fda.gov:h_vmem:536870912.000000
     93 7684270:1:RESERVING:1583018960:3660:H:bc131.fda.gov:ram:536870912.000000
     93 7684270:1:RESERVING:1583018960:3660:H:bc131.fda.gov:slots:1.000000
     93 7684270:1:RESERVING:1583018960:3660:Q:sh...@bc131.fda.gov:exclusive:1.000000
     93 7684270:1:RESERVING:1583018960:3660:Q:sh...@bc131.fda.gov:slots:1.000000
     93 7684271:1:RESERVING:1583063682:3660:H:bc132.fda.gov:h_vmem:536870912.000000
     93 7684271:1:RESERVING:1583063682:3660:H:bc132.fda.gov:ram:536870912.000000
     93 7684271:1:RESERVING:1583063682:3660:H:bc132.fda.gov:slots:1.000000
     93 7684271:1:RESERVING:1583063682:3660:Q:sh...@bc132.fda.gov:exclusive:1.000000
     93 7684271:1:RESERVING:1583063682:3660:Q:sh...@bc132.fda.gov:slots:1.000000
     93 7684272:1:RESERVING:1583076042:3660:H:bc133.fda.gov:h_vmem:536870912.000000
     93 7684272:1:RESERVING:1583076042:3660:H:bc133.fda.gov:ram:536870912.000000
     93 7684272:1:RESERVING:1583076042:3660:H:bc133.fda.gov:slots:1.000000
     93 7684272:1:RESERVING:1583076042:3660:Q:sh...@bc133.fda.gov:exclusive:1.000000
     93 7684272:1:RESERVING:1583076042:3660:Q:sh...@bc133.fda.gov:slots:1.000000
     73 7684294:1:RESERVING:1582975134:3660:H:bc144.fda.gov:h_vmem:536870912.000000
     73 7684294:1:RESERVING:1582975134:3660:H:bc144.fda.gov:ram:536870912.000000
     73 7684294:1:RESERVING:1582975134:3660:H:bc144.fda.gov:slots:1.000000
     73 7684294:1:RESERVING:1582975134:3660:Q:sh...@bc144.fda.gov:exclusive:1.000000
     73 7684294:1:RESERVING:1582975134:3660:Q:sh...@bc144.fda.gov:slots:1.000000
     31 7684589:1:RESERVING:1582975134:3660:H:bc144.fda.gov:h_vmem:536870912.000000
     31 7684589:1:RESERVING:1582975134:3660:H:bc144.fda.gov:ram:536870912.000000
     31 7684589:1:RESERVING:1582975134:3660:H:bc144.fda.gov:slots:1.000000
     31 7684589:1:RESERVING:1582975134:3660:Q:sh...@bc144.fda.gov:exclusive:1.000000
     31 7684589:1:RESERVING:1582975134:3660:Q:sh...@bc144.fda.gov:slots:1.000000
     30 7684590:1:RESERVING:1582978794:3660:H:bc144.fda.gov:h_vmem:536870912.000000
     30 7684590:1:RESERVING:1582978794:3660:H:bc144.fda.gov:ram:536870912.000000
     30 7684590:1:RESERVING:1582978794:3660:H:bc144.fda.gov:slots:1.000000
     30 7684590:1:RESERVING:1582978794:3660:Q:sh...@bc144.fda.gov:exclusive:1.000000
     30 7684590:1:RESERVING:1582978794:3660:Q:sh...@bc144.fda.gov:slots:1.000000
     25 7684591:1:RESERVING:1582975134:3660:H:bc144.fda.gov:h_vmem:536870912.000000
     25 7684591:1:RESERVING:1582975134:3660:H:bc144.fda.gov:ram:536870912.000000
     25 7684591:1:RESERVING:1582975134:3660:H:bc144.fda.gov:slots:1.000000
     25 7684591:1:RESERVING:1582975134:3660:Q:sh...@bc144.fda.gov:exclusive:1.000000
     25 7684591:1:RESERVING:1582975134:3660:Q:sh...@bc144.fda.gov:slots:1.000000
      2 7684595:1:RESERVING:1585427949:3660:H:bc168.fda.gov:h_vmem:21474836480.000000
      2 7684595:1:RESERVING:1585427949:3660:H:bc168.fda.gov:ram:21474836480.000000
      2 7684595:1:RESERVING:1585427949:3660:H:bc168.fda.gov:slots:40.000000
      2 7684595:1:RESERVING:1585427949:3660:P:thread:slots:40.000000
      2 7684595:1:RESERVING:1585427949:3660:Q:l...@bc168.fda.gov:exclusive:40.000000
      2 7684595:1:RESERVING:1585427949:3660:Q:l...@bc168.fda.gov:slots:40.000000
%
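For a quicker summary of which jobs got any reservation at all, the same file can be reduced with a one-liner like this (a sketch assuming the colon-separated record layout shown above, with the job id in field 1 and the state in field 3):

% awk -F: '$3 == "RESERVING" {print $1}' /opt/sge_root/betsy/common/schedule | sort -u

Against the dump above that yields only the eight jobs submitted without a PE (7684269-7684272, 7684294, 7684589-7684591) plus 7684595; none of the other "-pe" requests appear.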
Among others there should have been a reservation for job 7684281 on node bc135. Looking at the scheduling information for bc135 shows:

% cat /opt/sge_root/betsy/common/schedule | grep 'bc135' | sort | uniq -c
    123 7678355:1:RUNNING:1582831417:172860:H:bc135.fda.gov:h_vmem:51539607552.000000
    123 7678355:1:RUNNING:1582831417:172860:H:bc135.fda.gov:ram:51539607552.000000
    123 7678355:1:RUNNING:1582831417:172860:H:bc135.fda.gov:slots:24.000000
    123 7678355:1:RUNNING:1582831417:172860:Q:sh...@bc135.fda.gov:exclusive:24.000000
    123 7678355:1:RUNNING:1582831417:172860:Q:sh...@bc135.fda.gov:slots:24.000000
%

It looks like most of the jobs requesting a parallel environment are not getting reservations on nodes with existing jobs using similar parallel environments. However, job 7684595 did get a reservation for its parallel environment.

There are almost 400 pending jobs for other users in the queue. Most are from a single user requesting reservations with "-R y -pe thread 24", but those do not seem to be getting any reservations either. My jobs are at the top of the queue due to the priority.

The relevant complex variables are:

% qconf -sc | egrep 'h_vmem|ram|slots|exclusive|relop|---'
#name        shortcut   type     relop   requestable   consumable   default   urgency
#----------------------------------------------------------------------------------------------
exclusive    excl       BOOL     EXCL    YES           YES          0         50
h_vmem       h_vmem     MEMORY   <=      YES           YES          2G        0
ram          ram        MEMORY   <=      YES           YES          0         0
slots        s          INT      <=      YES           YES          1         100
# >#< starts a comment but comments are not saved across edits --------
%

The parallel environment is:

% qconf -sp thread
pe_name              thread
slots                99999
user_lists           NONE
xuser_lists          NONE
start_proc_args      NONE
stop_proc_args       NONE
allocation_rule      $pe_slots
control_slaves       TRUE
job_is_first_task    TRUE
urgency_slots        min
accounting_summary   TRUE
qsort_args           NONE
%

We are running Son of Grid Engine 8.1.8. I see issues #1552 and #1553 fixed in 8.1.9, but those don't seem relevant.

Any thoughts on what might be happening?

Thanks,
Stuart Barkley

--
I've never been lost; I was once bewildered for three days, but never lost!
    --  Daniel Boone

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users