Hi Reuti,

There are dozens of hosts in @gpu. In my test submissions, however, I am using only one host that I specify with '-l hostname='. I disabled all other queues on this host to make sure nothing else but my test jobs are running there.
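For reference, my failing test submissions look roughly like this ('node01' stands in for the actual host I pinned the jobs to, and 'sleep.sh' is just a trivial script that sleeps):

    # with a PE and a fixed host -- these jobs stay in qw
    qsub -q test_gpu.q -l hostname=node01 -pe pe_1 1 sleep.sh
    qsub -q test_gpu.q -l hostname=node01 -pe pe_4 4 sleep.sh

    # the same job without a PE starts immediately
    qsub -q test_gpu.q -l hostname=node01 sleep.sh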
BTW, after several hours, my PE 1 job went through. My submissions to the regular queue worked fine.

Update: As I was writing this response, I tried one change in the queue configuration: I created a new host group with only one node in it and changed my test queue to run only on that host group. I submitted a couple of PE jobs with allocation rules '1', '2', '4', and did not request a specific hostname this time. The jobs started running immediately, and the old jobs that had been waiting also went through. (The qconf commands I used for this change are sketched at the end of this message.)

After discovering that, I tested the normal production queue, combining '-l hostname=' and '-pe'. These jobs did not run, and 'qalter -w v' reported:

    cannot run because it exceeds limit "ilya/////" in rule "limit_slots_for_users/1"

So in my cluster there seems to be some issue with the combination of RQS, a PE, and '-l hostname=' that makes jobs unschedulable. I wonder if anyone else can reproduce this behavior, to see whether this is an SGE bug or some problem in my configuration.

Ilya.

On Fri, Apr 20, 2018 at 12:34 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> Am 20.04.2018 um 21:04 schrieb Ilya M:
>
> > Hello,
> >
> > I set up a test queue to test new prolog/epilog scripts and I am seeing
> > some strange behavior when I submit a PE job to this queue, which causes
> > the job to not get scheduled forever or for a very long period of time.
> > I tried several PEs with allocation rules of '1', '2', '4', all to no
> > avail. Submitting a job without a PE makes it run immediately. I am
> > using SGE 2.6u5.
> >
> > Checking why it is not running:
> > $ qalter -w v 7301747
> > ...
> > Job 7301747 cannot run because it exceeds limit "ilya/////" in rule "limit_slots_for_users/1"
> > Job 7301747 cannot run in PE "pe_1" because it only offers 0 slots
>
> This error message is often misleading, although there is a real reason
> preventing the scheduling.
>
> > verification: no suitable queues
> >
> > $ qconf -sp pe_1
> > pe_name            pe_1
> > slots              9999999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    startmpi.sh $pe_hostfile
> > stop_proc_args     stopmpi.sh $pe_hostfile
> > allocation_rule    1
> > control_slaves     TRUE
> > job_is_first_task  TRUE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > $ qconf -srqs limit_slots_for_users
> > {
> >    name         limit_slots_for_users
> >    description  "limit the number of simultaneous slots any user can use"
> >    enabled      TRUE
> >    limit        users {*} to slots=800
> > }
> >
> > And finally,
> > $ qstat
> > job-ID   prior    name   user  state  submit/start at      queue  slots  ja-task-ID
> > ------------------------------------------------------------------------------------
> > 7301584  0.60051  sleep  ilya  qw     04/20/2018 18:29:26          4
> > 7301747  0.50051  sleep  ilya  qw     04/20/2018 18:36:23          1
> >
> > So I am not running anything at the moment. If I submit a job with the
> > same PE to a production queue, it will get scheduled.
> >
> > A job that I left hanging last night finally got scheduled after 7-8 hours.
> >
> > The test queue is as follows:
> > qconf -sq test_gpu.q
> > qname                 test_gpu.q
> > hostlist              @gpu
>
> How many hosts are in @gpu? The allocation_rule 1 means exactly one slot
> per machine – not '1' assigned as often as the node can be filled (this is
> different from Torque, where this can be assigned several times per host).
>
> > seq_no                0
> > load_thresholds       np_load_avg=1.75
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH INTERACTIVE
> > ckpt_list             NONE
> > pe_list               make pe_1 pe_2 pe_3 pe_4 pe_slots
> > rerun                 TRUE
> > slots                 4
> > tmpdir                /data
> > shell                 /bin/sh
> > prolog                sgeg...@prolog.sh
> > epilog                sgeg...@epilog.sh
> > shell_start_mode      unix_behavior
> > starter_method        NONE
> > suspend_method        NONE
> > resume_method         NONE
> > terminate_method      custom_kill -p $job_pid -j $job_id
>
> I don't know about your custom_kill procedure, but it should kill
> -$job_pid, i.e. the process group and not only a single process.
>
> - Reuti
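P.S. For anyone trying to reproduce the host group change described above, this is roughly what I did (the '@gpu_test' group name and 'node01' host name are placeholders for the actual names in my setup, and 'sleep.sh' is just the trivial test script):

    # add a host group containing just the single test node
    # (qconf opens an editor; set "hostlist node01" there)
    qconf -ahgrp @gpu_test

    # point the test queue at that group instead of @gpu
    # (in the editor, change "hostlist @gpu" to "hostlist @gpu_test")
    qconf -mq test_gpu.q

    # submit PE jobs without requesting a specific host
    qsub -q test_gpu.q -pe pe_1 1 sleep.sh
    qsub -q test_gpu.q -pe pe_4 4 sleep.sh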