> On 12.12.2016 at 07:02, John_Tai <john_...@smics.com> wrote:
> 
> Thank you all for trying to work this out.
> 
>>> allocation_rule $fill_up <--- works better for parallel jobs
> 
> I do want my job to run on one machine only.
> 
>>> control_slaves TRUE <---- you want tight integration with SGE
>>> job_is_first_task <---- can go either way, unless you are sure your software will control job distro...
> 
> And the job will be controlled by my software, not SGE. I only need SGE to keep track of the slots (i.e. CPU cores).
> 
> -------------------------------------------------
> 
> There were no messages on qmaster or ibm038. The job I submitted is not in error, it's just waiting for free slots.
> 
> -------------------------------------------------
> 
> I changed the queue slots setting and removed all other PEs, but I got the same error.
> 
> # qconf -sq all.q
> qname                 all.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
Unless you want to oversubscribe intentionally, the above can be set to NONE. In fact, the scheduler also looks ahead at the load it expects scheduled jobs to add, and together with:

$ qconf -ssconf
...
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30

this can lead to the job not being scheduled. These can even be adjusted to read:

job_load_adjustments              NONE
load_adjustment_decay_time        0:0:0

In your current case, of course, where 8 slots are defined and you are testing with 2, this shouldn't be a problem, though.

Did you set up and/or request any memory per machine?

OTOH: if you submit 2 single-CPU jobs to node ibm038, are they scheduled?

-- Reuti

> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               cores
> rerun                 FALSE
> slots                 8
> tmpdir                /tmp
> shell                 /bin/sh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> # qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
> Your job 92 ("xclock") has been submitted
> # qstat
> job-ID  prior    name    user   state  submit/start at      queue  slots  ja-task-ID
> -------------------------------------------------------------------------------------
>      91 0.55500  xclock  johnt  qw     12/12/2016 13:54:02          2
>      92 0.00000  xclock  johnt  qw     12/12/2016 13:55:59          2
> # qalter -w p 92
> Job 92 cannot run in queue "pc.q" because it is not contained in its hard queue list (-q)
> Job 92 cannot run in queue "sim.q" because it is not contained in its hard queue list (-q)
> Job 92 cannot run in queue "all.q@ibm021" because it is not contained in its hard queue list (-q)
> Job 92 cannot run in queue "all.q@ibm037" because it is not contained in its hard queue list (-q)
> Job 92 cannot run in PE "cores" because it only offers 0 slots
> verification: no suitable queues
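When a parallel job sits in "qw" and qalter -w p reports that a PE "only offers 0 slots", a few read-only checks usually narrow down where the slots are being lost. A minimal sketch; job id 92 is taken from the output above, and qstat -j only prints scheduling messages if schedd_job_info is enabled in the scheduler configuration:

    qstat -j 92          (the scheduler's reasons for leaving the job pending)
    qconf -sq all.q      (verify the slots value and that "cores" is in pe_list)
    qconf -se ibm038     (look for slot limits under complex_values on the exec host)
    qconf -srqs          (look for a resource quota set capping slots)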
> -----Original Message-----
> From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On Behalf Of Coleman, Marcus [JRDUS Non-J&J]
> Sent: Monday, December 12, 2016 1:35
> To: users@gridengine.org
> Subject: Re: [gridengine users] users Digest, Vol 72, Issue 13
> 
> Hi
> 
> I am sure this is your problem... You are submitting a job that requires 2 cores to a queue that has only 1 slot available.
> If your hosts all have the same number of cores, there is no reason to separate them with commas. This is only needed if the hosts have different numbers of slots or you want to manipulate the slots...
> 
> slots 1,[ibm021=8],[ibm037=8],[ibm038=8]
> slots 8
> 
> I would only list the PE I am requesting... unless you plan to use each of those PEs:
> pe_list make mpi smp cores
> pe_list cores
> 
> Also, you mentioned a parallel env; I WOULD change the allocation to $fill_up unless your software (not SGE) controls job distribution.
> 
> qconf -sp cores
> allocation_rule $pe_slots  <--- (only use one machine)
> control_slaves FALSE       <--- (I think you want tight integration)
> job_is_first_task TRUE     <--- (this is true if the first job submitted only kicks off other jobs)
> 
> allocation_rule $fill_up   <--- works better for parallel jobs
> control_slaves TRUE        <---- you want tight integration with SGE
> job_is_first_task          <---- can go either way, unless you are sure your software will control job distro...
> 
> Also, what do the qmaster messages and the associated node's SGE messages say?
> 
> -----Original Message-----
> From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On Behalf Of users-requ...@gridengine.org
> Sent: Sunday, December 11, 2016 9:05 PM
> To: users@gridengine.org
> Subject: [EXTERNAL] users Digest, Vol 72, Issue 13
> 
> Today's Topics:
> 
>    1. Re: CPU complex (John_Tai)
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 12 Dec 2016 05:04:33 +0000
> From: John_Tai <john_...@smics.com>
> To: Christopher Heiny <christopherhe...@gmail.com>
> Cc: "users@gridengine.org" <users@gridengine.org>
> Subject: Re: [gridengine users] CPU complex
> 
> # qconf -sq all.q
> qname                 all.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make mpi smp cores
> rerun                 FALSE
> slots                 1,[ibm021=8],[ibm037=8],[ibm038=8]
> tmpdir                /tmp
> shell                 /bin/sh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> From: Christopher Heiny [mailto:christopherhe...@gmail.com]
> Sent: Monday, December 12, 2016 12:22
> To: John_Tai
> Cc: users@gridengine.org; Reuti
> Subject: Re: [gridengine users] CPU complex
> 
> On Dec 11, 2016 5:11 PM, "John_Tai" <john_...@smics.com> wrote:
> I associated the queue with the PE:
> 
> qconf -aattr queue pe_list cores all.q
> 
> The only slots were defined in the all.q queue, and just the total slots in the PE:
> 
>>> # qconf -sp cores
>>> pe_name            cores
>>> slots              999
>>> user_lists         NONE
>>> xuser_lists        NONE
> 
> Do I need to define slots in another way for each exec host? Is there a way to check the current free slots for a host, other than the qstat -f below?
> 
>> # qstat -f
>> queuename            qtype resv/used/tot. load_avg arch       states
>> ---------------------------------------------------------------------------------
>> all.q@ibm021         BIP   0/0/8          0.02     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm037         BIP   0/0/8          0.00     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm038         BIP   0/0/8          0.00     lx-amd64
> 
> What is the output of the command
>     qconf -sq all.q
> ? (I think that's the right one)
> 
> Chris
> 
> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Saturday, December 10, 2016 5:40
> To: John_Tai
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] CPU complex
> 
> On 09.12.2016 at 10:36, John_Tai wrote:
> 
>> 8 slots:
>> 
>> # qstat -f
>> queuename            qtype resv/used/tot. load_avg arch       states
>> ---------------------------------------------------------------------------------
>> all.q@ibm021         BIP   0/0/8          0.02     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm037         BIP   0/0/8          0.00     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm038         BIP   0/0/8          0.00     lx-amd64
>> ---------------------------------------------------------------------------------
>> pc.q@ibm021          BIP   0/0/1          0.02     lx-amd64
>> ---------------------------------------------------------------------------------
>> sim.q@ibm021         BIP   0/0/1          0.02     lx-amd64
> 
> Is there any limit on slots defined in the exec host, or in an RQS?
> 
> -- Reuti
> 
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> ############################################################################
>>      89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2
>> 
>> -----Original Message-----
>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>> Sent: Friday, December 09, 2016 3:46
>> To: John_Tai
>> Cc: users@gridengine.org
>> Subject: Re: [gridengine users] CPU complex
>> 
>> Hi,
>> 
>> On 09.12.2016 at 08:20, John_Tai wrote:
>> 
>>> I've set up a PE but I'm having problems submitting jobs.
>>> 
>>> - Here's the PE I created:
>>> 
>>> # qconf -sp cores
>>> pe_name            cores
>>> slots              999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $pe_slots
>>> control_slaves     FALSE
>>> job_is_first_task  TRUE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>> qsort_args         NONE
>>> 
>>> - I've then added this to all.q:
>>> 
>>> qconf -aattr queue pe_list cores all.q
>> 
>> How many "slots" were defined in the queue definition for all.q?
>> 
>> -- Reuti
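On the earlier question of checking the current free slots per host without reading through qstat -f: two read-only commands give a compact summary. A sketch; the column layout varies slightly between SGE versions:

    qstat -g c           (cluster queue summary: used, reserved and available slots per queue)
    qhost -q             (per-host view, listing each queue and its slot usage on that host)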
>>> - Now I submit a job:
>>> 
>>> # qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
>>> Your job 89 ("xclock") has been submitted
>>> # qstat
>>> job-ID  prior    name    user   state  submit/start at      queue  slots  ja-task-ID
>>> -------------------------------------------------------------------------------------
>>>      89 0.00000  xclock  johnt  qw     12/09/2016 15:14:25          2
>>> # qalter -w p 89
>>> Job 89 cannot run in PE "cores" because it only offers 0 slots
>>> verification: no suitable queues
>>> # qstat -f
>>> queuename            qtype resv/used/tot. load_avg arch       states
>>> ---------------------------------------------------------------------------------
>>> all.q@ibm038         BIP   0/0/8          0.00     lx-amd64
>>> 
>>> ############################################################################
>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>> ############################################################################
>>>      89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2
>>> 
>>> ----------------------------------------------------
>>> 
>>> It looks like all.q@ibm038 should have 8 free slots, so why is it only offering 0?
>>> 
>>> Hope you can help me.
>>> Thanks
>>> John
>>> 
>>> -----Original Message-----
>>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>>> Sent: Monday, December 05, 2016 6:32
>>> To: John_Tai
>>> Cc: users@gridengine.org
>>> Subject: Re: [gridengine users] CPU complex
>>> 
>>> Hi,
>>> 
>>>> On 05.12.2016 at 09:36, John_Tai <john_...@smics.com> wrote:
>>>> 
>>>> Thank you so much for your reply!
>>>> 
>>>>>> Will you use the consumable virtual_free here instead of mem?
>>>> 
>>>> Yes, I meant to write virtual_free, not mem. Apologies.
>>>> 
>>>>>> For parallel jobs you need to configure a (or some) so-called PE (Parallel Environment).
>>>> 
>>>> My jobs are actually just one process which uses multiple cores, so for example in top one process "simv" is currently using 2 CPU cores (200%).
>>> 
>>> Yes, then it's a parallel job as far as SGE is concerned. Although the entries for start_proc_args resp. stop_proc_args can be left at their defaults, a PE is the paradigm in SGE for a parallel job.
>>> 
>>>>   PID USER   PR  NI  VIRT   RES   SHR  S %CPU   %MEM  TIME+     COMMAND
>>>>  3017 kelly  20   0  3353m  3.0g  165m R 200.0  0.6   15645:46  simv
>>>> 
>>>> So I'm not sure a PE is suitable for my case, since it is not multiple parallel processes running at the same time. Am I correct?
>>>> 
>>>> If so, I am trying to find a way to get SGE to keep track of the number of cores used, but I believe it only keeps track of the total CPU usage in %. I guess I could use this and the <total num cores> to get the <num of cores in use>, but how do I integrate it in SGE?
>>> 
>>> You can specify the necessary number of cores for your job in the -pe parameter, which can also be a range. The allocation granted by SGE can be checked in the job script via $NHOSTS, $NSLOTS and $PE_HOSTFILE.
>>> 
>>> Having this setup, SGE will track the number of used cores per machine. The available ones you define in the queue definition. In case you have more than one queue per exec host, an overall limit on the cores that can be used at the same time needs to be set up in addition, to avoid oversubscription.
>>> 
>>> -- Reuti
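A common way to implement the overall per-host limit mentioned above is a slots entry in complex_values on each exec host, so that all queues on a machine together can never grant more slots than it has cores. A sketch, assuming 8-core hosts and an SGE admin account:

    qconf -aattr exechost complex_values slots=8 ibm021
    qconf -aattr exechost complex_values slots=8 ibm037
    qconf -aattr exechost complex_values slots=8 ibm038
    qconf -se ibm038     (verify that complex_values now shows slots=8)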
>>>> Thank you again for your help.
>>>> 
>>>> John
>>>> 
>>>> -----Original Message-----
>>>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>>>> Sent: Monday, December 05, 2016 4:21
>>>> To: John_Tai
>>>> Cc: users@gridengine.org
>>>> Subject: Re: [gridengine users] CPU complex
>>>> 
>>>> Hi,
>>>> 
>>>> On 05.12.2016 at 08:00, John_Tai wrote:
>>>> 
>>>>> Newbie here, hoping to understand SGE usage.
>>>>> 
>>>>> I've successfully configured virtual_free as a complex for telling SGE how much memory is needed when submitting a job, as described here:
>>>>> 
>>>>> https://docs.oracle.com/cd/E19957-01/820-0698/6ncdvjclk/index.html#i1000029
>>>>> 
>>>>> How do I do the same for telling SGE how many CPU cores a job needs? For example:
>>>>> 
>>>>> qsub -l mem=24G,cpu=4 myjob
>>>> 
>>>> Will you use the consumable virtual_free here instead of mem?
>>>> 
>>>>> Obviously I'd need SGE to keep track of the actual CPU utilization on the host, just as virtual_free is being tracked independently of the SGE jobs.
>>>> 
>>>> For parallel jobs you need to configure a (or some) so-called PE (Parallel Environment). Its purpose is to make preparations for parallel jobs, like rearranging the list of granted slots, preparing shared directories between the nodes, ...
>>>> 
>>>> These PEs were of greater importance in former times, when parallel libraries were not programmed to integrate automatically with SGE for a tight integration. Your submissions could read:
>>>> 
>>>> qsub -pe smp 4 myjob     # allocation_rule $pe_slots, control_slaves true
>>>> qsub -pe orte 16 myjob   # allocation_rule $round_robin, control_slaves true
>>>> 
>>>> where smp resp. orte is the chosen parallel environment for OpenMP resp. Open MPI. Their settings are explained in `man sge_pe`, and the "-pe" parameter to the submission command in `man qsub`.
>>>> 
>>>> -- Reuti
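Putting the pieces of this thread together: with the "cores" PE attached to all.q, and allocation_rule $pe_slots keeping every granted slot on one host, a job script for the single-process/multi-core case could look roughly like this. A sketch only; the -threads option of simv is hypothetical, substitute whatever your software actually expects:

    #!/bin/sh
    #$ -pe cores 2-4            # request 2 to 4 slots on a single host
    #$ -l virtual_free=24G      # memory request via the consumable set up earlier
    echo "granted $NSLOTS slot(s) on $NHOSTS host(s)"
    cat "$PE_HOSTFILE"          # one line per host: hostname, granted slots, queue, ...
    ./simv -threads "$NSLOTS"   # hypothetical flag: tell the application how many cores to use

Submitted with qsub, SGE accounts the granted slots against the 8 defined for the host in all.q, which is exactly the bookkeeping asked about at the start of the thread.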