On 28.02.2012 at 14:44, Txema Heredia Genestar wrote:

> Ok, thanks.
> I have tried dividing the total memory by the number of threads and it
> worked. I didn't try it before because I had previous experiences doing
> that in other parallel jobs and they failed. I suppose I did it in MPI
> processes and such.
>
> As far as I understand, correct me if I'm wrong, the h_vmem limit is set
> just for the shepherd process, isn't it? In that case, if all slots
> are assigned to the same node, the limit is set to the whole memory (say
> 10Gb). But if SGE schedules the job in 5 different nodes, then the limit for
> each node is set to its proportional part (say 2Gb).
> Is that right or am I missing something?

Correct. On the master node of a parallel job, the process of the job script gets a multiplied h_vmem, i.e. 10 GB. In addition, each local `qrsh` gets 2 GB. SGE then checks whether all processes together use more than 10 GB, so you cannot exceed the limit.

Unfortunately, this does not work for processes on slave nodes: each one just gets 2 GB, so you can't use threads there that all work on the same, e.g., 10 GB in case you were granted 5 remote slots in addition and requested 2 GB.

https://arc.liv.ac.uk/trac/SGE/ticket/197

I had the idea to add a new parameter:

limit_to_one_qrsh_per_host yes/no

and then SGE would know to multiply the h_vmem on a slave node by the number of granted slots too. The problem for now is that there is no "main" slave task which can be granted the full amount by intention. The workaround would be to multiply the limit of the first `qrsh` to each slave node by the number of granted slots thereon.

-- Reuti


> Txema
>
>
> PS: Sorry for the delay in answering, I wanted to try a few things before
> emailing again, but I'm having problems with mpi right now. I'll open a new
> thread later on.
>
>
>
> On 22/02/12 19:39, Reuti wrote:
>> On 22.02.2012 at 18:57, Txema Heredia Genestar wrote:
>>
>>> Thanks for your answers, I'll go one by one, but first, a few
>>> clarifications:
>>> 1- We are stuck with 6.1u4. In a few weeks we will install a new cluster,
>>> with a more recent version.
>>> 2- I don't care about "smp". In fact, before reading your answers I never
>>> understood properly the differences between $pe_slots and $fill_up. I have
>>> a $pe_slots parallel environment called "threaded" and the problem is still
>>> there. Basically, I just want my PE to NOT multiply the memory reservation.
>>>
>>> Now, your answers:
>>> Bob - I would like to use "consumable JOB", but, unfortunately, this is not
>>> available until SGE 6.2. Even so, that would screw up any mpi job
>>> trying to run in our cluster.
>>> We mainly run single-core jobs, but from time
>>> to time some threaded or mpi jobs need to be run.
>>>
>>> Mazouzi - Right now I have PEs only available in two "testing" nodes. The
>>> problem happens in them both.
>>>
>>> Reuti - I have tried both combinations: 1-queue@1-node and 1-queue@N-nodes.
>>> No luck, same problem everywhere. In fact, one node has 48Gb while the
>>> other has 56Gb, so when I ask for a 6-threaded 10Gb job (60Gb total), one
>>> node replies stating that it only offers 4 slots, and the other offers 5.
>> Sure, if you request a particular node, then $fill_up and $pe_slots will
>> have the same effect. So you are limited to the installed memory. 60 GB isn't
>> installed, so you can't get it.
>>
>> It's necessary to divide by hand beforehand and request a rounded-up 2 GB,
>> hence you get 12 GB for your 6-slot threaded job.
>>
>> -- Reuti
>>
>>
>>> I have read your ticket and that is exactly my problem, the resources
>>> multiply. But, as far as I know, they solved it with the "consumable JOB"
>>> thing? Unfortunately the links are broken (
>>> http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/devel/rfe/non-multiplied-pe-requests.txt
>>> ).
>>> JSVs are not available in 6.1u4.
>>>
>>> William - Yours is my best bet. A long time ago I tried tinkering with the
>>> "slots" attribute, but never thought about adding this threaded one. I only
>>> see one (minor) flaw in your solution: I cannot ask for a range of
>>> threads (from 4 to 8) as with -pe. This condemns any job submitted while
>>> our cluster is under some load to oblivion in the waiting queue. That would
>>> need to be addressed by manual scheduling. But that will do, thanks.
>>>
>>> Thank you very much.
>>>
>>> Txema
>>>
>>> PS: One last question: As I have no experience with 6.2 and JSV, what
>>> should be my go-to approach once we install our new cluster with an
>>> up-to-date version?
>>>
>>>
>>> On 21/02/12 21:40, Reuti wrote:
>>>> Hi,
>>>>
>>>> On 21.02.2012 at 20:20, Txema Heredia Genestar wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I am having some problems running threaded jobs in SGE 6.1u4. In our
>>>>> cluster, h_vmem is defined as a consumable attribute in all nodes. It is
>>>>> mandatory, all jobs must request it, with a default value of 6Gb. That
>>>>> constraint leads any "parallel" job sent to the cluster to try to reserve
>>>>> a lot of memory (h_vmem * slots). This is ok for most parallel processes
>>>>> (mpi and such). But, sometimes, we need to run "threaded" jobs, where
>>>>> all threads share a chunk of memory (everything on a single node). This
>>>>> leads to situations where I need to send an 8-threaded job that requires,
>>>>> say, 10 Gb of memory, but it cannot be scheduled because no node can
>>>>> handle an 80Gb request. When a memory request cannot be fulfilled, the
>>>>> typical message of "cannot run in PE "smp" because it only offers N
>>>>> slots" appears in qstat (where N is the maximum number of slots I would
>>>>> be able to use given the requested h_vmem size).
>>>>>
>>>>> This is the parallel environment I am trying to use:
>>>>>
>>>>> # qconf -sp smp
>>>>> pe_name           smp
>>>>> slots             9999
>>>>> user_lists        test_users
>>>>> xuser_lists       NONE
>>>>> start_proc_args   /bin/true
>>>>> stop_proc_args    /bin/true
>>>>> allocation_rule   $fill_up
>>>> For SMP mode you will need $pe_slots here, unless you are additionally
>>>> requesting exactly one node in the submission command.
>>>>
>>>> I assume that before you simply got more than one node.
>>>>
>>>> ==
>>>>
>>>> The answer from Bob, changing the complex h_vmem to JOB, would help for this
>>>> type of job, but not if you also have MPI jobs in the cluster.
>>>> I had an
>>>> RFE for introducing this on a PE level:
>>>>
>>>> https://arc.liv.ac.uk/trac/SGE/ticket/197
>>>>
>>>> To cite from the issue: "Therefore I wrote, that an entry in the PE would
>>>> still be advantageous: h_vmem can only be JOBS or YES"
>>>>
>>>> ==
>>>>
>>>> For now: you could adjust the memory request in a JSV depending on the
>>>> requested PE, but for this you need 6.2 IIRC.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> control_slaves    FALSE
>>>>> job_is_first_task FALSE
>>>>> urgency_slots     min
>>>>>
>>>>> The most annoying part of all this is that this behaviour is not
>>>>> consistent: This morning I was able to run a 6-threaded job
>>>>> requesting 10Gb of memory on a 48Gb node. But, in the afternoon, the same
>>>>> job using the very same command on the same node could not be run.
>>>>>
>>>>> Does anyone have any suggestion on how to deal with this?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Txema
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
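
The arithmetic discussed in this thread can be sketched in a few lines of Python. This is only an illustration, assuming (as described in the messages above) that h_vmem is a per-slot consumable which SGE multiplies by the number of requested slots. The function names are made up for this sketch and are not part of SGE.

```python
import math


def per_slot_request_gb(total_gb, slots):
    """Per-slot h_vmem to request so that slots * request covers total_gb.

    Rounded up to a whole number of GB, as suggested in the thread.
    """
    return math.ceil(total_gb / slots)


def reserved_gb(per_slot_gb, slots):
    """Memory reserved for the job when the per-slot request is multiplied by the slots."""
    return per_slot_gb * slots


def max_slots_offered(node_gb, per_slot_gb):
    """How many slots fit on a node for a given per-slot request (assumed model)."""
    return node_gb // per_slot_gb


# The 6-thread job that needs about 10 GB in total:
request = per_slot_request_gb(10, 6)
print(request, reserved_gb(request, 6))

# Requesting the full 10 GB per slot instead, on the 48 GB and 56 GB nodes:
print(max_slots_offered(48, 10), max_slots_offered(56, 10))
```

With the numbers from the thread, the per-slot request comes out at 2 GB (12 GB reserved for 6 slots), while asking for 10 GB per slot leaves room for only 4 and 5 slots on the two test nodes. A submission along the lines of `qsub -pe threaded 6 -l h_vmem=2G job.sh` would then carry the divided request; `job.sh` is a placeholder name here, and `threaded` is the PE mentioned earlier in the thread.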
