I have actually found the source of the PE slots problem!
It was this complex, which I had added to limit the total
number of jobs of a certain type that can run simultaneously on the cluster:
connections conn INT <= YES YES NONE 0
The default value was set to 'NONE' - which is probably represented as
2147483648 :)
This was wrong, because this is an INT complex, so when I changed the
default to 0, i.e.:
connections conn INT <= YES YES 0 0
I stopped having the PE scheduling problem!
I thought this might be useful to other people that might make the same
mistake!
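
For anyone who wants to check their own configuration for the same mistake, here is a rough Python sketch. It assumes the complex definitions have been saved as text (for example from `qconf -sc`) and that the columns are in the order name, shortcut, type, relop, requestable, consumable, default, urgency. The function name and the sample data are made up for illustration, and only INT complexes are checked, since that is the only case covered in this thread.

```python
# Sample text in the layout assumed above; the first data line reproduces
# the broken entry from this thread, the second one is the EXCL complex.
SAMPLE = """\
#name        shortcut  type  relop  requestable  consumable  default  urgency
connections  conn      INT   <=     YES          YES         NONE     0
exclusive    excl      BOOL  EXCL   YES          YES         0        1000
"""


def int_complexes_with_bad_default(text):
    """Return (name, default) pairs for INT complexes whose default is not an integer."""
    bad = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#'-prefixed header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 8:
            continue  # not a complex definition in the assumed layout
        name, _shortcut, ctype, _relop, _req, _cons, default, _urgency = fields
        if ctype != "INT":
            continue
        try:
            int(default)
        except ValueError:
            bad.append((name, default))
    return bad


print(int_complexes_with_bad_default(SAMPLE))
```

With the sample above, only the `connections` entry is reported. After changing its default to 0 the list comes back empty.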
Regards,
Razvan
On 10/06/16 17:15, Razvan Sultana wrote:
But looking at the discussion here:
https://arc.liv.ac.uk/trac/SGE/ticket/1429
I saw that you were referencing this ticket:
https://arc.liv.ac.uk/trac/SGE/ticket/793
where there is the same message that I've seen and you mention that
EXCL might be to blame?
I have actually added this entry to the complex values:
exclusive excl BOOL EXCL YES YES 0 1000
I tried taking it out but I still see the same errors :(
Razvan
On 10/06/16 16:50, Razvan Sultana wrote:
Hi William,
I haven't touched the h_rt and s_rt values - they are INFINITY by default:
qconf -sq all.q | grep '_rt'
s_rt INFINITY
h_rt INFINITY
Razvan
On 10/06/16 15:49, William Hay wrote:
On Fri, Jun 10, 2016 at 03:07:53PM +0100, Razvan Sultana wrote:
the job just sits there in a 'qw' state, with this scheduling info showing:
scheduling info: cannot run in PE "mpi" because it only offers 2147483648 slots
I have tried everything I could think of - changing the number of slots in
the PE queue, changing the allocation rule, etc.
Nothing changed - all the jobs with `-pe mpi` fail to be scheduled.
This looks like a bug to me.
2147483648 is 0x80000000, which is -2147483648 when seen as a signed int, so
5 > -2147483648
But of course, the number of available slots to the PE should be anything
but this number (I tried 9999, 99, 10 - no change).
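
The signed/unsigned arithmetic above can be checked with a few lines of Python. This is only an illustration of how the same 32 bits read as two different numbers. It assumes a 32-bit integer and says nothing about how the scheduler actually stores the value.

```python
import struct

value = 2147483648
assert value == 0x80000000

# Pack as an unsigned 32-bit integer, then reinterpret the same four bytes
# as a signed 32-bit integer.
(signed,) = struct.unpack("<i", struct.pack("<I", value))

print(signed)      # -2147483648
print(5 > signed)  # True
```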
I searched the archives of this list (and of its precursors) for similar
error messages. Although this one pops up from time to time, nobody
seems to know why it happens or how to fix it.
Any suggestions to fix this issue?
Does your cluster have a particularly short default runtime? See:
https://arc.liv.ac.uk/trac/SGE/ticket/1429
William
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss