I have found the source of the PE slots problem!
It was this complex value, which I had added to limit the total number of jobs of a certain type that can run simultaneously on the cluster:
connections       conn      INT       <=    YES YES        NONE     0

The default value was set to 'NONE', which is probably represented internally as 2147483648 :) That was wrong, because this is an INT complex. When I changed the default to 0, i.e.:
connections       conn      INT       <=    YES YES        0     0

I stopped having the PE scheduling problem!
I thought this might be useful to other people who might make the same mistake!
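
For anyone who wants to check their own configuration for the same mistake, here is a minimal sketch. It assumes the complex list is in the 8-column layout shown above (name, shortcut, type, relop, requestable, consumable, default, urgency), for example saved from `qconf -sc`. The function name `find_suspect_defaults` and the sample text are only illustrative, and the check covers INT complexes only, since that is the one case I actually hit.

```python
# Sketch: flag INT complexes whose default column is NONE.
# Assumed column order (as in the entries quoted in this thread):
# name shortcut type relop requestable consumable default urgency

def find_suspect_defaults(complex_text):
    """Return the names of INT complexes whose default is NONE."""
    suspects = []
    for line in complex_text.splitlines():
        fields = line.split()
        # Skip comment/header lines and anything that is not an 8-column entry.
        if len(fields) != 8 or fields[0].startswith("#"):
            continue
        name, _shortcut, ctype, _relop, _req, _cons, default, _urgency = fields
        if ctype.upper() == "INT" and default.upper() == "NONE":
            suspects.append(name)
    return suspects


# Illustrative input built from the two entries mentioned in this thread.
sample = """\
#name  shortcut type relop requestable consumable default urgency
connections       conn      INT       <=    YES YES        NONE     0
exclusive            excl        BOOL      EXCL    YES YES 0        1000
"""

print(find_suspect_defaults(sample))  # ['connections']
```

Any name it prints is a candidate for having its default changed from NONE to 0, as I did for `connections`.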
Regards,
Razvan

On 10/06/16 17:15, Razvan Sultana wrote:
But looking at the discussion here:
https://arc.liv.ac.uk/trac/SGE/ticket/1429
I saw that you were referencing this ticket:
https://arc.liv.ac.uk/trac/SGE/ticket/793
where there is the same message that I've seen and you mention that EXCL might be to blame?

I have actually added this entry to the complex values:
exclusive            excl        BOOL      EXCL    YES YES 0        1000

I tried taking it out but I still see the same errors :(

Razvan

On 10/06/16 16:50, Razvan Sultana wrote:
Hi William,
I haven't touched the h_rt and s_rt values - they are by default INFINITY:
qconf -sq all.q | grep '_rt'
s_rt                  INFINITY
h_rt                  INFINITY

Razvan

On 10/06/16 15:49, William Hay wrote:
On Fri, Jun 10, 2016 at 03:07:53PM +0100, Razvan Sultana wrote:
the job just sits there in a 'qw' state, with this scheduling info showing:
scheduling info: cannot run in PE "mpi" because it only offers 2147483648 slots

I have tried everything I could think of: changing the number of slots in the
PE queue, changing the allocation rule, etc.
Nothing changed - all the jobs with `-pe mpi` fail to be scheduled.

This looks like a bug to me.
2147483648 is 0x80000000 and it's -2147483648 when seen as a signed int, so
5 > -2147483648
But of course, the number of available slots to the PE should be anything
but this number (I tried 9999, 99, 10  - no change).
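
The arithmetic above can be checked with a short sketch. It only illustrates how the value 2147483648 looks when its low 32 bits are read as a two's-complement signed integer. It makes no claim about how the scheduler actually stores or compares slot counts, and the helper name `as_signed_32` is made up for this example.

```python
def as_signed_32(value):
    """Reinterpret the low 32 bits of value as a two's-complement signed int."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the signed reading is negative.
    return value - 0x100000000 if value & 0x80000000 else value


reported = 2147483648  # the slot count shown in the scheduling info

print(hex(reported))                # 0x80000000
print(as_signed_32(reported))       # -2147483648
print(5 > as_signed_32(reported))   # True
```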

I searched the archives of this list (and of its precursors) for similar error messages. Although the message pops up from time to time, nobody
seems to know why it happens or how to fix it.
Any suggestions to fix this issue?
Does your cluster have a particularly short default runtime? See:
https://arc.liv.ac.uk/trac/SGE/ticket/1429

William

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss

