>
> ##
> Regarding the 'int_test' PE you created: if you set the allocation rule to
> an integer, the job _must_ request a number of slots equal to, or a
> multiple of, that value.
> In your case the PE is defined with '8' as the allocation rule, so your job
> must request 8, 16, 24, ... slots. If it requests 2, the job will never
> start, because the scheduler can't allocate 2 slots with the allocation
> rule set to 8.
>
> From man sge_pe:
> "If the number of tasks specified with the "-pe" option (see qsub(1))
> does not divide without remainder by this <int> the job will not be
> scheduled."
>
> So the fact that a job in int_test never starts when it requests 2 cores
> is entirely expected from the scheduler's point of view.
>

OK, thanks very much, that explains it. I'll test accordingly.
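
Just to make sure I've got it, here's how I'm reading it (the PE dump
below is a sketch with assumed values and trimmed fields, not my actual
config):

$ qconf -sp int_test
pe_name            int_test
slots              999
allocation_rule    8        # scheduler allocates exactly 8 slots per host
control_slaves     FALSE
job_is_first_task  TRUE

# Schedulable: the slot count divides evenly by 8
$ qsub -pe int_test 8  job.sh
$ qsub -pe int_test 16 job.sh

# Never schedulable: 2 is not a multiple of 8, so the scheduler reports
# that the PE offers 0 slots for this request
$ qsub -pe int_test 2  job.sh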


> ##
> Regarding this issue in general: I'm just wondering whether you, or users
> on the cluster, use the '-R y' (reservation) option for their jobs? I have
> seen this kind of behavior when someone submits a job with a reservation
> defined. The scheduler reserves slots on the cluster for that big job and
> doesn't let new jobs start (especially when the runtime is not bounded by
> h_rt). In that case there are no messages in the scheduler log, which can
> be confusing.
>

I don't think users are using '-R y', but I'm not sure. Do you know how I
can tell? I think 'qstat -g c' shows reservations in the RES column? I don't
think I've ever seen a non-zero value there, but I'll pay attention. However,
the stuck-job issue is happening right now to at least one user, and the RES
column is all zeros.
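
Here's how I plan to check, in case that's useful to anyone else. I
believe 'qstat -j' prints a 'reserve:' line when a job was submitted
with '-R y', and 'qconf -ssconf' shows whether reservation is enabled
at all via max_reservation - correct me if I'm wrong:

# reservations only happen at all if this is > 0
$ qconf -ssconf | grep max_reservation

# did a given pending job ask for a reservation?
$ qstat -j <jobid> | grep -i reserve

# cluster-wide view; the RES column shows currently reserved slots
$ qstat -g c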

-M


>
> Best regards,
> Mikhail Serkov
>
> On Fri, Aug 11, 2017 at 6:41 PM, Michael Stauffer <mgsta...@gmail.com>
> wrote:
>
>> Hi,
>>
>>
>> Below I've dumped relevant configurations.
>>
>> Today I created a new PE called "int_test" to test the "integer"
>> allocation rule. I set it to 16 (16 cores per node), and have also tried 8.
>> It's been added as a PE to the queues we use. When I try to run a job in
>> this new PE, however, it *always* fails with the same "PE ... offers 0
>> slots" error, even though I can run the same multi-slot job using the
>> "unihost" PE at the same time. I'm not sure if this helps debug or not.
>>
>> Another thought - this behavior started more or less around the time I
>> tried implementing fairshare, which I never seemed to get working right.
>> We haven't been able to confirm it, but for some users this "PE 0 slots"
>> issue seems to pop up only after they've been running other jobs for a
>> little while. So I'm wondering if I've screwed up fairshare in some way
>> that's causing this odd behavior.
>>
>>
>>
