I have a new insight that's very helpful. Thanks to Mark Bergman, who
mentioned that the 'PE offers 0 slots' error/warning can also indicate
memory limitations.

If the stuck-job problem is happening to a user, I can get jobs to run if I
make no memory request, or make a memory request (e.g., -l h_vmem=...)
that's less than the default value for the complex. If I request more than
about 100M above the default, the job gets stuck with the "PE offers 0
slots" warning. Interesting!
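For reference, the tests look roughly like this (the PE name, job script,
and memory values are placeholders, not our exact setup):

```shell
# No memory request: runs fine
qsub -pe unihost 2 job.sh

# Request below the complex's default value: also runs
qsub -pe unihost 2 -l h_vmem=2G job.sh

# Request more than ~100M above the default: gets stuck with
# "cannot run in PE ... because it only offers 0 slots"
qsub -pe unihost 2 -l h_vmem=8G job.sh
```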

Any thoughts on this? Again, this is happening when there are plenty of
resources on the nodes and plenty of room in the users' quotas.

I'll test more tomorrow, but this may mean I can at least get a workaround
going by having a large default request and forcing users to make an
explicit memory request.
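Roughly, the workaround I'm imagining (a sketch; the actual default value
would be site-specific):

```shell
# Set a large default for h_vmem in the complex definition
# ('qconf -mc' opens the complex list in $EDITOR; the default is
# the second-to-last field of the h_vmem line), then have users
# always request memory explicitly, e.g.:
qsub -pe unihost 2 -l h_vmem=4G job.sh
```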

-M

On Tue, Aug 15, 2017 at 6:40 PM, Michael Stauffer <mgsta...@gmail.com>
wrote:

> ##
>> Regarding the 'int_test' PE you created: if you set the allocation rule
>> to an integer, it means the job _must_ request a number of slots equal
>> to, or a multiple of, that value.
>> In your case, the PE is defined with '8' as the allocation rule, so your
>> job must request 8, 16, 24, ... slots. If you request 2, the job will
>> never start, as the scheduler can't allocate 2 slots with the allocation
>> rule set to 8.
>>
>> From man sge_pe:
>> "If the number of tasks specified with the "-pe" option (see
>> qsub(1)) does not divide without remainder by this <int> the job
>> will not be scheduled."
>>
>> So the fact that the job in int_test never starts when it requests 2
>> cores is totally fine from the scheduler's point of view.
>>
>
> OK, thanks very much, that explains it. I'll test accordingly.
>
>
>> ##
>> Regarding this issue in general: just wondering whether you, or users on
>> the cluster, use the '-R y' (reservation) option for their jobs? I have
>> seen this behavior when someone submits a job with a reservation defined.
>> The scheduler reserves slots on the cluster for that big job and doesn't
>> let new jobs in (especially when no runtime is defined via h_rt). In
>> this case there will be no messages in the scheduler log, which can be
>> confusing.
>>
>
> I don't think users are using '-R y', but I'm not sure. Do you know how I
> can tell? I think 'qstat -g c' shows that in the RES column? I don't
> think I've ever seen a non-zero value there, but I'll pay attention.
> However, the stuck-job issue is happening right now to at least one user,
> and the RES column is all zeros.
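> One way I might check for '-R y' across running jobs (a sketch --
> assuming 'qstat -j' on our version prints a "reserve: y" line for jobs
> submitted with reservation; I'd need to verify that):
>
> ```shell
> for j in $(qstat -u '*' | awk 'NR>2 {print $1}' | sort -u); do
>     qstat -j "$j" 2>/dev/null | grep -q '^reserve: *y' && echo "$j"
> done
> ```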
>
> -M
>
>
>>
>> Best regards,
>> Mikhail Serkov
>>
>> On Fri, Aug 11, 2017 at 6:41 PM, Michael Stauffer <mgsta...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> Below I've dumped relevant configurations.
>>>
>>> Today I created a new PE called "int_test" to test the "integer"
>>> allocation rule. I set it to 16 (16 cores per node), and have also tried
>>> 8. It's been added as a PE to the queues we use. When I try to run on
>>> this new PE, however, it *always* fails with the same "PE ... offers 0
>>> slots" error, even when I can run the same multi-slot job using the
>>> "unihost" PE at the same time. I'm not sure if this helps debug or not.
>>>
>>> Another thought - this behavior started more or less around the time I
>>> tried implementing fairshare. I never seemed to get fairshare working
>>> right. We haven't been able to confirm this, but for some users the "PE
>>> 0 slots" issue seems to pop up only after they've been running other
>>> jobs for a little while. So I'm wondering if I've misconfigured
>>> fairshare in some way that's causing this odd behavior.
>>>
>>>
>>>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
