Hi Till,

Yes, it turns out the problem was
having flink-queryable-state-runtime_2.11-1.6.2.jar in flink/lib. I guess
Queryable State bootstraps itself and, in my situation, it brought the task
manager down when it found no available ports. What's a little troubling is
that I had not configured Queryable State at all, so I would not expect it
to get in the way. I haven't looked further into it, but I think that if
Queryable State wants to enable itself, then it should at least take an
unused port by default, especially since many folks will be running in
shared environments like YARN.
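
To illustrate what I mean by taking an unused port, here is a minimal
sketch in plain Java. It is not Flink's code: the class and method names are
hypothetical, and the port 9067 in main is only an example of a preferred
port. The idea is to try the preferred range first and, if every port in it
is taken, bind to port 0 so the OS hands out a free one.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical sketch, not part of Flink: fall back to an OS-assigned port
// instead of failing when the preferred ports are already in use.
public class FreePortSketch {

    /** Asks the OS for a currently unused TCP port by binding to port 0. */
    static int pickUnusedPort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    /**
     * Tries each port in [start, end] in order. If none can be bound,
     * binds to port 0, which lets the OS choose any free port.
     */
    static ServerSocket bindInRangeOrAny(int start, int end) throws IOException {
        for (int port = start; port <= end; port++) {
            try {
                return new ServerSocket(port);
            } catch (IOException e) {
                // Port is taken or not bindable; try the next one.
            }
        }
        return new ServerSocket(0);
    }

    public static void main(String[] args) throws IOException {
        // 9067 is just an example preferred port for this sketch.
        try (ServerSocket server = bindInRangeOrAny(9067, 9067)) {
            System.out.println("bound to port " + server.getLocalPort());
        }
    }
}
```

On a shared host, two processes running this would end up on different
ports, and neither would go down because the preferred port was taken.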

But anyway, thanks for that! I'm now up with 1.6.2.

Cliff

On Mon, Nov 12, 2018 at 6:04 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Cliff,
>
> the TaskManagers fail to start with exit code 31, which indicates an
> initialization error on startup. If you check the TaskManager logs via
> `yarn logs -applicationId <APP_ID>`, you should see why the TMs
> don't start up.
>
> Cheers,
> Till
>
> On Fri, Nov 9, 2018 at 8:32 PM Cliff Resnick <cre...@gmail.com> wrote:
>
>> Hi Till,
>>
>> Here are Job Manager logs, same job in both 1.6.0 and 1.6.2 at DEBUG
>> level. I saw several errors in 1.6.2, hope it's informative!
>>
>> Cliff
>>
>> On Fri, Nov 9, 2018 at 8:34 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hi Cliff,
>>>
>>> this does not sound right. Could you share the logs of the Yarn cluster
>>> entrypoint with the community for further debugging? Ideally at DEBUG
>>> level. The Yarn logs would also be helpful to fully understand the problem.
>>> Thanks a lot!
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Nov 8, 2018 at 9:59 PM Cliff Resnick <cre...@gmail.com> wrote:
>>>
>>>> I'm running a YARN cluster of 8 * 4-core instances = 32 cores, with a
>>>> configuration of 3 slots per TM. The cluster is dedicated to a single job
>>>> that runs at full capacity in "FLIP6" mode. So in this cluster, the
>>>> parallelism is 21 (7 TMs * 3, one container dedicated to the Job Manager).
>>>>
>>>> When I run the job in 1.6.0, seven Task Managers are spun up as
>>>> expected. But if I run with 1.6.2 only four Task Managers spin up and the
>>>> job hangs waiting for more resources.
>>>>
>>>> Our Flink distribution is set up by script after building from source.
>>>> So aside from flink jars, both 1.6.0 and 1.6.2 directories are identical.
>>>> The job is the same, restarting from savepoint. The problem is repeatable.
>>>>
>>>> Has something changed in 1.6.2, and if so can it be remedied with a
>>>> config change?
>>>>
