Yes, it turned out that we had only configured TM<->JM communication
correctly - the inter-TM config was missing, so the effective policy was
"reject all". Thanks for the suggestion!
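
For future readers: the fix was a NetworkPolicy covering TM<->TM traffic,
not just TM<->JM. A minimal sketch of such a policy follows - the pod label
app: flink and the policy name are assumptions, adjust them to your chart's
labels:

```yaml
# Hypothetical NetworkPolicy sketch: allow Flink pods to reach each other
# on ANY port, since task managers pick their data/RPC ports randomly by
# default. The label "app: flink" is assumed; match your chart's labels.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: flink-internal-allow
spec:
  podSelector:
    matchLabels:
      app: flink
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: flink
```

Alternatively, the ports can be pinned in flink-conf.yaml (e.g.
taskmanager.rpc.port and taskmanager.data.port) so the policy only needs to
open a fixed set of ports.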

On Mon, Aug 23, 2021 at 5:29 PM Arvid Heise <ar...@apache.org> wrote:

> It rather looks to me as if the task managers cannot communicate with each
> other. Can you check your network policies? Are they allowed to communicate
> on random ports?
>
> On Mon, Aug 23, 2021 at 8:37 AM Zbyszko Papierski <
> zpapier...@wikimedia.org> wrote:
>
>> Hi,
>>
>> No, they don't - only the job is being restarted after that, without any
>> luck. The exception I provided comes from the exceptions list of the job itself.
>>
>> On Mon, Aug 23, 2021 at 4:50 AM Caizhi Weng <tsreape...@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> This might be that some task managers cannot reach the job
>>> manager in time. Have any of the task manager instances restarted after this
>>> failure? If yes, what do the logs (Flink log and kubernetes log) of the
>>> failed task manager say?
>>>
>>> Zbyszko Papierski <zpapier...@wikimedia.org> 于2021年8月20日周五 下午11:07写道:
>>>
>>>> Hi!
>>>>
>>>> We're trying to successfully deploy our application to our Kubernetes
>>>> cluster and we seem to have hit a snag. Long story short - any kind of
>>>> deployment that involves a cluster of more than 1 TM seem to fail our job
>>>> almost immediately with this exception:
>>>>
>>>> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
>>>>> Sending the partition request to 'null' failed.
>>>>> at
>>>>> org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>>>> at
>>>>> org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>>> at java.base/java.lang.Thread.run(Thread.java:834)
>>>>> Caused by: java.nio.channels.ClosedChannelException
>>>>> at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>>>> ... 11 more
>>>>
>>>>
>>>> I don't have enough experience to judge from the rest of the logs what
>>>> the reason might be, but I'm including the debug logs that I can get
>>>> from kubectl, both JM and TM.
>>>> We use a session cluster deployed on Kubernetes (not the native Kubernetes
>>>> deployment - there are still some issues that prevent us from using it) and
>>>> we deploy our app to that cluster. We have confirmed that everything works
>>>> when there's a single Task Manager, but we'd rather not continue with that
>>>> limitation. The way we define the cluster itself on k8s is shown here [1]
>>>> and the chart for the deployment itself can be found here [2]. The app
>>>> we're deploying is available
>>>> here [3]. We're running Flink 1.12.1 on openjdk-jre 11.
>>>>
>>>>  Since I overslept the k8s revolution a bit and am somewhat new to it, I am
>>>> not sure which information to provide to make the situation clearer, but
>>>> any help is greatly appreciated!
>>>>
>>>>
>>>> [1]
>>>> https://github.com/wikimedia/operations-deployment-charts/tree/master/charts/flink-session-cluster
>>>> [2]
>>>> https://github.com/wikimedia/operations-deployment-charts/tree/master/helmfile.d/services/rdf-streaming-updater
>>>> [3]
>>>> https://github.com/wikimedia/wikidata-query-rdf/tree/master/streaming-updater-producer
>>>
>>
>

-- 

Zbyszko Papierski (He/Him)

Senior Software Engineer

Wikimedia Foundation <https://wikimediafoundation.org/>
