Hi Till,

Thanks for your reply and help on this issue.

I increased taskmanager.network.netty.client.connectTimeoutSec to 1200
which is 20 minutes. But it seems the connection not respects this timeout.
In addition, I increase both taskmanager.network.request-backoff.max
and taskmanager.registration.max-backoff to 20min.

One thing I found is helpful to some extent is increasing
the taskmanager.network.netty.server.numThreads. I increase it to 128
threads, it can succeed sometimes. But keep increasing it doesn't solve the
problem. We have 100 parallel intermediate results, so there are too many
partition requests. I think that's why it timeout. The solution should let
the connection timeout increase. But I think there is some issue that
connection doesn't respect the timeout config.

We will definitely try the latest flink version. But at Uber, there is a
team who is responsible to provide a platform with Flink. They will upgrade
it at the end of this Month. Meanwhile, I would like to ask some help to
investigate how to increase the connection timeout and make it respected.

Thanks,
Wenrui

On Fri, Jan 4, 2019 at 5:27 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Wenrui,
>
> from the logs I cannot spot anything suspicious. Which configuration
> parameters have you changed exactly? Does the JobManager log contain
> anything suspicious?
>
> The current Flink version changed quite a bit wrt 1.4. Thus, it might be
> worth a try to run the job with the latest Flink version.
>
> Cheers,
> Till
>
> On Thu, Jan 3, 2019 at 3:00 PM Wenrui Meng <wenruim...@gmail.com> wrote:
>
>> Hi,
>>
>> I consistently get connection timeout issue when creating
>> partitionRequestClient in flink 1.4. I tried to ping from the connecting
>> host to the connected host, but the ping latency is less than 0.1 ms
>> consistently. So it's probably not due to the cluster status. I also tried
>> increase max backoff, nettowrk timeout and some other setting, it doesn't
>> help.
>>
>> I enabled the debug log of flink but not find any suspicious or useful
>> information to help me fix the issue. Here is the link
>> <https://www.dropbox.com/sh/sul62muz5pk0bqk/AABX8QbMrNmSq3k8I289mGmSa?dl=0>
>> of the jobManager and taskManager logs. The connecting host is the host
>> which throw the exception. The connected host is the host the connecting
>> host try to request partition from.
>>
>> Since our platform is not up to date yet, the flink version I used in
>> this is 1.4. But I noticed that there is not much change of these code on
>> the Master branch. Any help will be appreciated.
>>
>> Here is stack trace of the exception
>>
>> from RUNNING to FAILED.
>> java.io.IOException: Connecting the channel failed: Connecting to remote
>> task manager + 'athena485-sjc1/10.70.132.8:34185' has failed. This might
>> indicate that the remote task manager has been lost.
>> at
>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
>> at
>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
>> at
>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
>> at
>> org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
>> at
>> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
>> at
>> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
>> at
>> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
>> at
>> org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93)
>> at
>> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214)
>> at
>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by:
>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>> Connecting to remote task manager + 'athena485-sjc1/10.70.132.8:34185'
>> has failed. This might indicate that the remote task manager has been lost.
>> at
>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
>> at
>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>> ... 1 common frames omitted
>> Caused by:
>> org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException:
>> connection timed out: athena485-sjc1/10.70.132.8:34185
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212)
>> ... 6 common frames omitted
>>
>> Thanks,
>> Wenrui
>>
>

Reply via email to