Hi Till,

Thanks for your reply and help on this issue.

I increased taskmanager.network.netty.client.connectTimeoutSec to 1200, which is 20 minutes, but the connection does not seem to respect this timeout. In addition, I increased both taskmanager.network.request-backoff.max and taskmanager.registration.max-backoff to 20 minutes.

One thing I found helpful to some extent is increasing taskmanager.network.netty.server.numThreads. With 128 threads the job sometimes succeeds, but increasing it further doesn't solve the problem. We have 100 parallel intermediate results, so there are a lot of partition requests, and I think that is why the connections time out. The solution should be to increase the connection timeout, but it looks like the connection doesn't respect the timeout config.

We will definitely try the latest Flink version. However, at Uber a separate team is responsible for providing the Flink platform, and they will upgrade it at the end of this month. In the meantime, I would like to ask for some help investigating how to increase the connection timeout and make sure it is respected.

Thanks,
Wenrui

On Fri, Jan 4, 2019 at 5:27 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Wenrui,
>
> from the logs I cannot spot anything suspicious. Which configuration
> parameters have you changed exactly? Does the JobManager log contain
> anything suspicious?
>
> The current Flink version changed quite a bit wrt 1.4. Thus, it might be
> worth a try to run the job with the latest Flink version.
>
> Cheers,
> Till
>
> On Thu, Jan 3, 2019 at 3:00 PM Wenrui Meng <wenruim...@gmail.com> wrote:
>
>> Hi,
>>
>> I consistently get a connection timeout when creating
>> partitionRequestClient in Flink 1.4. I tried to ping from the connecting
>> host to the connected host, but the ping latency is consistently less
>> than 0.1 ms, so it's probably not due to the cluster status. I also tried
>> increasing the max backoff, network timeout and some other settings, but
>> it doesn't help.
>>
>> I enabled Flink's debug logging but did not find any suspicious or useful
>> information to help me fix the issue. Here is the link
>> <https://www.dropbox.com/sh/sul62muz5pk0bqk/AABX8QbMrNmSq3k8I289mGmSa?dl=0>
>> to the jobManager and taskManager logs. The connecting host is the host
>> that throws the exception. The connected host is the host from which the
>> connecting host tries to request the partition.
>>
>> Since our platform is not up to date yet, the Flink version I used here
>> is 1.4. But I noticed that this code has not changed much on the master
>> branch. Any help will be appreciated.
>>
>> Here is the stack trace of the exception:
>>
>> from RUNNING to FAILED.
>> java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'athena485-sjc1/10.70.132.8:34185' has failed. This might indicate that the remote task manager has been lost.
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
>>     at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
>>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
>>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
>>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
>>     at org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93)
>>     at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214)
>>     at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>>     at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'athena485-sjc1/10.70.132.8:34185' has failed. This might indicate that the remote task manager has been lost.
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>     ... 1 common frames omitted
>> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: athena485-sjc1/10.70.132.8:34185
>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212)
>>     ... 6 common frames omitted
>>
>> Thanks,
>> Wenrui
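
For reference, the settings described at the top of this message would look roughly like the following in flink-conf.yaml. This is only a sketch, not the actual file from the cluster: the four keys and the values (20 minutes, 128 threads) come from the message, while the value formats are assumptions (request-backoff.max is taken to be in milliseconds and registration.max-backoff a duration string, as described in the Flink configuration documentation).

```yaml
# Sketch of the changed settings; value formats are assumptions, see note above.

# Netty client connect timeout, in seconds (20 minutes).
taskmanager.network.netty.client.connectTimeoutSec: 1200

# Maximum backoff for partition requests, assumed to be in milliseconds (20 minutes).
taskmanager.network.request-backoff.max: 1200000

# Maximum backoff between TaskManager registration attempts, assumed duration string.
taskmanager.registration.max-backoff: 20 min

# Number of Netty server threads; 128 helped sometimes, higher values did not.
taskmanager.network.netty.server.numThreads: 128
```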