Hi Till, Thanks for your reply. Our cluster is Yarn cluster. I found that if we decrease the total parallel the timeout issue can be avoided. But we do need that amount of taskManagers to process data. In addition, once I increase the netty server threads to 128, the error is changed to to following error. It seems the cause is different. Could you help take a look?
2b0ac47c1eb1bcbbbe4a97) switched from RUNNING to FAILED. java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'athena464-sjc1/10.70.129.13:39466' has failed. This might indicate that the remote task manager has been lost. at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84) at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59) at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480) at org.apache.flink.runtime.io.network.partition.consumer.UnionInputGate.requestPartitions(UnionInputGate.java:134) at org.apache.flink.runtime.io.network.partition.consumer.UnionInputGate.getNextBufferOrEvent(UnionInputGate.java:148) at org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93) at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214) at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'athena464-sjc1/10.70.129.13:39466' has failed. This might indicate that the remote task manager has been lost. at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ... 1 common frames omitted Caused by: java.net.ConnectException: Connection timed out: athena464-sjc1/ 10.70.129.13:39466 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281) ... 6 common frames omitted Thanks, Wenrui On Mon, Jan 7, 2019 at 2:38 AM Till Rohrmann <trohrm...@apache.org> wrote: > Hi Wenrui, > > the code to set the connect timeout looks ok to me [1]. I also tested it > locally and checked that the timeout is correctly registered in Netty's > AbstractNioChannel [2]. > > Increasing the number of threads to 128 should not be necessary. But it > could indicate that there is some long lasting or blocking operation being > executed by the threads. > > How does the job submission and cluster configuration work with AthenaX? > Will the platform spawn for each job a new Flink cluster for which you can > specify the cluster configuration? > > [1] > https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyClient.java#L102 > [2] > https://github.com/netty/netty/blob/netty-4.0.27.Final/transport/src/main/java/io/netty/channel/nio/AbstractNioChannel.java#L207 > > Cheers, > Till > > On Sat, Jan 5, 2019 at 2:22 AM Wenrui Meng <wenruim...@gmail.com> wrote: > >> Hi Till, >> >> Thanks for your reply and help on this issue. >> >> I increased taskmanager.network.netty.client.connectTimeoutSec to 1200 >> which is 20 minutes. But it seems the connection not respects this timeout. >> In addition, I increase both taskmanager.network.request-backoff.max >> and taskmanager.registration.max-backoff to 20min. >> >> One thing I found is helpful to some extent is increasing >> the taskmanager.network.netty.server.numThreads. I increase it to 128 >> threads, it can succeed sometimes. But keep increasing it doesn't solve the >> problem. We have 100 parallel intermediate results, so there are too many >> partition requests. I think that's why it timeout. The solution should let >> the connection timeout increase. But I think there is some issue that >> connection doesn't respect the timeout config. >> >> We will definitely try the latest flink version. But at Uber, there is a >> team who is responsible to provide a platform with Flink. They will upgrade >> it at the end of this Month. Meanwhile, I would like to ask some help to >> investigate how to increase the connection timeout and make it respected. >> >> Thanks, >> Wenrui >> >> On Fri, Jan 4, 2019 at 5:27 AM Till Rohrmann <trohrm...@apache.org> >> wrote: >> >>> Hi Wenrui, >>> >>> from the logs I cannot spot anything suspicious. Which configuration >>> parameters have you changed exactly? Does the JobManager log contain >>> anything suspicious? >>> >>> The current Flink version changed quite a bit wrt 1.4. Thus, it might be >>> worth a try to run the job with the latest Flink version. >>> >>> Cheers, >>> Till >>> >>> On Thu, Jan 3, 2019 at 3:00 PM Wenrui Meng <wenruim...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> I consistently get connection timeout issue when creating >>>> partitionRequestClient in flink 1.4. I tried to ping from the connecting >>>> host to the connected host, but the ping latency is less than 0.1 ms >>>> consistently. So it's probably not due to the cluster status. I also tried >>>> increase max backoff, nettowrk timeout and some other setting, it doesn't >>>> help. >>>> >>>> I enabled the debug log of flink but not find any suspicious or useful >>>> information to help me fix the issue. Here is the link >>>> <https://www.dropbox.com/sh/sul62muz5pk0bqk/AABX8QbMrNmSq3k8I289mGmSa?dl=0> >>>> of the jobManager and taskManager logs. The connecting host is the host >>>> which throw the exception. The connected host is the host the connecting >>>> host try to request partition from. >>>> >>>> Since our platform is not up to date yet, the flink version I used in >>>> this is 1.4. But I noticed that there is not much change of these code on >>>> the Master branch. Any help will be appreciated. >>>> >>>> Here is stack trace of the exception >>>> >>>> from RUNNING to FAILED. >>>> java.io.IOException: Connecting the channel failed: Connecting to >>>> remote task manager + 'athena485-sjc1/10.70.132.8:34185' has failed. >>>> This might indicate that the remote task manager has been lost. >>>> at >>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197) >>>> at >>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132) >>>> at >>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84) >>>> at >>>> org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59) >>>> at >>>> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156) >>>> at >>>> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480) >>>> at >>>> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502) >>>> at >>>> org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93) >>>> at >>>> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214) >>>> at >>>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) >>>> at >>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264) >>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) >>>> at java.lang.Thread.run(Thread.java:748) >>>> Caused by: >>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: >>>> Connecting to remote task manager + 'athena485-sjc1/10.70.132.8:34185' >>>> has failed. This might indicate that the remote task manager has been lost. >>>> at >>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220) >>>> at >>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) >>>> ... 1 common frames omitted >>>> Caused by: >>>> org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: >>>> connection timed out: athena485-sjc1/10.70.132.8:34185 >>>> at >>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212) >>>> ... 6 common frames omitted >>>> >>>> Thanks, >>>> Wenrui >>>> >>>