Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-09 Thread Wenrui Meng
> From: Wenrui Meng > Sent: Wednesday, January 9, 2019, 18:18 > To: Till Rohrmann > Cc: user; Konstantin > Subject: Re: ConnectTimeoutException when createPartitionRequestClient > > Hi Till, > > This job is not on AthenaX but on a special Uber version of Flink. I tried

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-09 Thread Wenrui Meng
Hi Till, I will try the local test according to your suggestion. The Uber Flink version mainly adds integration with Uber's deployment and other infra components; there is no change to Flink's original code flow. I also found that the issue can be avoided with the same setting in other c

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-09 Thread zhijiang
number of netty threads and timeout should make sense for normal cases. Best, Zhijiang -- From: Wenrui Meng Sent: Wednesday, January 9, 2019, 18:18 To: Till Rohrmann Cc: user; Konstantin Subject: Re: ConnectTimeoutException when

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-09 Thread Wenrui Meng
Hi Till, This job is not on AthenaX but on a special Uber version of Flink. I tried to ping the connected host from the connecting host; it seems very stable. For the connection timeout, I did set it to 20 minutes, but it still reports the timeout after 2 minutes. Could you let me know how you test locally ab

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-09 Thread Till Rohrmann
Hi Wenrui, I executed AutoParallelismITCase#testProgramWithAutoParallelism and set a breakpoint in NettyClient.java:102 to see whether the configured timeout value is correctly set. Moreover, I did the same for AbstractNioChannel.java:207, and it looked as if the correct timeout value was set. What
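For readers who want to reproduce a similar local check outside of Flink, below is a minimal sketch using plain Netty 4, not Flink's internal NettyClient: it connects to a non-routable test address so that only CONNECT_TIMEOUT_MILLIS determines when the attempt fails. The class name, address, port, and 5-second timeout are illustrative assumptions; the thread above configures a 1200-second timeout at the Flink level.

  import io.netty.bootstrap.Bootstrap;
  import io.netty.channel.ChannelFuture;
  import io.netty.channel.ChannelInitializer;
  import io.netty.channel.ChannelOption;
  import io.netty.channel.nio.NioEventLoopGroup;
  import io.netty.channel.socket.SocketChannel;
  import io.netty.channel.socket.nio.NioSocketChannel;

  // Sketch only: checks that Netty honors CONNECT_TIMEOUT_MILLIS for a plain client
  // Bootstrap. It does not exercise Flink's partition request client code path.
  public class ConnectTimeoutProbe {
      public static void main(String[] args) throws Exception {
          NioEventLoopGroup group = new NioEventLoopGroup(1);
          try {
              Bootstrap bootstrap = new Bootstrap()
                  .group(group)
                  .channel(NioSocketChannel.class)
                  // 5 s here for a quick check; the thread above uses 1200 s in Flink.
                  .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5_000)
                  .handler(new ChannelInitializer<SocketChannel>() {
                      @Override
                      protected void initChannel(SocketChannel ch) {
                          // No handlers needed; only the connect phase is of interest.
                      }
                  });

              long start = System.nanoTime();
              // 10.255.255.1 is used here as a typically unreachable address (assumption).
              ChannelFuture future = bootstrap.connect("10.255.255.1", 8080).awaitUninterruptibly();
              long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
              System.out.println("success=" + future.isSuccess()
                      + ", cause=" + future.cause() + ", elapsedMs=" + elapsedMs);
          } finally {
              group.shutdownGracefully();
          }
      }
  }

If the configured value is picked up, the connect attempt should fail with a ConnectTimeoutException after roughly the configured number of milliseconds (about 5 seconds with the value above).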

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-08 Thread Till Rohrmann
Hi Wenrui, the exception now occurs while finishing the connection creation. I'm not sure whether this is so different. Could it be that your network is overloaded or not very reliable? Have you tried running your Flink job outside of AthenaX? Cheers, Till On Tue, Jan 8, 2019 at 2:50 PM Wenrui M

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-08 Thread Wenrui Meng
Hi Till, Thanks for your reply. Our cluster is a YARN cluster. I found that if we decrease the total parallelism, the timeout issue can be avoided, but we do need that number of TaskManagers to process the data. In addition, once I increase the netty server threads to 128, the error changes to the followi
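For reference, the netty server thread pool mentioned above is an ordinary Flink configuration key. A flink-conf.yaml excerpt along these lines would express the change; 128 is simply the value quoted in the message, not a recommendation, and the client-side key is added here only for completeness:

  taskmanager.network.netty.server.numThreads: 128
  # Client-side counterpart (not mentioned in the thread; listed for completeness):
  taskmanager.network.netty.client.numThreads: 128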

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-07 Thread Till Rohrmann
Hi Wenrui, the code to set the connect timeout looks ok to me [1]. I also tested it locally and checked that the timeout is correctly registered in Netty's AbstractNioChannel [2]. Increasing the number of threads to 128 should not be necessary. But it could indicate that there is some long lastin

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-04 Thread Wenrui Meng
Hi Till, Thanks for your reply and help on this issue. I increased taskmanager.network.netty.client.connectTimeoutSec to 1200, which is 20 minutes, but it seems the connection does not respect this timeout. In addition, I increased both taskmanager.network.request-backoff.max and taskmanager.registrati
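To make the settings concrete, a flink-conf.yaml sketch with the two keys that are fully named in this message could look as follows. The truncated taskmanager.registrati… key is left out because its full name is not shown, and the backoff value below is illustrative:

  taskmanager.network.netty.client.connectTimeoutSec: 1200   # 1200 s = 20 minutes, as above
  taskmanager.network.request-backoff.max: 120000            # milliseconds, illustrative value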

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-04 Thread Till Rohrmann
Hi Wenrui, from the logs I cannot spot anything suspicious. Which configuration parameters have you changed exactly? Does the JobManager log contain anything suspicious? The current Flink version changed quite a bit wrt 1.4. Thus, it might be worth a try to run the job with the latest Flink versi

ConnectTimeoutException when createPartitionRequestClient

2019-01-03 Thread Wenrui Meng
Hi, I consistently get a connection timeout when creating the partitionRequestClient in Flink 1.4. I tried to ping from the connecting host to the connected host, but the ping latency is consistently less than 0.1 ms, so it's probably not due to the cluster status. I also tried increasing the max backof
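For orientation, the "max backoff" mentioned here appears to correspond to the partition-request backoff setting named in full in the replies above (taskmanager.network.request-backoff.max). A hedged flink-conf.yaml sketch, with illustrative values and the caveat that key names and defaults can differ between Flink versions:

  taskmanager.network.request-backoff.initial: 100   # milliseconds, illustrative
  taskmanager.network.request-backoff.max: 10000     # milliseconds, illustrative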