Hi Weihua,

Good to hear that you have found the problem. Let us know if you run into any other problems after all.

Piotrek

> On 27 May 2020, at 14:18, Weihua Hu <huweihua....@gmail.com> wrote:
>
> Hi Piotrek,
>
> Thanks for your suggestions. I found some network issues which seem to be
> the cause of the back pressure.
>
> Best
> Weihua Hu
>
>> On 26 May 2020, at 02:54, Piotr Nowojski <pi...@ververica.com> wrote:
>>
>> Hi Weihua,
>>
>> > After dumping the memory and analyzing it, I found:
>> > Sink (121)'s RemoteInputChannel.unannouncedCredit = 0,
>> > Map (242)'s CreditBasedSequenceNumberingViewReader.numCreditsAvailable = 0.
>> > This is not consistent with my understanding of the Flink network
>> > transmission mechanism.
>>
>> It probably is consistent. The downstream receiver has announced all of its
>> credits and is simply waiting for the data to arrive, while the upstream
>> sender is waiting for the data to be sent down the stream.
>>
>> The stack trace you posted confirms that the sink has an empty input
>> buffer: it is waiting for input data. Assuming rescale partitioning works as
>> expected and node 242 is indeed connected to node 121, this implies the
>> bottleneck is the data exchange between those two tasks. It could be:
>>
>> - network bottleneck (slow network? packet losses?)
>> - machine swapping/long GC pauses (if the upstream node is experiencing long
>>   pauses it might show up like this)
>> - CPU bottleneck in the network stack (frequent flushing? SSL?)
>> - some resource competition (too high parallelism for the given number of
>>   machines)
>> - netty threads not keeping up
>>
>> It is hard to say what the problem is without looking at the resource usage
>> (CPU/Network/Memory/Disk IO), GC logs, and code profiling results.
>>
>> Piotrek
>>
>> PS Zhijiang:
>>
>> RescalePartitioner in this case should connect just two upstream subtasks
>> with one downstream sink. Upstream subtasks N and N+1 should be connected to
>> the sink with id N/2.
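
As a rough illustration of the N to N/2 mapping described in the PS, here is a short Python sketch. It is only a sketch under assumptions, not Flink's RescalePartitioner code: the function name `rescale_target` is made up for this note, subtask indices are assumed to be zero-based, and the upstream parallelism is assumed to be an exact multiple of the downstream parallelism (2000 and 1000 in this job).

```python
def rescale_target(upstream_index: int,
                   upstream_parallelism: int,
                   downstream_parallelism: int) -> int:
    """Return the downstream subtask that an upstream subtask feeds.

    Hypothetical helper for illustration only. It assumes zero-based
    indices and an upstream parallelism that is an exact multiple of
    the downstream parallelism.
    """
    if upstream_parallelism % downstream_parallelism != 0:
        raise ValueError("sketch only covers exact multiples")
    if not 0 <= upstream_index < upstream_parallelism:
        raise ValueError("upstream index out of range")
    # Number of upstream subtasks that share one downstream subtask.
    fan_in = upstream_parallelism // downstream_parallelism
    return upstream_index // fan_in


if __name__ == "__main__":
    # Map (2000) -> Sink (1000): subtasks 242 and 243 share sink 121.
    print(rescale_target(242, 2000, 1000))  # 121
    print(rescale_target(243, 2000, 1000))  # 121
```

With the parallelism from this job, Map (242) and Map (243) both land on Sink (121), which is the pairing the thread relies on.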
>>
>>> On 25 May 2020, at 04:39, Weihua Hu <huweihua....@gmail.com> wrote:
>>>
>>> Hi, Zhijiang
>>>
>>> I understand the normal credit-based backpressure mechanism. Usually the
>>> Sink's inPoolUsage will be full, and the task stack will also show some
>>> information.
>>> But this time is different. The Sink's inPoolUsage is 0.
>>> I also checked the stack. The Map is waiting in
>>> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment
>>> The Sink is waiting for data to process, which is not in line with
>>> expectations.
>>>
>>> <Pasted Graphic-2.tiff>
>>>
>>> <Pasted Graphic-1.tiff>
>>>
>>> Best
>>> Weihua Hu
>>>
>>>> On 24 May 2020, at 21:57, Zhijiang <wangzhijiang...@aliyun.com> wrote:
>>>>
>>>> Hi Weihua,
>>>>
>>>> From the info below, this matches what is expected under credit-based flow
>>>> control.
>>>>
>>>> I guess one of the Sink subtasks causes the backpressure, so you will
>>>> see that there are no available credits on the Sink side and
>>>> the outPoolUsage of Map is almost 100%. It really reflects the
>>>> credit-based states in the case of backpressure.
>>>>
>>>> If you want to analyze the root cause of the backpressure, you can trace the
>>>> task stack of the respective Sink subtask to find which operation is
>>>> expensive,
>>>> then you can try increasing the parallelism or improving the UDF (if it is
>>>> the bottleneck). In addition, I am not sure why you chose
>>>> rescale to shuffle data among operators. The default
>>>> forward mode can give really good performance if you set
>>>> the same parallelism among them.
>>>>
>>>> Best,
>>>> Zhijiang
>>>> ------------------------------------------------------------------
>>>> From: Weihua Hu <huweihua....@gmail.com>
>>>> Send Time: 24 May 2020 (Sunday) 18:32
>>>> To: user <user@flink.apache.org>
>>>> Subject: Singal task backpressure problem with Credit-based Flow Control
>>>>
>>>> Hi, all
>>>>
>>>> I ran into a weird single-task backpressure problem.
>>>>
>>>> JobInfo:
>>>> DAG: Source (1000) -> Map (2000) -> Sink (1000), linked via
>>>> rescale.
>>>> Flink version: 1.9.0
>>>>
>>>> There is no related info in the jobmanager/taskmanager logs.
>>>>
>>>> Through metrics, I see that Map (242)'s outPoolUsage is full, but its
>>>> downstream Sink (121)'s inPoolUsage is 0.
>>>>
>>>> After dumping the memory and analyzing it, I found:
>>>> Sink (121)'s RemoteInputChannel.unannouncedCredit = 0,
>>>> Map (242)'s CreditBasedSequenceNumberingViewReader.numCreditsAvailable = 0.
>>>> This is not consistent with my understanding of the Flink network
>>>> transmission mechanism.
>>>>
>>>> Can someone help me? Thanks a lot.
>>>>
>>>> Best
>>>> Weihua Hu
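
The state reported in the original question can be checked against a toy accounting model. This is a simplified sketch, not Flink's actual bookkeeping. It assumes every receiver buffer reserved for a channel is in exactly one of four places: unannounced at the receiver, held as credit by the sender, queued with data at the receiver, or in transit (as a credit announcement or as data on the wire). The function name `buffers_in_transit` and the buffer count of 2 are hypothetical, and inPoolUsage = 0 is read here as "nothing queued at the receiver".

```python
def buffers_in_transit(total_buffers: int,
                       unannounced_credit: int,
                       sender_credits: int,
                       queued_at_receiver: int) -> int:
    """Toy model: buffers that are neither at the sender nor at the receiver.

    Hypothetical helper for illustration only. Whatever is not accounted
    for on either side must be travelling between the two tasks.
    """
    in_transit = (total_buffers
                  - unannounced_credit
                  - sender_credits
                  - queued_at_receiver)
    if in_transit < 0:
        raise ValueError("inconsistent snapshot: more buffers than exist")
    return in_transit


if __name__ == "__main__":
    # Values from the memory dump in the thread; total_buffers=2 is assumed.
    stuck = buffers_in_transit(
        total_buffers=2,
        unannounced_credit=0,   # Sink (121) RemoteInputChannel.unannouncedCredit
        sender_credits=0,       # Map (242) numCreditsAvailable
        queued_at_receiver=0,   # assumption derived from inPoolUsage = 0
    )
    print(stuck)  # 2
```

Under these assumptions every buffer is in transit between the two tasks, which matches Piotrek's reading that the data exchange itself is the bottleneck.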