Hi Piotrek,

Thanks for your suggestions. I found some network issues which seem to be the cause of the back pressure.

Best
Weihua Hu

> On 26 May 2020, at 02:54, Piotr Nowojski <pi...@ververica.com> wrote:
>
> Hi Weihua,
>
> > After dumping the memory and analyzing it, I found:
> > Sink (121)'s RemoteInputChannel.unannouncedCredit = 0,
> > Map (242)'s CreditBasedSequenceNumberingViewReader.numCreditsAvailable = 0.
> > This is not consistent with my understanding of the Flink network
> > transmission mechanism.
>
> It probably is consistent. The downstream receiver has announced all of its
> credits and is simply waiting for the data to arrive, while the upstream
> sender is waiting for the data to be sent down the stream.
>
> The stack trace you posted confirms that the sink has an empty input
> buffer: it is waiting for input data. Assuming rescale partitioning works as
> expected and node 242 is indeed connected to node 121, this implies the
> bottleneck is the data exchange between those two tasks. It could be:
>
> - a network bottleneck (slow network? packet loss?)
> - machine swapping or long GC pauses (if the upstream node is experiencing
>   long pauses, it might show up like this)
> - a CPU bottleneck in the network stack (frequent flushing? SSL?)
> - some resource competition (too high parallelism for the given number of
>   machines)
> - Netty threads not keeping up
>
> It's hard to say what the problem is without looking at the resource usage
> (CPU/Network/Memory/Disk IO), GC logs, and code profiling results.
>
> Piotrek
>
> PS Zhijiang:
>
> RescalePartitioner in this case should connect just two upstream subtasks
> with one downstream sink. Upstream subtasks N and N+1 should be connected to
> the sink with id N/2.
>
>> On 25 May 2020, at 04:39, Weihua Hu <huweihua....@gmail.com> wrote:
>>
>> Hi, Zhijiang
>>
>> I understand the normal credit-based backpressure mechanism. Usually the
>> Sink inPoolUsage will be full, and the task stack will also have some
>> information.
>> But this time it is not the same. The Sink inPoolUsage is 0.
>> I also checked the stack.
>> The Map is waiting in
>> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment
>> The Sink is waiting for data to process, which is not in line with
>> expectations.
>>
>> <Pasted Graphic-2.tiff>
>>
>> <Pasted Graphic-1.tiff>
>>
>> Best
>> Weihua Hu
>>
>>> On 24 May 2020, at 21:57, Zhijiang <wangzhijiang...@aliyun.com> wrote:
>>>
>>> Hi Weihua,
>>>
>>> From the info below, this matches what is expected under credit-based
>>> flow control.
>>>
>>> I guess one of the Sink's parallel subtasks causes the backpressure, so
>>> you will see that there are no available credits on the Sink side and
>>> the outPoolUsage of Map is almost 100%. It really reflects the credit-based
>>> states in the case of backpressure.
>>>
>>> If you want to analyze the root cause of the backpressure, you can trace
>>> the task stack of the respective Sink subtask to find which operation is
>>> expensive, then try increasing the parallelism or improving the UDF (if it
>>> is the bottleneck). In addition, I am not sure why you chose rescale to
>>> shuffle data among operators. The default forward mode can give really
>>> good performance if you set the same parallelism among them.
>>>
>>> Best,
>>> Zhijiang
>>> ------------------------------------------------------------------
>>> From: Weihua Hu <huweihua....@gmail.com>
>>> Send Time: 24 May 2020 (Sunday) 18:32
>>> To: user <user@flink.apache.org>
>>> Subject: Singal task backpressure problem with Credit-based Flow Control
>>>
>>> Hi, all
>>>
>>> I ran into a weird single Task BackPressure problem.
>>>
>>> JobInfo:
>>> DAG: Source (1000) -> Map (2000) -> Sink (1000), which is linked via
>>> rescale.
>>> Flink version: 1.9.0
>>>
>>> There is no related info in the jobmanager/taskmanager log.
>>>
>>> Through Metrics, I see that Map (242)'s outPoolUsage is full, but its
>>> downstream Sink (121)'s inPoolUsage is 0.
>>>
>>> After dumping the memory and analyzing it, I found:
>>> Sink (121)'s RemoteInputChannel.unannouncedCredit = 0,
>>> Map (242)'s CreditBasedSequenceNumberingViewReader.numCreditsAvailable = 0.
>>> This is not consistent with my understanding of the Flink network
>>> transmission mechanism.
>>>
>>> Can someone help me? Thanks a lot.
>>>
>>> Best
>>> Weihua Hu
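
For anyone reading this thread later: Piotrek's PS says that with rescale, upstream subtasks N and N+1 connect to the sink with id N/2. Below is a minimal sketch of that index arithmetic only. It is not Flink code. The helper names `sink_for_map` and `maps_for_sink` are hypothetical, and it assumes 0-based subtask indices and an upstream parallelism that is an exact multiple of the downstream parallelism (as in the Map (2000) -> Sink (1000) job above).

```python
def sink_for_map(map_index: int, map_parallelism: int, sink_parallelism: int) -> int:
    """Hypothetical helper: downstream index fed by a given upstream subtask.

    Assumes 0-based indices and map_parallelism % sink_parallelism == 0.
    """
    if map_parallelism % sink_parallelism != 0:
        raise ValueError("sketch only covers evenly divisible parallelism")
    if not 0 <= map_index < map_parallelism:
        raise ValueError("map_index out of range")
    factor = map_parallelism // sink_parallelism  # upstream subtasks per sink
    return map_index // factor


def maps_for_sink(sink_index: int, map_parallelism: int, sink_parallelism: int) -> list:
    """Hypothetical helper: upstream subtasks that feed a given sink subtask."""
    if map_parallelism % sink_parallelism != 0:
        raise ValueError("sketch only covers evenly divisible parallelism")
    if not 0 <= sink_index < sink_parallelism:
        raise ValueError("sink_index out of range")
    factor = map_parallelism // sink_parallelism
    start = sink_index * factor
    return list(range(start, start + factor))


if __name__ == "__main__":
    # Numbers from this thread: Map (2000) -> Sink (1000) via rescale.
    print(sink_for_map(242, 2000, 1000))   # 121
    print(maps_for_sink(121, 2000, 1000))  # [242, 243]
```

Under these assumptions Map (242) and Map (243) both feed Sink (121), which matches the pair of subtasks discussed in the thread. If the metrics UI numbers subtasks from 1 instead of 0, the indices need shifting before the comparison.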