Re: [ANNOUNCE] Apache Flink 1.10.2 released

2020-08-27 Thread Zhijiang
Congrats, thanks for the release manager work Zhu Zhu and everyone involved in! Best, Zhijiang -- From:liupengcheng Send Time:2020年8月26日(星期三) 19:37 To:dev ; Xingbo Huang Cc:Guowei Ma ; user-zh ; Yangze Guo ; Dian Fu ; Zhu Zhu

Re: Re: [ANNOUNCE] New PMC member: Dian Fu

2020-08-27 Thread Zhijiang
Congrats, Dian! -- From:Yun Gao Send Time:2020年8月27日(星期四) 17:44 To:dev ; Dian Fu ; user ; user-zh Subject:Re: Re: [ANNOUNCE] New PMC member: Dian Fu Congratulations Dian ! Best Yun ---

Re: Adaptive load balancing

2020-09-23 Thread Zhijiang
custom partitioner which can control the logic of keyBy distribution based on pre-defined cache distribution in nodes? Best, Zhijiang -- From:Navneeth Krishnan Send Time:2020年9月23日(星期三) 02:21 To:user Subject:Adaptive load balancing

Re: Backpressure and 99th percentile latency

2020-03-05 Thread Zhijiang
ssure, it should go down to milliseconds. Best, Zhijiang -- From:Felipe Gutierrez Send Time:2020 Mar. 6 (Fri.) 05:04 To:user Subject:Backpressure and 99th percentile latency Hi, I am a bit confused about the topic of tracking la

Re: Backpressure and 99th percentile latency

2020-03-07 Thread Zhijiang
be increased along with the trend of increased input&outputQueueLength and input&outputPoolUsage. All of them should be proportional to have the same trend in most cases. Best, Zhijiang -- From:Felipe Gutierrez Send

Re: Help me understand this Exception

2020-03-18 Thread Zhijiang
e from the wrapped real exception, then users can easily get the root cause directly, not only for the current message "Could not forward element to next operator". Best, Zhijiang -- From:Tzu-Li (Gordon) Tai Send Time:20

Re: How do I get the outPoolUsage value inside my own stream operator?

2020-03-18 Thread Zhijiang
skMetricGroup. Hope it solve your problem. Best, Zhijiang -- From:Felipe Gutierrez Send Time:2020 Mar. 17 (Tue.) 17:50 To:user Subject:Re: How do I get the outPoolUsage value inside my own stream operator? Hi, just for the recor

Re: FlinkCEP - Detect absence of a certain event

2020-03-18 Thread Zhijiang
Hi Humberto, I guess Fuji is familiar with Flink CEP and he can answer your proposed question. I already cc him. Best, Zhijiang -- From:Humberto Rodriguez Avila Send Time:2020 Mar. 18 (Wed.) 17:31 To:user Subject:FlinkCEP

Re: End to End Latency Tracking in flink

2020-03-29 Thread Zhijiang
Hi Lu, Besides Congxian's replies, you can also get some further explanations from "https://flink.apache.org/2019/07/23/flink-network-stack-2.html#latency-tracking";. Best, Zhijiang -- From:Congxian Qiu Send Ti

Re: [ANNOUNCE] Flink on Zeppelin (Zeppelin 0.9 is released)

2020-03-30 Thread Zhijiang
Thanks for the continuous efforts for engaging in Flink ecosystem Jeff! Glad to see the progressive achievement. Wish more users try it out in practice. Best, Zhijiang -- From:Dian Fu Send Time:2020 Mar. 31 (Tue.) 10:15 To:Jeff

Re: [ANNOUNCE] Apache Flink Stateful Functions 2.0.0 released

2020-04-09 Thread Zhijiang
Great work! Thanks Gordon for the continuous efforts for enhancing stateful functions and the efficient release! Wish stateful functions becoming more and more popular in users. Best, Zhijiang -- From:Yun Tang Send Time:2020 Apr

Re: [ANNOUNCE] Apache Flink 1.9.3 released

2020-04-27 Thread Zhijiang
Thanks Dian for the release work and thanks everyone involved. Best, Zhijiang -- From:Till Rohrmann Send Time:2020 Apr. 27 (Mon.) 15:13 To:Jingsong Li Cc:dev ; Leonard Xu ; Benchao Li ; Konstantin Knauf ; jincheng sun ; Hequn

Re: Flink Streaming Job Tuning help

2020-05-12 Thread Zhijiang
work related metrics. Whether there are data skew in your case, that means some task would process more records than others. If so, maybe we can increase the parallelism to balance the load. Best, Zhijiang -- From:Senthil Kumar Send

Re: [ANNOUNCE] Apache Flink 1.10.1 released

2020-05-18 Thread Zhijiang
Thanks Yu for the release manager and everyone involved in. Best, Zhijiang -- From:Arvid Heise Send Time:2020年5月18日(星期一) 23:17 To:Yangze Guo Cc:dev ; Apache Announce List ; user ; Yu Li ; user-zh Subject:Re: [ANNOUNCE] Apache

Re: Singal task backpressure problem with Credit-based Flow Control

2020-05-24 Thread Zhijiang
am not sure why you choose rescale to shuffle data among operators. The default forward mode can gain really good performance by default if you adjusting the same parallelism among them. Best, Zhijiang -- From:Weihua Hu Send Time

Re: Flink 1.10 memory and backpressure

2020-06-10 Thread Zhijiang
Regarding the monitor of backpressure, you can refer to the document [1]. As for debugging the backpressure, one option is to trace the jstack of respective window task thread which causes the backpressure(almost has the maximum inqueue buffers). After frequent tracing the jstack, you might find

Re: Flink 1.10 memory and backpressure

2020-06-10 Thread Zhijiang
Sorry for missing the document link [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/back_pressure.html -- From:Zhijiang Send Time:2020年6月11日(星期四) 11:32 To:Steven Nelson ; user Subject:Re: Flink 1.10 mem

Re: Blocked requesting MemorySegment when Segments are available.

2020-06-10 Thread Zhijiang
n" operator is more than 0? I want to get ride of the factors of buffer leak on upstream side and without partition request on downstream side. Then we can further allocate whether the input availability notification on downstream s

Re: [ANNOUNCE] Yu Li is now part of the Flink PMC

2020-06-16 Thread Zhijiang
Congratulations Yu! Well deserved! Best, Zhijiang -- From:Dian Fu Send Time:2020年6月17日(星期三) 10:48 To:dev Cc:Haibo Sun ; user ; user-zh Subject:Re: [ANNOUNCE] Yu Li is now part of the Flink PMC Congrats Yu! Regards, Dian >

Re: Unaligned Checkpoint and Exactly Once

2020-06-21 Thread Zhijiang
using the first config form. But somehow they seem two different dimensions for config the checkpoint. One is for the semantic of data processing guarantee. And the other is for how we realize two different mechanisms to guarantee one (exactly-once) of the semantics. Best, Zhijiang

Re: Unaligned Checkpoint and Exactly Once

2020-06-21 Thread Zhijiang
:2020年6月22日(星期一) 10:53 To:Zhijiang ; user@flink.apache.org Subject:回复: Unaligned Checkpoint and Exactly Once Thank you Zhijiang! The second question about config is just because I find a method in InputProcessorUtil. I guess AT_LEAST_ONCE mode is a simpler way to handle checkpont barrier

[ANNOUNCE] Apache Flink 1.11.0 released

2020-07-07 Thread Zhijiang
://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364 We would like to thank all contributors of the Apache Flink community who made this release possible! Cheers, Piotr & Zhijiang

Re: [ANNOUNCE] Apache Flink 1.11.1 released

2020-07-22 Thread Zhijiang
Thanks for being the release manager and the efficient work, Dian! Best, Zhijiang -- From:Konstantin Knauf Send Time:2020年7月22日(星期三) 19:55 To:Till Rohrmann Cc:dev ; Yangze Guo ; Dian Fu ; user ; user-zh Subject:Re: [ANNOUNCE

Re: Will broadcast stream affect performance because of the absence of operator chaining?

2019-08-06 Thread zhijiang
your acception or not. If the performance is not reaching your requirements, we could further consider other improvements. Best, Zhijiang -- From:Piotr Nowojski Send Time:2019年8月6日(星期二) 14:55 To:黄兆鹏 Cc:user Subject:Re: Will

Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer

2019-08-14 Thread zhijiang
Congratulations Andrey, great work and well deserved! Best, Zhijiang -- From:Till Rohrmann Send Time:2019年8月14日(星期三) 15:26 To:dev ; user Subject:[ANNOUNCE] Andrey Zagrebin becomes a Flink committer Hi everyone, I'm very hap

Re: [ANNOUNCE] Zili Chen becomes a Flink committer

2019-09-11 Thread zhijiang
Congratulations Zili! -- From:Becket Qin Send Time:2019年9月12日(星期四) 03:43 To:Paul Lam Cc:Rong Rong ; dev ; user Subject:Re: [ANNOUNCE] Zili Chen becomes a Flink committer Congrats, Zili! On Thu, Sep 12, 2019 at 9:39 AM Paul Lam wr

Re: How can I get the backpressure signals inside my function or operator?

2019-11-05 Thread Zhijiang
x27;s metric of `Shuffle.Netty.Input.Buffers.inputQueueLength` on preAggregate side, you might rely on some external metric reporter to query it if possible. Best, Zhijiang -- From:Felipe Gutierrez Send Time:2019 Nov. 5 (Tue.

Re: What metrics can I see the root cause of "Buffer pool is destroyed" message?

2019-11-06 Thread Zhijiang
lure which triggers the following cancel operations. In addition, which flink version are you using? Best, Zhijiang -- From:Felipe Gutierrez Send Time:2019 Nov. 6 (Wed.) 19:12 To:user Subject:What metrics can I see the root cause

Re: How can I get the backpressure signals inside my function or operator?

2019-11-06 Thread Zhijiang
monitoring/rest_api.html Best, Zhijiang -- From:Felipe Gutierrez Send Time:2019 Nov. 7 (Thu.) 00:06 To:Chesnay Schepler Cc:Zhijiang ; user Subject:Re: How can I get the backpressure signals inside my function or operator? If I can t

Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

2019-11-20 Thread Zhijiang
stack, especially it really spans several releases. Best, Zhijiang -- From:Hailu, Andreas Send Time:2019 Nov. 21 (Thu.) 01:03 To:user@flink.apache.org Subject:RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1

Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

2019-11-21 Thread Zhijiang
. // ah From: Piotr Nowojski On Behalf Of Piotr Nowojski Sent: Thursday, November 21, 2019 10:14 AM To: Hailu, Andreas [Engineering] Cc: Zhijiang ; user@flink.apache.org Subject: Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1? Hi, I would suspect this: https://issues.apache.org

Re: Flink task node shut it self off.

2019-12-22 Thread Zhijiang
increase your task manager memory. But if you can analyze the dump hs_err file via some profiler tool for checking the memory usage, it might be more helpful to find the root cause. Best, Zhijiang -- From:John Smith Send Time:2019 Dec

Re: Flink task node shut it self off.

2019-12-24 Thread Zhijiang
. -- From:John Smith Send Time:2019 Dec. 25 (Wed.) 03:40 To:Zhijiang Cc:user Subject:Re: Flink task node shut it self off. The shutdown happened after the massive IO wait. I don't use any state Checkpoints are disk based... On Mon., Dec. 23, 2019, 1:42 a.m. Zhijiang, wrote: Hi

Re: Rewind offset to a previous position and ensure certainty.

2019-12-25 Thread Zhijiang
not support such operation/function atm. :) [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/event_timestamps_watermarks.html Best, Zhijiang -- From:邢瑞斌 Send Time:2019 Dec. 25 (Wed.) 20:27 To:user-zh ; user

Re: Aggregating Movie Rental information in a DynamoDB table using DynamoDB streams and Flink

2019-12-25 Thread Zhijiang
addition, the StreamingFileSink also implements the exactly-once for sink. You might also refer to it to get some insights if possible. Best, Zhijiang -- From:Joe Hansen Send Time:2019 Dec. 26 (Thu.) 01:42 To:user Subject:Aggregating

Re: How to verify if checkpoints are asynchronous or sync

2020-01-07 Thread Zhijiang
rming whether it is happening, but maybe it is not very convenient. Another possible way is via the checkpoint metrics which would record the sync/async duration time, maybe it can also satisfy your requirements. Best, Zhi

Re: How can I find out which key group belongs to which subtask

2020-01-10 Thread Zhijiang
implemented some enhancements in scheduler layer to support such requirement in release-1.10. You can have a try when the rc candidate is ready. Best, Zhijiang -- From:杨东晓 Send Time:2020 Jan. 10 (Fri.) 02:10 To:Congxian Qiu Cc:user

Re: Exactly once semantics for hdfs sink

2020-02-11 Thread Zhijiang
Hi Vishwas, I guess this link [1] can help understand how it works and how to use in practice for StreamingFileSink. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/streamfile_sink.html Best, Zhijiang

Re: [ANNOUNCE] Apache Flink 1.10.0 released

2020-02-12 Thread Zhijiang
Really great work and thanks everyone involved, especially for the release managers! Best, Zhijiang -- From:Kurt Young Send Time:2020 Feb. 13 (Thu.) 11:06 To:[None] Cc:user ; dev Subject:Re: [ANNOUNCE] Apache Flink 1.10.0 released

Re: Encountered error while consuming partitions

2020-02-13 Thread Zhijiang
Best, Zhijiang -- From:张光辉 Send Time:2020 Feb. 12 (Wed.) 22:19 To:Benchao Li Cc:刘建刚 ; user Subject:Re: Encountered error while consuming partitions Network can fail in many ways, sometimes pretty subtle (e.g. high ratio packet

Re: The mapping relationship between Checkpoint subtask id and Task subtask id

2020-02-13 Thread Zhijiang
If the id is not consistent in different parts, maybe it is worth creating a jira ticket for better improving the user experience. If anyone wants to work on it, please ping me then I can give a hand. Best, Zhijiang -- From:Yun Tang

Re: Re: The mapping relationship between Checkpoint subtask id and Task subtask id

2020-02-13 Thread Zhijiang
Let's move the further discussion onto the jira page. I have not much time recently for working on this. If you want to take it, I can assign it to you and help review the PR if have time then. Or I can find other possible guys work on it future. Best, Zhi

Re: Re: The mapping relationship between Checkpoint subtask id and Task subtask id

2020-02-13 Thread Zhijiang
BTW, the FLIP-75 is going for the user experience of web UI. @Yadong Xiehave we already considered this issue to unify the ids in different parts in FLIP-75? Best, Zhijiang -- From:Zhijiang Send Time:2020 Feb. 14 (Fri.) 13:03

Re: [ANNOUNCE] Jingsong Lee becomes a Flink committer

2020-02-20 Thread Zhijiang
Congrats Jingsong! Welcome on board! Best, Zhijiang -- From:Zhenghua Gao Send Time:2020 Feb. 21 (Fri.) 12:49 To:godfrey he Cc:dev ; user Subject:Re: [ANNOUNCE] Jingsong Lee becomes a Flink committer Congrats Jingsong! Best

Re: How JobManager and TaskManager find each other?

2020-02-26 Thread Zhijiang
ld be sent to the required peers during task schedule and deployment. Best, Zhijiang -- From:KristoffSC Send Time:2020 Feb. 26 (Wed.) 19:39 To:user Subject:Re: How JobManager and TaskManager find each other? Thanks all for the an

Re: Question: Determining Total Recovery Time

2020-02-26 Thread Zhijiang
metrics are `0` in light-weight situation, which i mentioned above. So we can not estimate the saturation unless we increase the source emit. Wish good news sharing from you! Best, Zhijiang -- From:Arvid Heise Send Time:2020 Feb

回复:checkpoint/taskmanager is stuck, deadlock on LocalBufferPool

2018-10-23 Thread zhijiang
output buffers are not consumed by downstream tasks. I think you can check the downstream task which inqueue usage should reach 100%, then jstack the corresponding downstream tasks that may stuck in some operations to cause back pressure. Best, Zhijiang

回复:OutOfMemoryError while doing join operation in flink

2018-11-22 Thread zhijiang
, Zhijiang -- 发件人:Akshay Mendole 发送时间:2018年11月22日(星期四) 13:43 收件人:user 主 题:OutOfMemoryError while doing join operation in flink Hi, We are converting one of our pig pipelines to flink using apache beam. The pig pipeline reads two

回复:OutOfMemoryError while doing join operation in flink

2018-11-22 Thread zhijiang
So you can decrease the "taskmanager.memory.fraction" in low fraction or increase the total task manager to cover this overhead memories, or set one slot for each task manager. Best, Zhijiang -- 发件人:Akshay Mendole 发送时间

回复:Flink job failing due to "Container is running beyond physical memory limits" error.

2018-11-25 Thread zhijiang
I think it is probably related with rockdb memory usage if you have not found OutOfMemory issue before. There already existed a jira ticket [1] for fixing this issue, and you can watch it for updates. :) [1] https://issues.apache.org/jira/browse/FLINK-10884 Best, Zhijiang

回复:回复:Flink job failing due to "Container is running beyond physical memory limits" error.

2018-11-26 Thread zhijiang
It may work aournd by increasing the task manager memory size. The recover failure is up to serveral issues, whether it had successful checkpoint before, the states are available and what is the failover strategy? Best, Zhijiang

回复:buffer pool is destroyed

2018-12-21 Thread zhijiang
Hi Shuang, Normally this exception you mentioned is not the root cause of failover, and it is mainly caused by cancel process to make task exit. You can further check whether there are other failures in job master log to find the root cause. Best, Zhijiang

Re: Buffer stats when Back Pressure is high

2019-01-07 Thread zhijiang
B is empty. Best, Zhijiang -- From:Gagan Agrawal Send Time:2019年1月7日(星期一) 12:06 To:user Subject:Buffer stats when Back Pressure is high Hi, I want to understand does any of buffer stats help in debugging / validating that

Re: ConnectTimeoutException when createPartitionRequestClient

2019-01-09 Thread zhijiang
number of netty thread and timeout should make sense for normal cases. Best, Zhijiang -- From:Wenrui Meng Send Time:2019年1月9日(星期三) 18:18 To:Till Rohrmann Cc:user ; Konstantin Subject:Re: ConnectTimeoutException when

Re: [ANNOUNCE] New Flink PMC member Thomas Weise

2019-02-12 Thread zhijiang
Congrats Thomas! Best, Zhijiang -- From:Kostas Kloudas Send Time:2019年2月12日(星期二) 22:46 To:Jark Wu Cc:Hequn Cheng ; Stefan Richter ; user Subject:Re: [ANNOUNCE] New Flink PMC member Thomas Weise Congratulations Thomas! Best

Re: [DISCUSS] Adding a mid-term roadmap to the Flink website

2019-02-14 Thread zhijiang
the interested one and handle the progress of it. Best, Zhijiang -- From:Jeff Zhang Send Time:2019年2月14日(星期四) 18:03 To:Stephan Ewen Cc:dev ; user ; jincheng sun ; Shuyi Chen ; Rong Rong Subject:Re: [DISCUSS] Adding a mid-term

Re: Confusion in Heartbeat configurations

2019-02-18 Thread zhijiang
implementation prefix with `akka` and the other is flink internal implementation. Best, Zhijiang -- From:sohimankotia Send Time:2019年2月18日(星期一) 14:40 To:user Subject:Confusion in Heartbeat configurations Hi, In https://ci.apache.org

Re: Flink performance drops when async checkpoint is slow

2019-02-28 Thread zhijiang
task TPS are not decreased for a period as before, I think we could confirm the above analysis. :) Best, Zhijiang -- From:Paul Lam Send Time:2019年2月28日(星期四) 15:17 To:user Subject:Flink performance drops when async checkpoint is

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-02-28 Thread zhijiang
is suitable for your scenario and may have a try. Best, Zhijiang -- From:LINZ, Arnaud Send Time:2019年2月28日(星期四) 17:28 To:user Subject:Checkpoints and catch-up burst (heavy back pressure) Hello, I have a simple streaming app that get da

Re: Flink performance drops when async checkpoint is slow

2019-02-28 Thread zhijiang
which operation delays the task to cause the backpressure, and this operation might be involved with HDFS. :) Best, Zhijiang -- From:Paul Lam Send Time:2019年2月28日(星期四) 19:17 To:zhijiang Cc:user Subject:Re: Flink performance drops

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-02-28 Thread zhijiang
one single downstream task (`a` is the parallelism of source vertex), because it is all-to-all connection. The barrier alignment takes more time in rebalance mode than forward mode. Best, Zhijiang -- From:LINZ, Arnaud Send Time

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-03-03 Thread zhijiang
queued in front of barriers. This is the right way to try and wish your solution with 2 parameters work. Best, Zhijiang -- From:LINZ, Arnaud Send Time:2019年3月2日(星期六) 16:45 To:zhijiang ; user Subject:RE: Checkpoints and catch-up

Re: Flink credit based flow control

2019-03-11 Thread zhijiang
what is the flush timeout you config. Also you can trace the current metrics of outqueue.usages|length and inqueue.usags|length to find something. Best, Zhijiang -- From:Brian Ramprasad Send Time:2019年3月12日(星期二) 03:47 To:user

Re: [ANNOUNCE] Apache Flink 1.8.0 released

2019-04-10 Thread zhijiang
Cool! Finally see the FLINK 1.8.0 release. Thanks Aljoscha for this excellent work and efforts for other contributors. We would continue working hard for FLINK 1.9.0 Best, Zhijiang -- From:vino yang Send Time:2019年4月10日(星期三) 17

Re: Question regarding "Insufficient number of network buffers"

2019-04-11 Thread zhijiang
ted. Best, Zhijiang -- From:Xiangfeng Zhu Send Time:2019年4月12日(星期五) 08:03 To:user Subject:Question regarding "Insufficient number of network buffers" Hello, My name is Allen, and I'm currently researching different distr

Re: Netty channel closed at AKKA gated status

2019-04-14 Thread zhijiang
Hi Wenrui, I think the akka gated issue and inactive netty channel are both caused by some task manager exits/killed. You should double check the status and reason of this task manager `'athena592-phx2/10.80.118.166:44177'`. Best

Re: Retain metrics counters across task restarts

2019-04-14 Thread zhijiang
restarted. Best, Zhijiang -- From:Peter Zende Send Time:2019年4月14日(星期日) 00:25 To:user Subject:Retain metrics counters across task restarts Hi all We're exposing Prometheus metrics from our Flink (v1.7.1) pipeline to Prome

Re: Can back pressure data be gathered by Flink metric system?

2019-04-14 Thread zhijiang
Hi Henry, The backpressure tracking is not realized in metric framework, you could check the details via [1]. I am not sure why your requirements is showing backpressure in metrics. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html Best, Zhijiang

Re: Can back pressure data be gathered by Flink metric system?

2019-04-15 Thread zhijiang
length/usage on consumer side. Although it is not very accurate sometimes, it could provide some hints of backpressure, because the outqueue and inqueue should be filled with buffers between producer and consumer when backpressure occurs. Best, Zhijiang

Re: Netty channel closed at AKKA gated status

2019-04-15 Thread zhijiang
Hi Wenrui, You might further check whether there exists network connection issue between job master and target task executor if you confirm the target task executor is still alive. Best, Zhijiang -- From:Biao Liu Send Time:2019年4

Re: Netty channel closed at AKKA gated status

2019-04-21 Thread zhijiang
Hi Wenrui, I think you could trace the log of node manager which contains the lifecycle of this task executor. Maybe this task executor is killed by node manager because of memory overuse. Best, Zhijiang -- From:Wenrui Meng Send

Re: 回复:Memory Allocate/Deallocate related Thread Deadlock encountered when running a large job > 10k tasks

2019-05-17 Thread zhijiang
one task triggers another task to release memory in the same TM. Or you could increase the network buffer setting to work aournd, but not sure this way could work for your case because it is up to the total data size the source produced. Best, Zhijiang

Re: 回复:Memory Allocate/Deallocate related Thread Deadlock encountered when running a large job > 10k tasks

2019-05-17 Thread zhijiang
bugs in previous flink versions. [1] https://issues.apache.org/jira/browse/FLINK-12544 Best, Zhijiang -- From:Narayanaswamy, Krishna Send Time:2019年5月17日(星期五) 19:00 To:zhijiang ; Aljoscha Krettek ; Piotr Nowojski Cc:Nico Kruber

Re: 回复:Memory Allocate/Deallocate related Thread Deadlock encountered when running a large job > 10k tasks

2019-05-21 Thread zhijiang
Hi Krishna, Could you show me or attach the jstack for the single slot case? Or is it the same jstack as before? Best, Zhijiang -- From:Narayanaswamy, Krishna Send Time:2019年5月21日(星期二) 19:50 To:zhijiang ; Aljoscha Krettek

Re: 回复:Memory Allocate/Deallocate related Thread Deadlock encountered when running a large job > 10k tasks

2019-05-21 Thread zhijiang
, Zhijiang -- From:Narayanaswamy, Krishna Send Time:2019年5月22日(星期三) 00:49 To:zhijiang ; Aljoscha Krettek ; Piotr Nowojski Cc:Nico Kruber ; user@flink.apache.org ; "Chan, Regina" ; "Erai, Rahul" Subject:RE

Re: 回复:Memory Allocate/Deallocate related Thread Deadlock encountered when running a large job > 10k tasks

2019-06-04 Thread zhijiang
for review atm. You could pick the code in PR to verfiy the results if you like. And the next release-1.8.1 might cover this fix as well. Best, Zhijiang -- From:Erai, Rahul Send Time:2019年6月4日(星期二) 15:50 To:zhijiang ; Aljoscha

Re: 回复:Memory Allocate/Deallocate related Thread Deadlock encountered when running a large job > 10k tasks

2019-06-04 Thread zhijiang
ishna" Cc:Nico Kruber ; user@flink.apache.org ; "Chan, Regina" Subject:RE: 回复:Memory Allocate/Deallocate related Thread Deadlock encountered when running a large job > 10k tasks Thanks Zhijiang. Can you point us to the JIRA for your fix? Regards, -Rahul From: zhijiang Sent: Tue

Re: [DISCUSS] Deprecate previous Python APIs

2019-06-11 Thread zhijiang
It is reasonable as stephan explained. +1 from my side! -- From:Jeff Zhang Send Time:2019年6月11日(星期二) 22:11 To:Stephan Ewen Cc:user ; dev Subject:Re: [DISCUSS] Deprecate previous Python APIs +1 Stephan Ewen 于2019年6月11日周二 下午9:30写道

Re: Apache Flink - Disabling system metrics and collecting only specific metrics

2019-06-11 Thread zhijiang
could implement a custom MetricReporter, and then only consentrate on your required application metrics in the method of `MetricReporter#notifyOfAddedMetric` to show them in backend. Best, Zhijiang -- From:M Singh Send Time:2019年6月

Re: java.io.FileNotFoundException in implementing exactly once

2019-06-11 Thread zhijiang
you could double check this dir for the issue. In addition I suggestt you upgrading the flink version because flink-1.3.3 is too old. After upgrading to flink-1.5 above, you do not need to consider this issue, because the exactly-once mode would not spill data to disk any more. Best, Zhijiang

Re: A little doubt about the blog A Deep To Flink's NetworkStack

2019-06-16 Thread zhijiang
/browse/FLINK-10462 Best, Zhijiang -- From:aitozi Send Time:2019年6月16日(星期日) 22:19 To:user Subject:A little doubt about the blog A Deep To Flink's NetworkStack Hi, community I read this blog A Deep To Flink's NetworkSt

Re: Maybe a flink bug. Job keeps in FAILING state

2019-06-19 Thread zhijiang
As long as one task is in canceling state, then the job status might be still in canceling state. @Joshua Do you confirm all of the tasks in topology were already in terminal state such as failed or canceled? Best, Zhijiang

Re: Maybe a flink bug. Job keeps in FAILING state

2019-06-21 Thread zhijiang
to check the task final state. Best, Zhijiang -- From:Joshua Fan Send Time:2019年6月20日(星期四) 11:55 To:zhijiang Cc:user ; Till Rohrmann ; Chesnay Schepler Subject:Re: Maybe a flink bug. Job keeps in FAILING state zhijiang I did not

Re: Maybe a flink bug. Job keeps in FAILING state

2019-06-21 Thread zhijiang
exit to solve the potential issue. Best, Zhijiang -- From:Chesnay Schepler Send Time:2019年6月21日(星期五) 16:34 To:zhijiang ; Joshua Fan Cc:user ; Till Rohrmann Subject:Re: Maybe a flink bug. Job keeps in FAILING state The logs are at

Re: Maybe a flink bug. Job keeps in FAILING state

2019-06-24 Thread zhijiang
final decision. Best, Zhijiang -- From:Joshua Fan Send Time:2019年6月25日(星期二) 11:10 To:zhijiang Cc:Chesnay Schepler ; user ; Till Rohrmann Subject:Re: Maybe a flink bug. Job keeps in FAILING state Hi Zhijiang Thank you for your

Re: [ANNOUNCE] Rong Rong becomes a Flink committer

2019-07-11 Thread zhijiang
Congratulations Rong! Best, Zhijiang -- From:Kurt Young Send Time:2019年7月11日(星期四) 22:54 To:Kostas Kloudas Cc:Jark Wu ; Fabian Hueske ; dev ; user Subject:Re: [ANNOUNCE] Rong Rong becomes a Flink committer Congratulations Rong

回复:DataSet with Multiple reduce Actions

2018-06-27 Thread Zhijiang(wangzhijiang999)
Hi Osh, As I know, currently one dataset source can not be consumed by several different vertexs and from the API you can not construct the topology for your request. I think your way to merge different reduce functions into one UDF is feasible. Maybe someone has better solution. :) zhijiang

回复:Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-02 Thread Zhijiang(wangzhijiang999)
whether and where caused the OOM. Maybe check the task failure logs. Zhijiang -- 发件人:gerardg 发送时间:2018年6月30日(星期六) 00:12 收件人:user 主 题:Re: Flink job hangs/deadlocks (possibly related to out of memory) (fixed formatting) Hello

回复:Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-02 Thread Zhijiang(wangzhijiang999)
to trigger restarting the job. Zhijiang -- 发件人:Gerard Garcia 发送时间:2018年7月2日(星期一) 18:29 收件人:wangzhijiang999 抄 送:user 主 题:Re: Flink job hangs/deadlocks (possibly related to out of memory) Thanks Zhijiang, We haven't found any

回复:Limiting in flight data

2018-07-05 Thread Zhijiang(wangzhijiang999)
taskmanager.network.memory.floating-buffers-per-gate. If you have other questions about them, let me know then i can explain for you. Zhijiang -- 发件人:Vishal Santoshi 发送时间:2018年7月5日(星期四) 22:28 收件人:user 主 题:Limiting in flight data "Yes,

回复:Handling back pressure in Flink.

2018-07-05 Thread Zhijiang(wangzhijiang999)
not cause OOM). I think you should not worry about that. Normally it is better to consider TPS of both sides and set the proper paralellism to avoid back pressure to some extent. Zhijiang -- 发件人:Mich Talebzadeh 发送时间:2018年7月4日(星期三

回复:Limiting in flight data

2018-07-08 Thread Zhijiang(wangzhijiang999)
trics for some helps. -- 发件人:Vishal Santoshi 发送时间:2018年7月6日(星期五) 22:05 收件人:Zhijiang(wangzhijiang999) 抄 送:user 主 题:Re: Limiting in flight data Further if there is are metrics that allows us to chart delays per pipe on n/w buffers, that would be immensely help

回复:Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-13 Thread Zhijiang(wangzhijiang999)
framework. Also you can monitor the gc status to check the full gc delay. Best, Zhijiang -- 发件人:Gerard Garcia 发送时间:2018年7月13日(星期五) 16:22 收件人:wangzhijiang999 抄 送:user 主 题:Re: Flink job hangs/deadlocks (possibly related to out of m

回复:Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-17 Thread Zhijiang(wangzhijiang999)
for lock which is also occupied by task output process. As you mentioned, it makes sense to check the data structure of the output record and reduces the size or make it lightweight to handle. Best, Zhijiang -- 发件人:Gerard Garcia

回复:Network PartitionNotFoundException when run on multi nodes

2018-07-22 Thread Zhijiang(wangzhijiang999)
askManager received the task deployment delayed from JobManager, or some operations in upstream task initialization unexpectly cost more time before registering result partition. Best, Zhijiang -- 发件人:Steffen Wohlers 发送时间:2018年7月22日(星期日)

回复:Kryo Serialization Issue

2018-08-28 Thread Zhijiang(wangzhijiang999)
buffers in record serializers. If the record size is large and the downstream parallelism is large, it may cause OOM issue in serialization. Could you show the stack of OOM part? If it is this case, the following [1] can solve it and it is working in progress. Zhijiang [1] https

回复:Backpressure? for Batches

2018-08-29 Thread Zhijiang(wangzhijiang999)
I remember, that means the downstream will be scheduled after upstream finishes, so the slower downstream will not block upstream running, then the backpressure may not exist in this case. Best, Zhijiang -- 发件人:Darshan Singh 发送时间

回复:Backpressure? for Batches

2018-08-29 Thread Zhijiang(wangzhijiang999)
(groupby in your case) or decrease the parallelism of fast node(source in your case). Best, Zhijiang -- 发件人:Darshan Singh 发送时间:2018年8月29日(星期三) 18:16 收件人:chesnay 抄 送:wangzhijiang999 ; user 主 题:Re: Backpressure? for Batches Thanks, Now

回复:Backpressure? for Batches

2018-08-29 Thread Zhijiang(wangzhijiang999)
You can check the log to show the related stack in OOM, maybe we can confirm some reasons. Or you can dump the heap to analyze the memory usages after OOM. Best, Zhijiang -- 发件人:Darshan Singh 发送时间:2018年8月29日(星期三) 19:22 收件人

回复:Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.

2018-09-07 Thread Zhijiang(wangzhijiang999)
, Zhijiang -- 发件人:杨力 发送时间:2018年9月7日(星期五) 13:09 收件人:user 主 题:Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed. Hi all, I am encountering a weird problem when running flink 1.6 in yarn per-job clusters. The job

  1   2   >