Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-27 Thread Gen Luo
; approach > > > > > >> to harden the contract of the interface. +1 for that proposal > > > > > >> > > > > > >> - async operation: I think David is right. An async interface > makes > > > > the > > > > > >> listene

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-21 Thread Gen Luo
Hi Panagiotis, Thanks for the proposal. It's useful to enrich the information so that users can be more clear why the job is failing, especially platform developers who need to provide the information to their end users. And for the very FLIP, I'd prefer the naming `FailureEnricher` proposed by D

Re: [DISCUSS]Introduce a time-segment based restart strategy

2022-11-25 Thread Gen Luo
Hi all, Sorry for the late jumping in. To meet Weihua's need, Dong's proposal seems pretty fine, but the modification it requires, I'm afraid, is not really easy. RestartBackoffTimeStrategy is quite a simple interface. The strategy even doesn't know which task is failing, not to mention the divis

[jira] [Created] (FLINK-29927) AkkaUtils#getAddress may cause memory leak

2022-11-08 Thread Gen Luo (Jira)
Gen Luo created FLINK-29927: --- Summary: AkkaUtils#getAddress may cause memory leak Key: FLINK-29927 URL: https://issues.apache.org/jira/browse/FLINK-29927 Project: Flink Issue Type: Bug

[jira] [Created] (FLINK-28864) DynamicPartitionPruningRule#isNewSource should check if the source used by the DataStreamScanProvider is actually a new sourc

2022-08-08 Thread Gen Luo (Jira)
Gen Luo created FLINK-28864: --- Summary: DynamicPartitionPruningRule#isNewSource should check if the source used by the DataStreamScanProvider is actually a new sourc Key: FLINK-28864 URL: https://issues.apache.org/jira

Re: [DISCUSS] Replace Attempt column with Attempt Number on the subtask list page of the Web UI

2022-07-20 Thread Gen Luo
id that is > different from the corresponding attempt number in REST, metrics and logs. > It adds burden to users to do the mapping in troubleshooting. Mis-mapping > can be easy to happen and result in a waste of efforts and wrong > conclusion. > > Therefore, +1 for this proposal.

[DISCUSS] Replace Attempt column with Attempt Number on the subtask list page of the Web UI

2022-07-20 Thread Gen Luo
Hi everyone, I'd like to propose a change on the Web UI to replace the Attempt column with an Attempt Number column on the subtask list page. >From the very beginning, the attempt number shown is calculated at the frontend by subtask.attempt + 1, which means the attempt number shown on the web UI

[jira] [Created] (FLINK-28589) Enhance Web UI for Speculative Execution

2022-07-18 Thread Gen Luo (Jira)
Gen Luo created FLINK-28589: --- Summary: Enhance Web UI for Speculative Execution Key: FLINK-28589 URL: https://issues.apache.org/jira/browse/FLINK-28589 Project: Flink Issue Type: Sub-task

[jira] [Created] (FLINK-28588) Enhance REST API for Speculative Execution

2022-07-18 Thread Gen Luo (Jira)
Gen Luo created FLINK-28588: --- Summary: Enhance REST API for Speculative Execution Key: FLINK-28588 URL: https://issues.apache.org/jira/browse/FLINK-28588 Project: Flink Issue Type: Sub-task

[jira] [Created] (FLINK-28587) FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-18 Thread Gen Luo (Jira)
Gen Luo created FLINK-28587: --- Summary: FLIP-249: Flink Web UI Enhancement for Speculative Execution Key: FLINK-28587 URL: https://issues.apache.org/jira/browse/FLINK-28587 Project: Flink Issue

[RESULT][VOTE] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-15 Thread Gen Luo
Hi everyone, I’m happy to announce that FLIP-249[1] has been accepted, with 4 approving votes, 3 of which are binding[2]: - Zhu Zhu (binding) - Lijie Wang - Jing Zhang (binding) - Yun Gao (binding) There is no disapproving vote. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-249%3A+F

Re: [DISCUSS] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-13 Thread Gen Luo
a more complicated operation to check all vertex and all subtasks list > page. > It's better to have an easier way to know whether the job contains > speculative executions > even after the job finished. > Maybe the point could be took into consideration in the next versio

Re: [VOTE] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-12 Thread Gen Luo
Hi Jing, I have replied in the discussion thread about the questions. Hope that would be helpful. Best, Gen On Tue, Jul 12, 2022 at 8:43 PM Jing Zhang wrote: > Hi, Gen Luo, > > I left two minor questions in the DISCUSS thread. > Sorry for jumping into the discussion so late. >

Re: [DISCUSS] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-12 Thread Gen Luo
? > 2. How to know whether the job contains speculative execution instances > after the job finished? > Do we have to check each subtasks of all vertex one by one? > > Best, > Jing Zhang > > Gen Luo 于2022年7月11日周一 22:31写道: > > > Hi, everyone. > > > > Than

[VOTE] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-11 Thread Gen Luo
Hi everyone, Thanks for all the feedback so far. Based on the discussion [1], we seem to have consensus. So, I would like to start a vote on FLIP-249 [2]. The vote will last for at least 72 hours unless there is an objection or insufficient votes. [1] https://lists.apache.org/thread/832tk3zvy

Re: [DISCUSS] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-11 Thread Gen Luo
Hi, everyone. Thanks for your feedback. If there are no more concerns or comments, I will start the vote tomorrow. Gen Luo 于 2022年7月11日周一 11:12写道: > Hi Lijie and Zhu, > > Thanks for the suggestion. I agree that the name "Blocked Free Slots" is > more clear to users. >

Re: [DISCUSS] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-10 Thread Gen Luo
unblocked, I think it's OK to call this state "available". > > 2. free and blocked, I think it's not appropriate to call "blocked" > > directly, because "blocked" should include both the "free and blocked" > and > > &quo

Re: [DISCUSS] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-08 Thread Gen Luo
mpts. The set will have one only element in > non-speculative cases though. In this way, we can have a unified > processing for ArchivedExecutionVertex in speculative/non-speculative > cases. > > Thanks, > Zhu > > Gen Luo 于2022年7月5日周二 15:10写道: > > > > > Hi every

[DISCUSS] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-05 Thread Gen Luo
Hi everyone, The speculative execution for batch jobs has been proposed and accepted in FLIP-168[1], as well as the related blocklist mechanism in FLIP-224[2]. As a follow-up step, the Flink Web UI needs to be enhanced to display the related information if the speculative execution mechanism is en

[jira] [Created] (FLINK-28240) NettyShuffleMetricFactory#RequestedMemoryUsageMetric#getValue may throw ArithmeticException when the total segments of NetworkBufferPool is 0

2022-06-24 Thread Gen Luo (Jira)
Gen Luo created FLINK-28240: --- Summary: NettyShuffleMetricFactory#RequestedMemoryUsageMetric#getValue may throw ArithmeticException when the total segments of NetworkBufferPool is 0 Key: FLINK-28240 URL: https

Re: [VOTE] FLIP-232: Add Retry Support For Async I/O In DataStream API

2022-05-30 Thread Gen Luo
+1 (non-binding) On Mon, May 30, 2022 at 3:50 PM Jark Wu wrote: > +1 (binding) > > Best, > Jark > > On Mon, 30 May 2022 at 15:40, Lincoln Lee wrote: > > > Dear Flink developers, > > > > Thanks for all your feedback for FLIP-232: Add Retry Support For Async > I/O > > In DataStream API[1] on the

Re: [DISCUSS] FLIP-232: Add Retry Support For Async I/O In DataStream API

2022-05-23 Thread Gen Luo
周二 10:20写道: > Thanks Gen Luo! > > Agree with you that prefer the simpler design. > > I’d like to share my thoughts on this choice: whether store the retry state > or not only affect the recovery logic, not the per-record processing, so I > just compare the two: > 1. w/ retry state:

Re: [DISCUSS] FLIP-232: Add Retry Support For Async I/O In DataStream API

2022-05-23 Thread Gen Luo
ution, in which the retry state is still possible to add if the need really arises in the future, but I respect your decision. 2. I think adding a currentAttempts parameter to the method is good enough. Lincoln Lee 于 2022年5月23日周一 14:52写道: > Hi Gen Luo, > Thanks a lot for your feed

Re: [DISCUSS] FLIP-232: Add Retry Support For Async I/O In DataStream API

2022-05-22 Thread Gen Luo
Thank Lincoln for the proposal! The FLIP looks good to me. I'm in favor of the timer based implementation, and I'd like to share some thoughts. I'm thinking if we have to store the retry status in the state. I suppose the retrying requests can just submit as the first attempt when the job restore

[jira] [Created] (FLINK-26610) FileSink can not upgrade from 1.13 if the uid of the origin sink is not set.

2022-03-11 Thread Gen Luo (Jira)
Gen Luo created FLINK-26610: --- Summary: FileSink can not upgrade from 1.13 if the uid of the origin sink is not set. Key: FLINK-26610 URL: https://issues.apache.org/jira/browse/FLINK-26610 Project: Flink

[jira] [Created] (FLINK-26580) FileSink CompactCoordinator add illegal committable as toCompacted.

2022-03-10 Thread Gen Luo (Jira)
Gen Luo created FLINK-26580: --- Summary: FileSink CompactCoordinator add illegal committable as toCompacted. Key: FLINK-26580 URL: https://issues.apache.org/jira/browse/FLINK-26580 Project: Flink

[jira] [Created] (FLINK-26564) CompactCoordinatorStateHandler doesn't properly handle the cleanup-in-progress requests.

2022-03-09 Thread Gen Luo (Jira)
Gen Luo created FLINK-26564: --- Summary: CompactCoordinatorStateHandler doesn't properly handle the cleanup-in-progress requests. Key: FLINK-26564 URL: https://issues.apache.org/jira/browse/FLINK-

[jira] [Created] (FLINK-26440) CompactorOperatorStateHandler can not work with unaligned checkpoint

2022-03-01 Thread Gen Luo (Jira)
Gen Luo created FLINK-26440: --- Summary: CompactorOperatorStateHandler can not work with unaligned checkpoint Key: FLINK-26440 URL: https://issues.apache.org/jira/browse/FLINK-26440 Project: Flink

[jira] [Created] (FLINK-26394) CheckpointCoordinator.isTriggering can not be reset if a checkpoint expires while the checkpointCoordinator task is queuing in the SourceCoordinator executor.

2022-02-28 Thread Gen Luo (Jira)
Gen Luo created FLINK-26394: --- Summary: CheckpointCoordinator.isTriggering can not be reset if a checkpoint expires while the checkpointCoordinator task is queuing in the SourceCoordinator executor. Key: FLINK-26394

[jira] [Created] (FLINK-26235) CompactingFileWriter and PendingFileRecoverable should not be exposed to users.

2022-02-17 Thread Gen Luo (Jira)
Gen Luo created FLINK-26235: --- Summary: CompactingFileWriter and PendingFileRecoverable should not be exposed to users. Key: FLINK-26235 URL: https://issues.apache.org/jira/browse/FLINK-26235 Project: Flink

[jira] [Created] (FLINK-26180) Update docs to introduce the compaction for FileSink

2022-02-16 Thread Gen Luo (Jira)
Gen Luo created FLINK-26180: --- Summary: Update docs to introduce the compaction for FileSink Key: FLINK-26180 URL: https://issues.apache.org/jira/browse/FLINK-26180 Project: Flink Issue Type: Sub

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-08 Thread Gen Luo
quite. >> >> Best, >> Piotrek >> >> wt., 8 lut 2022 o 11:44 Chesnay Schepler napisał(a): >> >> > Could someone expand on these operational issues you're facing when >> > achieving this via separate jobs? >> > >> > I feel

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-08 Thread Gen Luo
t; "it should be no different than no data processed between CP8 and CP10" > > 2. I've noticed that from this question there is a gap between > "*allow aborted/failed checkpoint in independent sub-graph*" and > my intention: "*independent sub-graph checkpointin

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-07 Thread Gen Luo
points independently? > > > > Conceptually, I somehow think that a pipelined region that is failed > and > > > > cannot create a new checkpoint is more or less the same as a > pipelined > > > > region that didn't get new input or a very very slow pipel

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-07 Thread Gen Luo
o cases. > > Cheers, > Till > > On Mon, Feb 7, 2022 at 8:55 AM Gen Luo wrote: > > > Hi Gyula, > > > > Thanks for sharing the idea. As Yuan mentioned, I think we can discuss > this > > within two scopes. One is the job subgraph, the other is the execution

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-06 Thread Gen Luo
Hi Gyula, Thanks for sharing the idea. As Yuan mentioned, I think we can discuss this within two scopes. One is the job subgraph, the other is the execution subgraph, which I suppose is the same as PipelineRegion. An idea is to individually checkpoint the PipelineRegions, for the recovering in a

Re: [VOTE] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-20 Thread Gen Luo
+1 (non-binding) On Thu, Jan 20, 2022 at 3:26 PM Yun Gao wrote: > +1 (binding) > > Thanks Xuannan for driving this! > > Best, > Yun > > > -- > From:David Morávek > Send Time:2022 Jan. 20 (Thu.) 15:12 > To:dev > Subject:Re: [VOTE]

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-06 Thread Gen Luo
in the last hour. The > > Flink application consists of multiple batch jobs and the batch jobs > > share some intermediate results, so users can use cache to avoid > > re-computation. The intermediate result is not meaningful outside of > > the application. And the cache

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-30 Thread Gen Luo
Hi Xuannan, I found FLIP-188[1] that is aiming to introduce a built-in dynamic table storage, which provides a unified changelog & table representation. Tables stored there can be used in further ad-hoc queries. To my understanding, it's quite like an implementation of caching in Table API, and th

[jira] [Created] (FLINK-24965) Improper usage of Map.Entry after Entry Iterator.remove in TaskLocaStateStoreImpl#pruneCheckpoints

2021-11-19 Thread Gen Luo (Jira)
Gen Luo created FLINK-24965: --- Summary: Improper usage of Map.Entry after Entry Iterator.remove in TaskLocaStateStoreImpl#pruneCheckpoints Key: FLINK-24965 URL: https://issues.apache.org/jira/browse/FLINK-24965

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Gen Luo
> Hbase). If most of those calls are very fast, sometimes when the system >> is >> > under heavy load they may block more than a few seconds, and having our >> app >> > killed because of a short timeout is not an option. >> > >> > >> > >

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-21 Thread Gen Luo
Hi, Thanks for driving this @Till Rohrmann . I would give +1 on reducing the heartbeat timeout and interval, though I'm not sure if 15s and 3s would be enough either. IMO, except for the standalone cluster, where the heartbeat mechanism in Flink is totally relied, reducing the heartbeat can also

Re: Job Recovery Time on TM Lost

2021-07-08 Thread Gen Luo
In our inner >> flink version, we optimize it by task's report and jobmaster's probe. When >> a task fails because of the connection, it reports to the jobmaster. The >> jobmaster will try to confirm the liveness of the unconnected >> taskmanager for certain times by con

Re: Job Recovery Time on TM Lost

2021-07-06 Thread Gen Luo
one could say that one can disable reacting to a failed heartbeat RPC as it > is currently the case. > > We currently have a discussion about this on this PR [1]. Maybe you wanna > join the discussion there and share your insights. > > [1] https://github.com/apache/flink/pull/16357 &

Re: Job Recovery Time on TM Lost

2021-07-05 Thread Gen Luo
ailability of the remote system (e.g. a couple of lost heartbeat > messages). > > Cheers, > Till > > On Mon, Jul 5, 2021 at 10:00 AM Gen Luo wrote: > >> As far as I know, a TM will report connection failure once its connected >> TM is lost. I suppose JM can believe the

Re: Job Recovery Time on TM Lost

2021-07-05 Thread Gen Luo
TaskExecutors as fast as the heartbeat interval. > > [1] https://issues.apache.org/jira/browse/FLINK-23209 > > Cheers, > Till > > On Fri, Jul 2, 2021 at 9:33 AM Gen Luo wrote: > >> Thanks for sharing, Till and Yang. >> >> @Lu >> Sorry but I don't know

Re: Job Recovery Time on TM Lost

2021-07-02 Thread Gen Luo
least `heartbeat.timeout` time before the system recovers. Even >>>>>> if >>>>>> the cancellation happens fast (e.g. by having configured a low >>>>>> akka.ask.timeout), then Flink will still try to deploy tasks onto the >>&g

[jira] [Created] (FLINK-23216) RM keeps allocating and freeing slots after a TM lost until its heartbeat timeout

2021-07-01 Thread Gen Luo (Jira)
Gen Luo created FLINK-23216: --- Summary: RM keeps allocating and freeing slots after a TM lost until its heartbeat timeout Key: FLINK-23216 URL: https://issues.apache.org/jira/browse/FLINK-23216 Project

Re: [DISCUSS] FLIP-39: Flink ML pipeline and ML libs

2019-06-17 Thread Gen Luo
g on the > > > replies, we will decide if we shall delete it in Flink1.9 or > > > deprecate&delete in the next release after 1.9. > > > > > > [1] > > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usag

Re: [DISCUSS] FLIP-39: Flink ML pipeline and ML libs

2019-05-21 Thread Gen Luo
k-table-planner. > > I hope I summarised our discussion correctly. > > > On 17. May 2019, at 12:20, Gen Luo wrote: > > > > Thanks for your reply. > > > > For the first question, it's not strictly necessary. But I perfer not to > > have a TableEnvironment arg

Re: [DISCUSS] FLIP-39: Flink ML pipeline and ML libs

2019-05-17 Thread Gen Luo
Thanks for your reply. For the first question, it's not strictly necessary. But I perfer not to have a TableEnvironment argument in Estimator.fit() or Transformer.transform(), which is not part of machine learning concept, and may make our API not as clean and pretty as other systems do. I would l

Re: [DISCUSS] FLIP-39: Flink ML pipeline and ML libs

2019-05-17 Thread Gen Luo
It's better not to depend on flink-table-planner indeed. It's currently needed for 3 points: registering udagg, judging the tableEnv batch or streaming, converting table to dataSet to collect data. Most of these requirements can be fulfilled by flink-table-api-java-bridge and flink-table-api-scala-

Apply for contributor permission

2019-05-09 Thread Gen Luo
Hi, Could someone add me as a contributor? My JIRA username is c4e. Thanks!