[jira] [Created] (FLINK-28103) Job cancelling api returns 404 when job is actually running
David created FLINK-28103:
------------------------------

             Summary: Job cancelling api returns 404 when job is actually running
                 Key: FLINK-28103
                 URL: https://issues.apache.org/jira/browse/FLINK-28103
             Project: Flink
          Issue Type: Bug
          Components: Runtime / REST
    Affects Versions: 1.13.5
            Reporter: David
         Attachments: image-2022-06-17-14-04-44-307.png, image-2022-06-17-14-05-36-264.png

The job is still running:
!image-2022-06-17-14-04-44-307.png!

but the cancelling api returns 404:
!image-2022-06-17-14-05-36-264.png!

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
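For context, Flink's REST API cancels a job via `PATCH /jobs/<jobid>?mode=cancel`; a 404 from this endpoint usually means the JobManager answering the request does not recognize that job ID (for example, the request reached a different cluster or a freshly restarted JobManager). A minimal sketch of issuing the call with Java 11's `HttpClient` (the endpoint and job ID below are placeholders, not values from this ticket):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CancelJob {
    // Builds the cancel URI for a given REST endpoint and job ID.
    static URI cancelUri(String restEndpoint, String jobId) {
        return URI.create(restEndpoint + "/jobs/" + jobId + "?mode=cancel");
    }

    public static void main(String[] args) {
        // Placeholder endpoint and job ID -- substitute your own.
        URI uri = cancelUri("http://localhost:8081", "1b8f9a9c0f0e4a7c9d2e3f4a5b6c7d8e");
        HttpRequest request = HttpRequest.newBuilder(uri)
                .method("PATCH", HttpRequest.BodyPublishers.noBody())
                .build();
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // 202 Accepted: cancellation triggered; 404: job ID unknown to this JobManager.
            System.out.println(response.statusCode() + " " + response.body());
        } catch (java.io.IOException | InterruptedException e) {
            System.out.println("Request failed (no cluster reachable): " + e.getMessage());
        }
    }
}
```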
[jira] [Commented] (FLINK-28103) Job cancelling api returns 404 when job is actually running
[ https://issues.apache.org/jira/browse/FLINK-28103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555842#comment-17555842 ]

David commented on FLINK-28103:
-------------------------------

[~martijnvisser] I'll try that next week
[jira] [Commented] (FLINK-28103) Job cancelling api returns 404 when job is actually running
[ https://issues.apache.org/jira/browse/FLINK-28103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555843#comment-17555843 ]

David commented on FLINK-28103:
-------------------------------

[~chesnay] PATCH is supported neither by Java's HttpURLConnection nor by HTTP tools such as Insomnia. The X-HTTP-Method-Override header is a workaround, but it does not seem to work with the Flink REST service.
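To make the workaround concrete: `java.net.http.HttpClient` (Java 11+) accepts arbitrary method names, so a real PATCH can be issued without HttpURLConnection at all. A sketch contrasting the two approaches; whether a server honors `X-HTTP-Method-Override` is server-specific, and per this ticket it appears not to work against the Flink REST service (the URI below is a placeholder):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class PatchWorkarounds {
    // HttpURLConnection rejects PATCH, but java.net.http.HttpClient (Java 11+)
    // lets you set any method name, so a genuine PATCH request can be built:
    static HttpRequest realPatch(URI uri) {
        return HttpRequest.newBuilder(uri)
                .method("PATCH", HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // The override-header workaround: a POST asking the server to treat it as
    // PATCH. The server must opt in to this convention; per this ticket it
    // does NOT appear to be honored by the Flink REST service.
    static HttpRequest overridePost(URI uri) {
        return HttpRequest.newBuilder(uri)
                .header("X-HTTP-Method-Override", "PATCH")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://localhost:8081/jobs/placeholder-job-id?mode=cancel");
        System.out.println(realPatch(uri).method());     // the actual HTTP method sent
        System.out.println(overridePost(uri).method());  // still POST on the wire
    }
}
```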
[jira] [Updated] (FLINK-37747) GlobalCommitterOperator cannot commit after scaling writer/committer
[ https://issues.apache.org/jira/browse/FLINK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David updated FLINK-37747:
--------------------------
    Priority: Blocker  (was: Major)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-37747) GlobalCommitterOperator cannot commit after scaling writer/committer
David created FLINK-37747:
------------------------------

             Summary: GlobalCommitterOperator cannot commit after scaling writer/committer
                 Key: FLINK-37747
                 URL: https://issues.apache.org/jira/browse/FLINK-37747
             Project: Flink
          Issue Type: Bug
            Reporter: David
[jira] [Updated] (FLINK-37747) GlobalCommitterOperator cannot commit after scaling writer/committer
[ https://issues.apache.org/jira/browse/FLINK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David updated FLINK-37747:
--------------------------
    Description: 
Hey,

Our Flink job frequently stopped writing into a Delta table with the Flink Delta connector. After investigating, we found that the [commit|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/GlobalCommitterOperator.java#L207] function in GlobalCommitterOperator returns early when it checks whether a checkpoint has finished (this [code|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/GlobalCommitterOperator.java#L211]).

The issue happens when:
 * the auto-scaler scales up the chained writer/committer (the direct upstream operator of GlobalCommitterOperator)
 * the job first ran with limited TaskManagers and a lower parallelism for the writer/committer, and the writer/committer was later scaled up to a higher parallelism

After debugging with more logs, we found the cause of the issue. An example:
 * For checkpoint 3, the Flink job completed successfully with 3 parallel writer/committer tasks.
 ** All committable objects in the writer/committer were saved into checkpoint state in checkpoint 3.
 * The writer/committer was scaled up to 5 parallel tasks.
 * The writer/committer tasks restore state from checkpoint 3 and emit the committable objects from checkpoint 3 (code is [here|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/CommitterOperator.java#L139]).
 ** The latest parallelism of the writer/committer, which is 5, is used in the CommittableSummary (code is [here|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/streaming/runtime/operators/sink/CommitterOperator.java#L186]).
 * When GlobalCommitterOperator receives a committable summary from the writer/committer, it learns:
 ** there are 5 parallel writer/committer tasks upstream
 ** so it waits for committable summaries from all 5 upstream writer/committer subtasks.
 * Only 3 writer/committer tasks emit a CommittableSummary to the global committer operator, because only 3 restored state from checkpoint 3.
 * The global committer operator is stuck forever, waiting for committable summaries from 5 upstream subtasks.

We have a quick fix for this case and raised a PR. We are using Flink 1.20, but the issue still exists in the master branch.
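The stall described above can be reproduced with a toy model (this is illustrative code, not Flink internals; all names are made up): the global committer trusts the subtask count announced in the incoming summaries, but after scale-up only the subtasks that restored state for the old checkpoint ever emit a summary for it.

```java
import java.util.HashSet;
import java.util.Set;

public class StuckGlobalCommitter {
    // Toy stand-in for the per-checkpoint bookkeeping: the global committer
    // only commits once it has seen a summary from every announced subtask.
    static boolean readyToCommit(int announcedSubtasks, Set<Integer> seenSubtasks) {
        return seenSubtasks.size() >= announcedSubtasks;
    }

    public static void main(String[] args) {
        // Checkpoint 3 state was written by 3 writer/committer subtasks, but
        // after scale-up the CommittableSummary announces a parallelism of 5.
        int announcedParallelism = 5;
        Set<Integer> seen = new HashSet<>();
        for (int subtask = 0; subtask < 3; subtask++) {
            seen.add(subtask); // only the 3 restored subtasks re-emit for cp 3
        }
        // The other 2 subtasks restored no state for checkpoint 3 and never
        // emit a summary for it, so the commit condition can never hold.
        System.out.println("ready = " + readyToCommit(announcedParallelism, seen));
    }
}
```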
[jira] [Commented] (FLINK-37747) GlobalCommitterOperator cannot commit after scaling writer/committer
[ https://issues.apache.org/jira/browse/FLINK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17949356#comment-17949356 ]

David commented on FLINK-37747:
-------------------------------

Hey [~arvid], thanks for reviewing the PR. I updated it accordingly. Do you think the PR is OK to merge? This issue also exists in the 1.20.1 release. Do you have a guideline for applying the fix to the 1.20.1 release? Thanks!
[jira] [Commented] (FLINK-37747) GlobalCommitterOperator cannot commit after scaling writer/committer
[ https://issues.apache.org/jira/browse/FLINK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948068#comment-17948068 ]

David commented on FLINK-37747:
-------------------------------

Thanks [~arvid]. We are using Flink 1.20.1. Could you give me an example of an IT case? I can try to write one and update the PR.
[jira] [Commented] (FLINK-37747) GlobalCommitterOperator cannot commit after scaling writer/committer
[ https://issues.apache.org/jira/browse/FLINK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950839#comment-17950839 ]

David commented on FLINK-37747:
-------------------------------

[~arvid] Yep, I am making PRs for 1.20 and 1.19.
[jira] [Commented] (FLINK-37747) GlobalCommitterOperator cannot commit after scaling writer/committer
[ https://issues.apache.org/jira/browse/FLINK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950842#comment-17950842 ]

David commented on FLINK-37747:
-------------------------------

Hey [~arvid], I made a PR to patch release-2.0: [https://github.com/apache/flink/pull/26546]

For release-1.20 and release-1.19, some extra commits are needed and I am not sure which commits should be included. Can you help with this? Thank you!
[jira] [Commented] (FLINK-37747) GlobalCommitterOperator cannot commit after scaling writer/committer
[ https://issues.apache.org/jira/browse/FLINK-37747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950147#comment-17950147 ]

David commented on FLINK-37747:
-------------------------------

Thanks [~arvid] for the review! Updated the PR :)