[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5
[ https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938381#comment-16938381 ] Zili Chen commented on FLINK-13417: --- Thanks for your information [~trohrmann]. FYI I open a testing [pull request|https://github.com/apache/flink/pull/9762] for shading and bumping ZK. I'll really appreciate it if you can help with early review. > Bump Zookeeper to 3.5.5 > --- > > Key: FLINK-13417 > URL: https://issues.apache.org/jira/browse/FLINK-13417 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Konstantin Knauf >Priority: Major > Fix For: 1.10.0 > > > User might want to secure their Zookeeper connection via SSL. > This requires a Zookeeper version >= 3.5.1. We might as well try to bump it > to 3.5.5, which is the latest version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14237) No need to rename shipped Flink jar
Zili Chen created FLINK-14237: - Summary: No need to rename shipped Flink jar Key: FLINK-14237 URL: https://issues.apache.org/jira/browse/FLINK-14237 Project: Flink Issue Type: Improvement Components: Deployment / YARN Reporter: Zili Chen Assignee: Zili Chen Fix For: 1.10.0 Currently, when we ship Flink jar configured by -yj, we always rename it as {{flink.jar}}. It seems a redundant operation since we can always use the exact name of the real jar. It also cause some confusion to our users who should not be required to know the detail of Flink that they configure a specific Flink jar(said {{flink-private-version-suffix.jar}}) but cannot find it on YARN container, because it is now {{flink.jar}}. CC [~trohrmann] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14237) No need to rename shipped Flink jar
[ https://issues.apache.org/jira/browse/FLINK-14237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-14237: -- Description: Currently, when we ship Flink jar configured by -yj, we always rename it as {{flink.jar}}. It seems a redundant operation since we can always use the exact name of the real jar. It also causes some confusion to our users who should not be required to know about Flink internal implementation that they configure a specific Flink jar(said {{flink-private-version-suffix.jar}}) but cannot find it on YARN container, because it is now {{flink.jar}}. CC [~trohrmann] was: Currently, when we ship Flink jar configured by -yj, we always rename it as {{flink.jar}}. It seems a redundant operation since we can always use the exact name of the real jar. It also cause some confusion to our users who should not be required to know the detail of Flink that they configure a specific Flink jar(said {{flink-private-version-suffix.jar}}) but cannot find it on YARN container, because it is now {{flink.jar}}. CC [~trohrmann] > No need to rename shipped Flink jar > --- > > Key: FLINK-14237 > URL: https://issues.apache.org/jira/browse/FLINK-14237 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Fix For: 1.10.0 > > > Currently, when we ship Flink jar configured by -yj, we always rename it as > {{flink.jar}}. It seems a redundant operation since we can always use the > exact name of the real jar. It also causes some confusion to our users who > should not be required to know about Flink internal implementation that they > configure a specific Flink jar(said {{flink-private-version-suffix.jar}}) but > cannot find it on YARN container, because it is now {{flink.jar}}. > CC [~trohrmann] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13992) Refactor Optional parameter in InputGateWithMetrics#updateMetrics
[ https://issues.apache.org/jira/browse/FLINK-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939671#comment-16939671 ] Zili Chen commented on FLINK-13992: --- Thanks for your review [~azagrebin]! master via 3a49da62332d4cfaf0536a11fb7da8de11baa17e > Refactor Optional parameter in InputGateWithMetrics#updateMetrics > - > > Key: FLINK-13992 > URL: https://issues.apache.org/jira/browse/FLINK-13992 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.10.0 > > Time Spent: 20m > Remaining Estimate: 0h > > As consensus from community code style discussion, in > {{InputGateWithMetrics#updateMetrics}} we can refactor to reduce the usage of > Optional parameter. > cc [~azagrebin] > {code:java} > diff --git > a/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java > > b/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java > index 5d2cfd95c4..e548fbf02b 100644 > --- > a/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java > +++ > b/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java > @@ -24,6 +24,8 @@ import > org.apache.flink.runtime.io.network.partition.consumer.BufferOrEvent; > import org.apache.flink.runtime.io.network.partition.consumer.InputGate; > import org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup; > > +import javax.annotation.Nonnull; > + > import java.io.IOException; > import java.util.Optional; > import java.util.concurrent.CompletableFuture; > @@ -67,12 +69,12 @@ public class InputGateWithMetrics extends InputGate { > > @Override > public Optional getNext() throws IOException, > InterruptedException { > - return updateMetrics(inputGate.getNext()); > + return inputGate.getNext().map(this::updateMetrics); > } > > @Override > public Optional pollNext() throws IOException, > InterruptedException { > - return updateMetrics(inputGate.pollNext()); > + return inputGate.pollNext().map(this::updateMetrics); > } > > @Override > @@ -85,8 +87,8 @@ public class InputGateWithMetrics extends InputGate { > inputGate.close(); > } > > - private Optional updateMetrics(Optional > bufferOrEvent) { > - bufferOrEvent.ifPresent(b -> numBytesIn.inc(b.getSize())); > + private BufferOrEvent updateMetrics(@Nonnull BufferOrEvent > bufferOrEvent) { > + numBytesIn.inc(bufferOrEvent.getSize()); > return bufferOrEvent; > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-13992) Refactor Optional parameter in InputGateWithMetrics#updateMetrics
[ https://issues.apache.org/jira/browse/FLINK-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen closed FLINK-13992. - Resolution: Fixed > Refactor Optional parameter in InputGateWithMetrics#updateMetrics > - > > Key: FLINK-13992 > URL: https://issues.apache.org/jira/browse/FLINK-13992 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.10.0 > > Time Spent: 20m > Remaining Estimate: 0h > > As consensus from community code style discussion, in > {{InputGateWithMetrics#updateMetrics}} we can refactor to reduce the usage of > Optional parameter. > cc [~azagrebin] > {code:java} > diff --git > a/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java > > b/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java > index 5d2cfd95c4..e548fbf02b 100644 > --- > a/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java > +++ > b/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java > @@ -24,6 +24,8 @@ import > org.apache.flink.runtime.io.network.partition.consumer.BufferOrEvent; > import org.apache.flink.runtime.io.network.partition.consumer.InputGate; > import org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup; > > +import javax.annotation.Nonnull; > + > import java.io.IOException; > import java.util.Optional; > import java.util.concurrent.CompletableFuture; > @@ -67,12 +69,12 @@ public class InputGateWithMetrics extends InputGate { > > @Override > public Optional getNext() throws IOException, > InterruptedException { > - return updateMetrics(inputGate.getNext()); > + return inputGate.getNext().map(this::updateMetrics); > } > > @Override > public Optional pollNext() throws IOException, > InterruptedException { > - return updateMetrics(inputGate.pollNext()); > + return inputGate.pollNext().map(this::updateMetrics); > } > > @Override > @@ -85,8 +87,8 @@ public class InputGateWithMetrics extends InputGate { > inputGate.close(); > } > > - private Optional updateMetrics(Optional > bufferOrEvent) { > - bufferOrEvent.ifPresent(b -> numBytesIn.inc(b.getSize())); > + private BufferOrEvent updateMetrics(@Nonnull BufferOrEvent > bufferOrEvent) { > + numBytesIn.inc(bufferOrEvent.getSize()); > return bufferOrEvent; > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14252) Encapsulate Dispatcher services in DispatcherServices container
[ https://issues.apache.org/jira/browse/FLINK-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939711#comment-16939711 ] Zili Chen commented on FLINK-14252: --- Make sense to me. A more general issue related is that we can document the "pattern" how Flink wants to manage concepts of cluster components so that we don't loss context in the further and it can be fast referred to. If you approve I'm glad to draft an early version and start a discussion. The output could be a develop document in a proper location that can be looked up in our community. > Encapsulate Dispatcher services in DispatcherServices container > --- > > Key: FLINK-14252 > URL: https://issues.apache.org/jira/browse/FLINK-14252 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Minor > > The list of required dispatcher services has grown quite unwieldy. In order > to more easily pass them through the chain of factories, I suggest to create > a {{DispatcherServices}} container which contains all {{Dispatcher}} services. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14261) Add support for permanently fenced RpcEndpoints
[ https://issues.apache.org/jira/browse/FLINK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939723#comment-16939723 ] Zili Chen commented on FLINK-14261: --- Permanently Fenced RpcEndpoint could be a good intermediate abstraction. Since all Fenced RpcEndpoint participants leader election before(this is where the requirement of fencing tech comes from), once leader election always happens outside of RpcEndpoint we can even safely remove leader session id field because now message are fenced to different actors naturally. > Add support for permanently fenced RpcEndpoints > --- > > Key: FLINK-14261 > URL: https://issues.apache.org/jira/browse/FLINK-14261 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Major > Labels: pull-request-available > Fix For: 1.10.0 > > Time Spent: 10m > Remaining Estimate: 0h > > If leader election happens outside of a {{RpcEndpoint}} we no longer need to > set the fencing token after the endpoint has been started. This can simplify > the internal functioning of the endpoint. In order to support this feature, I > propose to add a {{PermanentlyFencedRpcEndpoint}} which is started with a > fixed fencing token. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14263) Extract static method from ClusterClient to ClientUtils
Zili Chen created FLINK-14263: - Summary: Extract static method from ClusterClient to ClientUtils Key: FLINK-14263 URL: https://issues.apache.org/jira/browse/FLINK-14263 Project: Flink Issue Type: Sub-task Components: Client / Job Submission Reporter: Zili Chen Assignee: Zili Chen Fix For: 1.10.0 Including - {{getOptimizedPlan(Optimizer, PackagedProgram, int)}} - {{getOptimizedPlan(Optimizer, Plan, int)}} - {{getJobGraph(Configuration, FlinkPlan, List, List, SavepointRestoreSettings)}} It is towards an interface-ized {{ClusterClient}}. Although interface can technically have static method, they are actually out of its scope and {{getOptimizedPlan}} is somehow for legacy usage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-14130) Remove ClusterClient.run() methods
[ https://issues.apache.org/jira/browse/FLINK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-14130: - Assignee: Zili Chen > Remove ClusterClient.run() methods > -- > > Key: FLINK-14130 > URL: https://issues.apache.org/jira/browse/FLINK-14130 > Project: Flink > Issue Type: Sub-task > Components: Client / Job Submission >Reporter: Aljoscha Krettek >Assignee: Zili Chen >Priority: Major > > {{ClusterClient}} is an internal interface of the {{flink-clients}} package. > It should only be concerned with submitting {{JobGraphs}} to a cluster, which > is what {{submitJob()}} does. > The {{run()}} methods are concerned with unpacking programs or job-with-jars > and at the end use {{submitJob()}} in some way, they should reside in some > other component. The only valid remaining run method is {{run(PackagedProgram > prog, int parallelism)}}, this could be in {{PackagedProgramUtils}}. The > other {{run()}} methods are actually only used in one test: > {{ClientTest.shouldSubmitToJobClient()}}. I don't think that test is valid > anymore, it evolved for a very long time and now doesn't test what it was > supposed to test once. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14130) Remove ClusterClient.run() methods
[ https://issues.apache.org/jira/browse/FLINK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939842#comment-16939842 ] Zili Chen commented on FLINK-14130: --- Base on FLINK-14263 it is obviously how we can exclude {{#run}} s from {{ClusterClient}}. I will assign the issue to me and submit a pull request. Feel free to review it and see if it fits the idea in your minds. I'm also glad to collaborate with your implementation if there is one. > Remove ClusterClient.run() methods > -- > > Key: FLINK-14130 > URL: https://issues.apache.org/jira/browse/FLINK-14130 > Project: Flink > Issue Type: Sub-task > Components: Client / Job Submission >Reporter: Aljoscha Krettek >Priority: Major > > {{ClusterClient}} is an internal interface of the {{flink-clients}} package. > It should only be concerned with submitting {{JobGraphs}} to a cluster, which > is what {{submitJob()}} does. > The {{run()}} methods are concerned with unpacking programs or job-with-jars > and at the end use {{submitJob()}} in some way, they should reside in some > other component. The only valid remaining run method is {{run(PackagedProgram > prog, int parallelism)}}, this could be in {{PackagedProgramUtils}}. The > other {{run()}} methods are actually only used in one test: > {{ClientTest.shouldSubmitToJobClient()}}. I don't think that test is valid > anymore, it evolved for a very long time and now doesn't test what it was > supposed to test once. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14250) Introduce separate "bin/flink runpy" and remove python support from "bin/flink run"
[ https://issues.apache.org/jira/browse/FLINK-14250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939847#comment-16939847 ] Zili Chen commented on FLINK-14250: --- Vote for this. I second that {{if (!programOptions.isPython())}} is part of headaches when I'm going throw {{CliFrontend}} code >_< > Introduce separate "bin/flink runpy" and remove python support from > "bin/flink run" > --- > > Key: FLINK-14250 > URL: https://issues.apache.org/jira/browse/FLINK-14250 > Project: Flink > Issue Type: Sub-task > Components: Client / Job Submission >Reporter: Aljoscha Krettek >Priority: Major > > Currently, "bin/flink run" supports both Java and Python programs and there > is quite some complexity in command line parsing and validation and the code > is spread across different classes. > I think if we had a separate "bin/flink runpy" then we could simplify the > parsing quite a bit and the usage help of each command would only show those > options that are relevant for the given use case. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14250) Introduce separate "bin/flink runpy" and remove python support from "bin/flink run"
[ https://issues.apache.org/jira/browse/FLINK-14250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939848#comment-16939848 ] Zili Chen commented on FLINK-14250: --- Actually I am thinking of making {{CliFrontend}} the unique CLI frontend. This is based on the fact that we have a dedicated {{FlinkYarnSessionCli}} and with the abstraction of {{CustomCommandLine}} things grow a lot on complexity... If only we can have a unique entrypoint like {{SparkSubmit}} in Spark. > Introduce separate "bin/flink runpy" and remove python support from > "bin/flink run" > --- > > Key: FLINK-14250 > URL: https://issues.apache.org/jira/browse/FLINK-14250 > Project: Flink > Issue Type: Sub-task > Components: Client / Job Submission >Reporter: Aljoscha Krettek >Priority: Major > > Currently, "bin/flink run" supports both Java and Python programs and there > is quite some complexity in command line parsing and validation and the code > is spread across different classes. > I think if we had a separate "bin/flink runpy" then we could simplify the > parsing quite a bit and the usage help of each command would only show those > options that are relevant for the given use case. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-14250) Introduce separate "bin/flink runpy" and remove python support from "bin/flink run"
[ https://issues.apache.org/jira/browse/FLINK-14250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939848#comment-16939848 ] Zili Chen edited comment on FLINK-14250 at 9/28/19 4:49 AM: Actually I am thinking of making {{CliFrontend}} the unique CLI frontend. This is based on the fact that we have a dedicated {{FlinkYarnSessionCli}} and with the abstraction of {{CustomCommandLine}} things grow a lot on complexity... If only we can have a unique entrypoint like {{SparkSubmit}} in Spark. FYI {{CustomCommandLine}} actually take some responsibility of configuration, while some of ongoing effort try to make {{Configuration}} the unique interface of configuration. was (Author: tison): Actually I am thinking of making {{CliFrontend}} the unique CLI frontend. This is based on the fact that we have a dedicated {{FlinkYarnSessionCli}} and with the abstraction of {{CustomCommandLine}} things grow a lot on complexity... If only we can have a unique entrypoint like {{SparkSubmit}} in Spark. > Introduce separate "bin/flink runpy" and remove python support from > "bin/flink run" > --- > > Key: FLINK-14250 > URL: https://issues.apache.org/jira/browse/FLINK-14250 > Project: Flink > Issue Type: Sub-task > Components: Client / Job Submission >Reporter: Aljoscha Krettek >Priority: Major > > Currently, "bin/flink run" supports both Java and Python programs and there > is quite some complexity in command line parsing and validation and the code > is spread across different classes. > I think if we had a separate "bin/flink runpy" then we could simplify the > parsing quite a bit and the usage help of each command would only show those > options that are relevant for the given use case. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-14250) Introduce separate "bin/flink runpy" and remove python support from "bin/flink run"
[ https://issues.apache.org/jira/browse/FLINK-14250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939848#comment-16939848 ] Zili Chen edited comment on FLINK-14250 at 9/28/19 4:50 AM: Actually I am thinking of making {{CliFrontend}} the unique CLI frontend. This is based on the fact that we have a dedicated {{FlinkYarnSessionCli}} and with the abstraction of {{CustomCommandLine}} things grow a lot on complexity... If only we can have a unique entrypoint like {{SparkSubmit}} in Spark. FYI {{CustomCommandLine}} also takes some responsibilities of configuration, while some of ongoing effort try to make {{Configuration}} the unique interface of configuration. was (Author: tison): Actually I am thinking of making {{CliFrontend}} the unique CLI frontend. This is based on the fact that we have a dedicated {{FlinkYarnSessionCli}} and with the abstraction of {{CustomCommandLine}} things grow a lot on complexity... If only we can have a unique entrypoint like {{SparkSubmit}} in Spark. FYI {{CustomCommandLine}} actually take some responsibility of configuration, while some of ongoing effort try to make {{Configuration}} the unique interface of configuration. > Introduce separate "bin/flink runpy" and remove python support from > "bin/flink run" > --- > > Key: FLINK-14250 > URL: https://issues.apache.org/jira/browse/FLINK-14250 > Project: Flink > Issue Type: Sub-task > Components: Client / Job Submission >Reporter: Aljoscha Krettek >Priority: Major > > Currently, "bin/flink run" supports both Java and Python programs and there > is quite some complexity in command line parsing and validation and the code > is spread across different classes. > I think if we had a separate "bin/flink runpy" then we could simplify the > parsing quite a bit and the usage help of each command would only show those > options that are relevant for the given use case. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14050) Refactor YarnClusterDescriptor inheritance
[ https://issues.apache.org/jira/browse/FLINK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-14050: -- Description: Currently, the inheritance looks like AbstractYarnClusterDescriptor -> YarnClusterDescriptor -> TestingYarnClusterDescriptor -> NonDeployingYarnClusterDescriptor -> -> NonDeployingDetachedYarnClusterDescriptor With an investigation, I find 1. AbstractYarnClusterDescriptor is introduced for migration purpose and no need any more. 2. TestingYarnClusterDescriptor is redundant and can be replaced directly with YarnClusterDescriptor. 3. NonDeployingDetachedYarnClusterDescriptor is no need. We can just use NonDeployingYarnClusterDescriptor. 4. Some methods like #createYarnClusterClient have parameters that never used, which are for historical reasons. Thus, I propose we refactor YarnClusterDescriptor inheritance YarnClusterDescriptor -> NonDeployingYarnClusterDescriptor and also methods remove unused parameters. CC [~kkl0u] [~aljoscha] [~till.rohrmann] was: Currently, the inheritance looks like {{AbstractYarnClusterDescriptor}} -> {{YarnClusterDescriptor}} -> {{TestingYarnClusterDescriptor}} -> {{NonDeployingYarnClusterDescriptor}} ->-> {{NonDeployingDetachedYarnClusterDescriptor}} With an investigation, I find 1. {{AbstractYarnClusterDescriptor}} is introduced for migration purpose and no need any more. 2. {{TestingYarnClusterDescriptor}} is redundant and can be replaced directly with {{YarnClusterDescriptor}}. 3. Some methods like {{#createYarnClusterClient}} have parameters that never used, which are for historical reasons. Thus, I propose we refactor {{YarnClusterDescriptor}} inheritance {{YarnClusterDescriptor}} -> {{NonDeployingYarnClusterDescriptor}} ->-> {{NonDeployingDetachedYarnClusterDescriptor}} and also methods remove unused parameters. CC [~kkl0u] [~aljoscha] [~till.rohrmann] > Refactor YarnClusterDescriptor inheritance > -- > > Key: FLINK-14050 > URL: https://issues.apache.org/jira/browse/FLINK-14050 > Project: Flink > Issue Type: Sub-task > Components: Client / Job Submission, Command Line Client >Affects Versions: 1.10.0 >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.10.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, the inheritance looks like > AbstractYarnClusterDescriptor > -> YarnClusterDescriptor > -> TestingYarnClusterDescriptor > -> NonDeployingYarnClusterDescriptor > -> -> NonDeployingDetachedYarnClusterDescriptor > With an investigation, I find > 1. AbstractYarnClusterDescriptor is introduced for migration purpose and no > need any more. > 2. TestingYarnClusterDescriptor is redundant and can be replaced directly > with YarnClusterDescriptor. > 3. NonDeployingDetachedYarnClusterDescriptor is no need. We can just use > NonDeployingYarnClusterDescriptor. > 4. Some methods like #createYarnClusterClient have parameters that never > used, which are for historical reasons. > Thus, I propose we refactor YarnClusterDescriptor inheritance > YarnClusterDescriptor > -> NonDeployingYarnClusterDescriptor > and also methods remove unused parameters. > CC [~kkl0u] [~aljoscha] [~till.rohrmann] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-14179) Wrong description of SqlCommand.SHOW_FUNCTIONS
[ https://issues.apache.org/jira/browse/FLINK-14179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-14179: - Assignee: Canbin Zheng > Wrong description of SqlCommand.SHOW_FUNCTIONS > -- > > Key: FLINK-14179 > URL: https://issues.apache.org/jira/browse/FLINK-14179 > Project: Flink > Issue Type: Bug > Components: Table SQL / Client >Affects Versions: 1.9.0 >Reporter: Canbin Zheng >Assignee: Canbin Zheng >Priority: Major > Labels: pull-request-available > Fix For: 1.10.0 > > Attachments: image-2019-09-24-10-59-26-286.png > > Time Spent: 10m > Remaining Estimate: 0h > > Currently '*SHOW FUNCTIONS*' lists not only user-defined functions, but also > system-defined ones, the description {color:#172b4d}*'Shows all registered > user-defined functions.'* not correctly depicts this functionality. I think > we can change the description to '*Shows all system-defined and user-defined > functions.*'{color} > > {color:#172b4d}!image-2019-09-24-10-59-26-286.png!{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14179) Wrong description of SqlCommand.SHOW_FUNCTIONS
[ https://issues.apache.org/jira/browse/FLINK-14179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940254#comment-16940254 ] Zili Chen commented on FLINK-14179: --- Hi [~felixzheng] I've assigned the issue to you. Will give it a review now. > Wrong description of SqlCommand.SHOW_FUNCTIONS > -- > > Key: FLINK-14179 > URL: https://issues.apache.org/jira/browse/FLINK-14179 > Project: Flink > Issue Type: Bug > Components: Table SQL / Client >Affects Versions: 1.9.0 >Reporter: Canbin Zheng >Assignee: Canbin Zheng >Priority: Major > Labels: pull-request-available > Fix For: 1.10.0 > > Attachments: image-2019-09-24-10-59-26-286.png > > Time Spent: 10m > Remaining Estimate: 0h > > Currently '*SHOW FUNCTIONS*' lists not only user-defined functions, but also > system-defined ones, the description {color:#172b4d}*'Shows all registered > user-defined functions.'* not correctly depicts this functionality. I think > we can change the description to '*Shows all system-defined and user-defined > functions.*'{color} > > {color:#172b4d}!image-2019-09-24-10-59-26-286.png!{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-13827) shell variable should be escaped in start-scala-shell.sh
[ https://issues.apache.org/jira/browse/FLINK-13827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-13827: - Assignee: Zili Chen > shell variable should be escaped in start-scala-shell.sh > > > Key: FLINK-13827 > URL: https://issues.apache.org/jira/browse/FLINK-13827 > Project: Flink > Issue Type: Bug > Components: Scala Shell >Affects Versions: 1.9.0, 1.10.0 >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Fix For: 1.10.0, 1.9.2 > > > {code:java} > diff --git a/flink-scala-shell/start-script/start-scala-shell.sh > b/flink-scala-shell/start-script/start-scala-shell.sh > index b6da81af72..65b9045584 100644 > --- a/flink-scala-shell/start-script/start-scala-shell.sh > +++ b/flink-scala-shell/start-script/start-scala-shell.sh > @@ -97,9 +97,9 @@ log_setting="-Dlog.file="$LOG" > -Dlog4j.configuration=file:"$FLINK_CONF_DIR"/$LOG > > if ${EXTERNAL_LIB_FOUND} > then > -$JAVA_RUN -Dscala.color -cp "$FLINK_CLASSPATH" $log_setting > org.apache.flink.api.scala.FlinkShell $@ --addclasspath "$EXT_CLASSPATH" > +$JAVA_RUN -Dscala.color -cp "$FLINK_CLASSPATH" "$log_setting" > org.apache.flink.api.scala.FlinkShell $@ --addclasspath "$EXT_CLASSPATH" > else > -$JAVA_RUN -Dscala.color -cp "$FLINK_CLASSPATH" $log_setting > org.apache.flink.api.scala.FlinkShell $@ > +$JAVA_RUN -Dscala.color -cp "$FLINK_CLASSPATH" "$log_setting" > org.apache.flink.api.scala.FlinkShell $@ > fi > > #restore echo > {code} > otherwise it is error prone when {{$log_setting}} contain arbitrary content. > For example, if the parent dir contain whitespace, said {{flink-1.9.0 2}}, > then {{bin/start-scala-shell.sh local}} will fail with > {{Error: Could not find or load main class > 2.log.flink\-\*\-scala\-shell\-local\-\*.log}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14282) Simplify DispatcherResourceManagerComponent
[ https://issues.apache.org/jira/browse/FLINK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941459#comment-16941459 ] Zili Chen commented on FLINK-14282: --- Nice to have! The difference between {{JobCluster}} and {{SessionCluster}} is better to exist in {{ClutserEntrypoint}} level, not in {{Dispatcher}} level. > Simplify DispatcherResourceManagerComponent > --- > > Key: FLINK-14282 > URL: https://issues.apache.org/jira/browse/FLINK-14282 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Minor > Labels: pull-request-available > Fix For: 1.10.0 > > Time Spent: 10m > Remaining Estimate: 0h > > With the completion of the FLINK-14281 it is now possible to encapsulate the > shutdown logic of the {{MiniDispatcher}} within the {{DispatcherRunner}}. > Consequently, it is no longer necessary to have separate > {{DispatcherResourceManagerComponent}} implementations. I suggest to remove > the special case implementations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14282) Simplify DispatcherResourceManagerComponent
[ https://issues.apache.org/jira/browse/FLINK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941460#comment-16941460 ] Zili Chen commented on FLINK-14282: --- Well I misunderstand so we still have multiple {{Dispatcher}} s. But this clean up is still valid :P > Simplify DispatcherResourceManagerComponent > --- > > Key: FLINK-14282 > URL: https://issues.apache.org/jira/browse/FLINK-14282 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Minor > Labels: pull-request-available > Fix For: 1.10.0 > > Time Spent: 10m > Remaining Estimate: 0h > > With the completion of the FLINK-14281 it is now possible to encapsulate the > shutdown logic of the {{MiniDispatcher}} within the {{DispatcherRunner}}. > Consequently, it is no longer necessary to have separate > {{DispatcherResourceManagerComponent}} implementations. I suggest to remove > the special case implementations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5
[ https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942545#comment-16942545 ] Zili Chen commented on FLINK-13417: --- For your information, I created {{flink-shaded-zookeeper}} under {{flink-parent}} and pass all our test suites in this [pull request|https://github.com/apache/flink/pull/9762] without modification outside maven {{pom.xml}}. I'm going to push a branch against master repo for running e2e tests and dig out deeper. > Bump Zookeeper to 3.5.5 > --- > > Key: FLINK-13417 > URL: https://issues.apache.org/jira/browse/FLINK-13417 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Konstantin Knauf >Priority: Major > Fix For: 1.10.0 > > > User might want to secure their Zookeeper connection via SSL. > This requires a Zookeeper version >= 3.5.1. We might as well try to bump it > to 3.5.5, which is the latest version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13827) shell variable should be escaped in start-scala-shell.sh
[ https://issues.apache.org/jira/browse/FLINK-13827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942754#comment-16942754 ] Zili Chen commented on FLINK-13827: --- master via 865cc4c7a39f7aa610a02cc4a0f41424edcd6279 > shell variable should be escaped in start-scala-shell.sh > > > Key: FLINK-13827 > URL: https://issues.apache.org/jira/browse/FLINK-13827 > Project: Flink > Issue Type: Bug > Components: Scala Shell >Affects Versions: 1.9.0, 1.10.0 >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.10.0, 1.9.2 > > Time Spent: 20m > Remaining Estimate: 0h > > {code:java} > diff --git a/flink-scala-shell/start-script/start-scala-shell.sh > b/flink-scala-shell/start-script/start-scala-shell.sh > index b6da81af72..65b9045584 100644 > --- a/flink-scala-shell/start-script/start-scala-shell.sh > +++ b/flink-scala-shell/start-script/start-scala-shell.sh > @@ -97,9 +97,9 @@ log_setting="-Dlog.file="$LOG" > -Dlog4j.configuration=file:"$FLINK_CONF_DIR"/$LOG > > if ${EXTERNAL_LIB_FOUND} > then > -$JAVA_RUN -Dscala.color -cp "$FLINK_CLASSPATH" $log_setting > org.apache.flink.api.scala.FlinkShell $@ --addclasspath "$EXT_CLASSPATH" > +$JAVA_RUN -Dscala.color -cp "$FLINK_CLASSPATH" "$log_setting" > org.apache.flink.api.scala.FlinkShell $@ --addclasspath "$EXT_CLASSPATH" > else > -$JAVA_RUN -Dscala.color -cp "$FLINK_CLASSPATH" $log_setting > org.apache.flink.api.scala.FlinkShell $@ > +$JAVA_RUN -Dscala.color -cp "$FLINK_CLASSPATH" "$log_setting" > org.apache.flink.api.scala.FlinkShell $@ > fi > > #restore echo > {code} > otherwise it is error prone when {{$log_setting}} contain arbitrary content. > For example, if the parent dir contain whitespace, said {{flink-1.9.0 2}}, > then {{bin/start-scala-shell.sh local}} will fail with > {{Error: Could not find or load main class > 2.log.flink\-\*\-scala\-shell\-local\-\*.log}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5
[ https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942853#comment-16942853 ] Zili Chen commented on FLINK-13417: --- Run against master repo but it seems not trigger corn job. Could you help trigger corn job on zk-3,5-shaded branch [~trohrmann]? Although we should take several discussion about whether or not and if so, how to migrate from zk 3.4 to zk 3.5(since it enforces our users to bump their zk server version), we can concurrently verify that if there is any technical broker to do so. > Bump Zookeeper to 3.5.5 > --- > > Key: FLINK-13417 > URL: https://issues.apache.org/jira/browse/FLINK-13417 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Konstantin Knauf >Priority: Major > Fix For: 1.10.0 > > > User might want to secure their Zookeeper connection via SSL. > This requires a Zookeeper version >= 3.5.1. We might as well try to bump it > to 3.5.5, which is the latest version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-13417) Bump Zookeeper to 3.5.5
[ https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942853#comment-16942853 ] Zili Chen edited comment on FLINK-13417 at 10/2/19 2:20 PM: Run against master repo but it seems not trigger corn job. Could you help trigger corn job on zk-3,5-shaded branch [~trohrmann]? Although we should start a discussion about whether or not and if so, how to migrate from zk 3.4 to zk 3.5(since it enforces our users to bump their zk server version), we can concurrently verify that if there is any technical broker to do so. was (Author: tison): Run against master repo but it seems not trigger corn job. Could you help trigger corn job on zk-3,5-shaded branch [~trohrmann]? Although we should take several discussion about whether or not and if so, how to migrate from zk 3.4 to zk 3.5(since it enforces our users to bump their zk server version), we can concurrently verify that if there is any technical broker to do so. > Bump Zookeeper to 3.5.5 > --- > > Key: FLINK-13417 > URL: https://issues.apache.org/jira/browse/FLINK-13417 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Konstantin Knauf >Priority: Major > Fix For: 1.10.0 > > > User might want to secure their Zookeeper connection via SSL. > This requires a Zookeeper version >= 3.5.1. We might as well try to bump it > to 3.5.5, which is the latest version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14315) NPE with JobMaster.disconnectTaskManager
[ https://issues.apache.org/jira/browse/FLINK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942975#comment-16942975 ] Zili Chen commented on FLINK-14315: --- It looks like {{JobMaster#suspend}} races with {{JobMaster#onStop}} which {{JobMaster#suspend}} set {{taskManagerHeartbeatManager = null}} while {{JobMaster#onStop}} calls {{JobMaster#disconnectTaskManager}} calls {{taskManagerHeartbeatManager.unmonitorTarget(resourceID)}}. Another perspective is we can properly tolerate zk unstable(connection loss exception) as discussed in FLINK-10052 (it doesn't solve the problem here but make it quite more rare) CC [~trohrmann] > NPE with JobMaster.disconnectTaskManager > > > Key: FLINK-14315 > URL: https://issues.apache.org/jira/browse/FLINK-14315 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.9.0 >Reporter: Steven Zhen Wu >Priority: Major > > There was some connection issue with zookeeper that caused the job to > restart. But shutdown failed with this fatal NPE, which seems to cause JVM > to exit > {code} > 2019-10-02 16:16:19,134 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable > to read additional data from server sessionid 0x16d83374c4206f8, likely > server has clo > sed socket, closing socket connection and attempting reconnect > 2019-10-02 16:16:19,234 INFO > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager > - State change: SUSPENDED > 2019-10-02 16:16:19,235 WARN > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Connection to ZooKeeper suspended. Can no longer retrieve the leader from > ZooKeeper. > 2019-10-02 16:16:19,235 WARN > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Connection to ZooKeeper suspended. Can no longer retrieve the leader from > ZooKeeper. > 2019-10-02 16:16:19,235 WARN > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Connection to ZooKeeper suspended. The contender > akka.tcp://flink@100.122.177.82:42043/u > ser/dispatcher no longer participates in the leader election. > 2019-10-02 16:16:19,237 INFO > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- > http://100.122.177.82:8081 lost leadership > 2019-10-02 16:16:19,237 INFO > com.netflix.spaas.runtime.resourcemanager.TitusResourceManager - > ResourceManager akka.tcp://flink@100.122.177.82:42043/user/resourcemanager > was revoked leadershi > p. Clearing fencing token. > 2019-10-02 16:16:19,237 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Stopping ZooKeeperLeaderRetrievalService > /leader/e4e68f2b3fc40c7008cca624b2a2bab0/job_ > manager_lock. > 2019-10-02 16:16:19,237 WARN > org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - > ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not > monitored (tem > porarily). > 2019-10-02 16:16:19,238 INFO > org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManager > for job ksrouter (e4e68f2b3fc40c7008cca624b2a2bab0) was revoked leadership at > akka.tcp: > //flink@100.122.177.82:42043/user/jobmanager_0. > 2019-10-02 16:16:19,239 INFO > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock. > 2019-10-02 16:16:19,239 WARN > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Connection to ZooKeeper suspended. The contender http://100.122.177.82:8081 > no longer pa > rticipates in the leader election. > 2019-10-02 16:16:19,239 WARN > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Connection to ZooKeeper suspended. Can no longer retrieve the leader from > ZooKeeper. > 2019-10-02 16:16:19,239 WARN > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - > Connection to ZooKeeper suspended. The contender > akka.tcp://flink@100.122.177.82:42043/u > ser/jobmanager_0 no longer participates in the leader election. > 2019-10-02 16:16:19,239 WARN > org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - > Connection to ZooKeeper suspended. Can no longer retrieve the leader from > ZooKeeper. > 2019-10-02 16:16:19,239 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph- Job ksrouter > (e4e68f2b3fc40c7008cca624b2a2bab0) switched from state RUNNING to SUSPENDED. > org.apache.flink.util.FlinkException: JobManager is no longer the leader. > at > org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeJobMasterLeadership(JobManagerRunner.java:391) > at > org.apache.flink.runtime.jobmaster.JobM
[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5
[ https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943594#comment-16943594 ] Zili Chen commented on FLINK-13417: --- Hello Stephan, the situation is actually flink-runtime uses zookeeper(flink-mesos used a wrapper version of curator's shared value which satisfy use through a utility way) so thay I think the way you proposed is worth to try. However, there's one prerequisite of this way. That's we need to bump curator-test to 4.x(related FLINK-10052) because curator-test 2.x uses zk 3.4 for starting server so that create container message from zk 2.5 client cannot be identified. > Bump Zookeeper to 3.5.5 > --- > > Key: FLINK-13417 > URL: https://issues.apache.org/jira/browse/FLINK-13417 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Konstantin Knauf >Priority: Major > Fix For: 1.10.0 > > > User might want to secure their Zookeeper connection via SSL. > This requires a Zookeeper version >= 3.5.1. We might as well try to bump it > to 3.5.5, which is the latest version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5
[ https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944254#comment-16944254 ] Zili Chen commented on FLINK-13417: --- Another option to avoid dependency to bump curator is excluding zk from curator dependencies as done in pr of FLINK-10052. I am starting a new branch to verify it. > Bump Zookeeper to 3.5.5 > --- > > Key: FLINK-13417 > URL: https://issues.apache.org/jira/browse/FLINK-13417 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Konstantin Knauf >Priority: Major > Fix For: 1.10.0 > > > User might want to secure their Zookeeper connection via SSL. > This requires a Zookeeper version >= 3.5.1. We might as well try to bump it > to 3.5.5, which is the latest version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14237) No need to rename shipped Flink jar
[ https://issues.apache.org/jira/browse/FLINK-14237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946565#comment-16946565 ] Zili Chen commented on FLINK-14237: --- Hi [~trohrmann], {{flink.jar}} is shipped as flink jar on both JM and TM. All usage just happens to be hardcoded. On TM part, we actually configure the name via {{YarnConfigKeys.FLINK_JAR_PATH}}. On JM part, we happens to hardcode the name as {{flink.jar}} and add it to JM classpath. A nit pull request can let JM respect the real name of flink uber jar. > No need to rename shipped Flink jar > --- > > Key: FLINK-14237 > URL: https://issues.apache.org/jira/browse/FLINK-14237 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Fix For: 1.10.0 > > > Currently, when we ship Flink jar configured by -yj, we always rename it as > {{flink.jar}}. It seems a redundant operation since we can always use the > exact name of the real jar. It also causes some confusion to our users who > should not be required to know about Flink internal implementation that they > configure a specific Flink jar(said {{flink-private-version-suffix.jar}}) but > cannot find it on YARN container, because it is now {{flink.jar}}. > CC [~trohrmann] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14343) Remove uncompleted YARNHighAvailabilityService
Zili Chen created FLINK-14343: - Summary: Remove uncompleted YARNHighAvailabilityService Key: FLINK-14343 URL: https://issues.apache.org/jira/browse/FLINK-14343 Project: Flink Issue Type: Task Components: Runtime / Coordination Reporter: Zili Chen Assignee: Zili Chen Fix For: 1.10.0 Corresponding mailing list [thread|https://lists.apache.org/x/thread.html/6022f2124be91e3f4667d61a977ea0639e2c19286560d6d1cb874792@%3Cdev.flink.apache.org%3E]. Noticed that there are several stale & uncompleted high-availability services implementations, I start this thread in order to see whether or not we can remove them for a clean codebase. Below are all of classes I noticed. - YarnHighAvailabilityServices - AbstractYarnNonHaServices - YarnIntraNonHaMasterServices - YarnPreConfiguredMasterNonHaServices - SingleLeaderElectionService - FsNegativeRunningJobsRegistry (as well as their dedicated tests) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-11632) Make TaskManager automatic bind address picking more explicit (by default) and more configurable
[ https://issues.apache.org/jira/browse/FLINK-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946664#comment-16946664 ] Zili Chen commented on FLINK-11632: --- Hi [~trohrmann] & [~1u0] I'll appreciate it if you can provide some background here. I'm curious in which case we'd like to actually turn to the heuristics mechanism? Can we always try {{InetAddress.getLocalHost}} and fail if we cannot access via localhost? It would be helpful to provide a real-world example while the heuristics mechanism works. The original concern I arrive here is that such a heuristics mechanism enforces a dependency from rpc service in {{TaskManagerRunner}} to ha service in {{TaskManagerRunner}} so that we can get RM address as a target address. It could be an alternative perspective of my concern that if only we can get a target address without depend on ha service. > Make TaskManager automatic bind address picking more explicit (by default) > and more configurable > > > Key: FLINK-11632 > URL: https://issues.apache.org/jira/browse/FLINK-11632 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination, Runtime / Network >Reporter: Alex >Assignee: Alex >Priority: Minor > Labels: pull-request-available > Fix For: 1.8.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Currently, there is an optional {{taskmanager.host}} configuration option in > {{flink-conf.yaml}} that allows users of Flink to "statically" pre-define > what should be a bind address for TaskManager to listen on (note: it's also > possible to override this option by passing corresponding command line option > to Flink). > In case when the option is not set, TaskManager would try [heuristically pick > up a bind > address|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L421-L442]. > The resulting address (hostname) is used to advertise different service > endpoints (running in TM) to the JobManager. Also it would be resolved to an > {{[InetAddress|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L359]}} > later that used as binding address for TMs inner node communication. > This proposal is to minimize usage of heuristics (by default) by introducing > a new configuration option (for example, {{taskmanager.host.bind-policy}}) > with possible values: > * {{"hostname"}} - default, use TM's host's name ({{== > InetAddress.getLocalHost().getHostName()}}; > * {{"ip"}} - use TM's host's ip address ({{== > InetAddress.getLocalHost().getHostAddress()}}); > * {{"auto-detect-hostname"}} - use the heuristics based detection mechanism. > *Note:* the configuration key and values could be named better and open for > proposals. > *Note 2:* in the future, the configuration option _may_ require to be > extended to allow choosing some specific network interface, or preference of > ipv6 vs ipv4. > h3. Rationale > [The heuristics > mechanism|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/net/ConnectionUtils.java#L364-L475] > tries to establish a probe connection to {{jobmanager.rpc.address}} from > different network interface addresses. > In case of parallel setups (when JM and multiple TMs start simultaneously, > in parallel), this depends on timing, assigned network ip addresses and may > end up with "non-uniform" address bindings of TMs (some may be "lucky" to > pick up non default network interface, some would fallback to > {{InetAddress.getLocalHost().getHostName()}}. At the end, it's less obvious > and transparent which binding address a TM picks up. > In practice, it's possible that in majority of cases (in well setup > environments) the heuristics mechanism returns a result that matches > {{InetAddress.getLocalHost()}}. The proposal is to stick with this more > simpler and explicit binding (by default), avoiding non-determinism of > heuristics. > The old mechanism is kept available, in case if it is useful in some setups. > But would require explicit configuration setting. > Additionally, this proposal extends "auto configuration" option by allowing > users to choose the host's ip address (instead of hostname). This may be > convenient in situations where the TMs' machines are not necessary reachable > via DNS (for example in a Kubernetes setup). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-14118) Reduce the unnecessary flushing when there is no data available for flush
[ https://issues.apache.org/jira/browse/FLINK-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-14118: - Assignee: Yingjie Cao > Reduce the unnecessary flushing when there is no data available for flush > - > > Key: FLINK-14118 > URL: https://issues.apache.org/jira/browse/FLINK-14118 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network >Reporter: Yingjie Cao >Assignee: Yingjie Cao >Priority: Critical > Labels: pull-request-available > Fix For: 1.10.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The new flush implementation which works by triggering a netty user event may > cause performance regression compared to the old synchronization-based one. > More specifically, when there is exactly one BufferConsumer in the buffer > queue of subpartition and no new data will be added for a while in the future > (may because of just no input or the logic of the operator is to collect some > data for processing and will not emit records immediately), that is, there is > no data to send, the OutputFlusher will continuously notify data available > and wake up the netty thread, though no data will be returned by the > pollBuffer method. > For some of our production jobs, this will incur 20% to 40% CPU overhead > compared to the old implementation. We tried to fix the problem by checking > if there is new data available when flushing, if there is no new data, the > netty thread will not be notified. It works for our jobs and the cpu usage > falls to previous level. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-11632) Make TaskManager automatic bind address picking more explicit (by default) and more configurable
[ https://issues.apache.org/jira/browse/FLINK-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946965#comment-16946965 ] Zili Chen commented on FLINK-11632: --- Thanks for your information [~trohrmann]. I wonder whether we stick to RM address(or JM address, which is identical). Is there other fix address we can rely on as {{targetAddress}}? > Make TaskManager automatic bind address picking more explicit (by default) > and more configurable > > > Key: FLINK-11632 > URL: https://issues.apache.org/jira/browse/FLINK-11632 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination, Runtime / Network >Reporter: Alex >Assignee: Alex >Priority: Minor > Labels: pull-request-available > Fix For: 1.8.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Currently, there is an optional {{taskmanager.host}} configuration option in > {{flink-conf.yaml}} that allows users of Flink to "statically" pre-define > what should be a bind address for TaskManager to listen on (note: it's also > possible to override this option by passing corresponding command line option > to Flink). > In case when the option is not set, TaskManager would try [heuristically pick > up a bind > address|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L421-L442]. > The resulting address (hostname) is used to advertise different service > endpoints (running in TM) to the JobManager. Also it would be resolved to an > {{[InetAddress|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L359]}} > later that used as binding address for TMs inner node communication. > This proposal is to minimize usage of heuristics (by default) by introducing > a new configuration option (for example, {{taskmanager.host.bind-policy}}) > with possible values: > * {{"hostname"}} - default, use TM's host's name ({{== > InetAddress.getLocalHost().getHostName()}}; > * {{"ip"}} - use TM's host's ip address ({{== > InetAddress.getLocalHost().getHostAddress()}}); > * {{"auto-detect-hostname"}} - use the heuristics based detection mechanism. > *Note:* the configuration key and values could be named better and open for > proposals. > *Note 2:* in the future, the configuration option _may_ require to be > extended to allow choosing some specific network interface, or preference of > ipv6 vs ipv4. > h3. Rationale > [The heuristics > mechanism|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/net/ConnectionUtils.java#L364-L475] > tries to establish a probe connection to {{jobmanager.rpc.address}} from > different network interface addresses. > In case of parallel setups (when JM and multiple TMs start simultaneously, > in parallel), this depends on timing, assigned network ip addresses and may > end up with "non-uniform" address bindings of TMs (some may be "lucky" to > pick up non default network interface, some would fallback to > {{InetAddress.getLocalHost().getHostName()}}. At the end, it's less obvious > and transparent which binding address a TM picks up. > In practice, it's possible that in majority of cases (in well setup > environments) the heuristics mechanism returns a result that matches > {{InetAddress.getLocalHost()}}. The proposal is to stick with this more > simpler and explicit binding (by default), avoiding non-determinism of > heuristics. > The old mechanism is kept available, in case if it is useful in some setups. > But would require explicit configuration setting. > Additionally, this proposal extends "auto configuration" option by allowing > users to choose the host's ip address (instead of hostname). This may be > convenient in situations where the TMs' machines are not necessary reachable > via DNS (for example in a Kubernetes setup). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14149) Introduce ZooKeeperLeaderElectionServiceNG
[ https://issues.apache.org/jira/browse/FLINK-14149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-14149: -- Description: Subsequent to the discussion in FLINK-10333, we reach a consensus that refactor ZK based storage with a transaction store mechanism. The overall design can be found in the design document linked below. This subtask is aimed at introducing the prerequisite to adopt transaction store, i.e., a new leader election service for ZK scenario. The necessity is that we have to retrieve the corresponding latch path per contender following the algorithm describe in FLINK-10333. Here is the (descriptive) details about the implementation. We adopt the optimized version of [this recipe|https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection][1]. Code details can be found in [this branch|https://github.com/TisonKun/flink/tree/election-service] and the state machine can be found in the design document attached. Here is only the most important difference from the former implementation: *Leader election is an one-shot service.* Specifically, we only create one latch for a specific contender. We tolerate {{SUSPENDED}} a.k.a. {{CONNECTIONLOSS}} so that the only situation we lost leadership is session expired, which infers the ephemeral latch znode is deleted. We don't re-participant as contender so after {{revokeLeadership}} a contender will never be granted any more. This is not a problem but we can do further refactor in contender side for better behavior. Another topic is about interface. Back to the big picture of FLINK-10333 we eventually use a transaction store for persisting job graph and checkpoint and so on. So there will be a {{getLeaderStore}} method added on {{LeaderElectionServices}}. Because we don't use it at all it is an open question that whether we add the method to the interface in this subtask. And if so, whether we implement it for other election services implementation. {{concealLeaderInfo}} is another method appeared in the document that aimed at clean up leader info node on stop. So the same problem as {{getLeaderStore}}. **For what we gain** 1. Basics for the overall goal under FLINK-10333 2. Leader info node must be modified by the current leader. Thus we can reduce a lot of concurrency handling logic in currently ZLES, including using {{NodeCache}} as well as dealing with complex stat of ephemeral leader info node. [1] For other implementation, I start [a thread|https://lists.apache.org/x/thread.html/594b66ecb1d60b560a5c4c08ed1b2a67bc29143cb4e8d368da8c39b2@%3Cuser.zookeeper.apache.org%3E] in ZK and Curator to discuss. Anyway, it will be implementation details only, and interfaces and semantics should not be affected. was: Subsequent to the discussion in FLINK-10333, we reach a consensus that refactor ZK based storage with a transaction store mechanism. The overall design can be found in the design document linked below. This subtask is aimed at introducing the prerequisite to adopt transaction store, i.e., a new leader election service for ZK scenario. The necessity is that we have to retrieve the corresponding latch path per contender following the algorithm describe in FLINK-10333. Here is the (descriptive) details about the implementation. We adopt the optimized version of [this recipe|https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection][1]. Code details can be found in [this branch|https://github.com/TisonKun/flink/tree/election-service] and the state machine can be found in the design document attached. Here is only the most important two differences from the former implementation: (1) *Leader election is an one-shot service.* Specifically, we only create one latch for a specific contender. We tolerate {{SUSPENDED}} a.k.a. {{CONNECTIONLOSS}} so that the only situation we lost leadership is session expired, which infers the ephemeral latch znode is deleted. We don't re-participant as contender so after {{revokeLeadership}} a contender will never be granted any more. This is not a problem but we can do further refactor in contender side for better behavior. (2) *Leader info znode is {{PERSISTENT}}.* It is because we now regard create/setData to leader info znode a leader-only operation and thus do it in a transaction. If we keep using ephemeral znode it is hard to test. Because we share ZK client so the ephemeral znode is not deleted so that we should deal with complex znode stat that transaction cannot simply deal with. And since znode is {{PERSISTENT}} we introduce a {{concealLeaderInfo}} method called back on contender stop to clean up. Another topic is about interface. Back to the big picture of FLINK-10333 we eventually use a transaction store for persisting job graph and checkpoint and so on. So there will be a {{getLeaderStore}} method added on {{LeaderElectionServices}}. Because
[jira] [Issue Comment Deleted] (FLINK-13567) Avro Confluent Schema Registry nightly end-to-end test failed on Travis
[ https://issues.apache.org/jira/browse/FLINK-13567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-13567: -- Comment: was deleted (was: [~till.rohrmann] you can try out the patch attached. I don't have a test env now :() > Avro Confluent Schema Registry nightly end-to-end test failed on Travis > --- > > Key: FLINK-13567 > URL: https://issues.apache.org/jira/browse/FLINK-13567 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Tests >Affects Versions: 1.8.2, 1.9.0, 1.10.0 >Reporter: Till Rohrmann >Priority: Critical > Labels: test-stability > Fix For: 1.10.0 > > > The {{Avro Confluent Schema Registry nightly end-to-end test}} failed on > Travis with > {code} > [FAIL] 'Avro Confluent Schema Registry nightly end-to-end test' failed after > 2 minutes and 11 seconds! Test exited with exit code 1 > No taskexecutor daemon (pid: 29044) is running anymore on > travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501. > No standalonesession daemon to stop on host > travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501. > rm: cannot remove > '/home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins': > No such file or directory > {code} > https://api.travis-ci.org/v3/job/567273939/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-13567) Avro Confluent Schema Registry nightly end-to-end test failed on Travis
[ https://issues.apache.org/jira/browse/FLINK-13567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-13567: -- Attachment: (was: patch.diff) > Avro Confluent Schema Registry nightly end-to-end test failed on Travis > --- > > Key: FLINK-13567 > URL: https://issues.apache.org/jira/browse/FLINK-13567 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Tests >Affects Versions: 1.8.2, 1.9.0, 1.10.0 >Reporter: Till Rohrmann >Priority: Critical > Labels: test-stability > Fix For: 1.10.0 > > > The {{Avro Confluent Schema Registry nightly end-to-end test}} failed on > Travis with > {code} > [FAIL] 'Avro Confluent Schema Registry nightly end-to-end test' failed after > 2 minutes and 11 seconds! Test exited with exit code 1 > No taskexecutor daemon (pid: 29044) is running anymore on > travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501. > No standalonesession daemon to stop on host > travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501. > rm: cannot remove > '/home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins': > No such file or directory > {code} > https://api.travis-ci.org/v3/job/567273939/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string
[ https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948773#comment-16948773 ] Zili Chen commented on FLINK-14347: --- Thanks for reporting this issue [~TsReaper] and your analysis [~trohrmann]. I notice that the instability depends on whether or not {{jobmanager.log}} has been dumped on the verification. Given that the "forbidden" string is actually expected[1] I propose we add the next line into whitelist. Locally verify when {{jobmanager.log}} dumped we find the "forbidden" string and filter out with the exclusion of expected Exception. {{Received shutdown request from YARN ResourceManager}} [1] Specifically, we call {{YarnClient.killApplication}} in {{YARNSessionFIFOITCase#runDetachedModeTest}} which always causes a shutdown request. > YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with > prohibited string > -- > > Key: FLINK-14347 > URL: https://issues.apache.org/jira/browse/FLINK-14347 > Project: Flink > Issue Type: Test > Components: Deployment / YARN, Tests >Affects Versions: 1.10.0, 1.9.1, 1.8.3 >Reporter: Caizhi Weng >Priority: Critical > Fix For: 1.10.0, 1.9.1, 1.8.3 > > > YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following > exception: > {code:java} > 14:55:27.643 [ERROR] > YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461 > Found a file > /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log > with a prohibited string (one of [Exception, Started > SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code} > Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string
[ https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-14347: - Assignee: Zili Chen > YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with > prohibited string > -- > > Key: FLINK-14347 > URL: https://issues.apache.org/jira/browse/FLINK-14347 > Project: Flink > Issue Type: Test > Components: Deployment / YARN, Tests >Affects Versions: 1.10.0, 1.9.1, 1.8.3 >Reporter: Caizhi Weng >Assignee: Zili Chen >Priority: Critical > Fix For: 1.10.0, 1.9.1, 1.8.3 > > > YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following > exception: > {code:java} > 14:55:27.643 [ERROR] > YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461 > Found a file > /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log > with a prohibited string (one of [Exception, Started > SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code} > Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string
[ https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949064#comment-16949064 ] Zili Chen commented on FLINK-14347: --- Thanks for your information [~fly_in_gis]! > YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with > prohibited string > -- > > Key: FLINK-14347 > URL: https://issues.apache.org/jira/browse/FLINK-14347 > Project: Flink > Issue Type: Test > Components: Deployment / YARN, Tests >Affects Versions: 1.10.0, 1.9.1, 1.8.3 >Reporter: Caizhi Weng >Assignee: Zili Chen >Priority: Critical > Labels: pull-request-available > Fix For: 1.10.0, 1.9.1, 1.8.3 > > Time Spent: 10m > Remaining Estimate: 0h > > YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following > exception: > {code:java} > 14:55:27.643 [ERROR] > YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461 > Found a file > /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log > with a prohibited string (one of [Exception, Started > SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code} > Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string
[ https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949065#comment-16949065 ] Zili Chen commented on FLINK-14347: --- [~trohrmann] PR submitted. > YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with > prohibited string > -- > > Key: FLINK-14347 > URL: https://issues.apache.org/jira/browse/FLINK-14347 > Project: Flink > Issue Type: Test > Components: Deployment / YARN, Tests >Affects Versions: 1.10.0, 1.9.1, 1.8.3 >Reporter: Caizhi Weng >Assignee: Zili Chen >Priority: Critical > Labels: pull-request-available > Fix For: 1.10.0, 1.9.1, 1.8.3 > > Time Spent: 10m > Remaining Estimate: 0h > > YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following > exception: > {code:java} > 14:55:27.643 [ERROR] > YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461 > Found a file > /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log > with a prohibited string (one of [Exception, Started > SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code} > Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5
[ https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949076#comment-16949076 ] Zili Chen commented on FLINK-13417: --- Hi [~sewen], [here|https://github.com/apache/flink/pull/9840] is the verifying pr according to your suggestion. > Bump Zookeeper to 3.5.5 > --- > > Key: FLINK-13417 > URL: https://issues.apache.org/jira/browse/FLINK-13417 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Konstantin Knauf >Priority: Major > Fix For: 1.10.0 > > > User might want to secure their Zookeeper connection via SSL. > This requires a Zookeeper version >= 3.5.1. We might as well try to bump it > to 3.5.5, which is the latest version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14392) Introduce JobClient API(FLIP-74)
Zili Chen created FLINK-14392: - Summary: Introduce JobClient API(FLIP-74) Key: FLINK-14392 URL: https://issues.apache.org/jira/browse/FLINK-14392 Project: Flink Issue Type: New Feature Components: Client / Job Submission, Command Line Client Reporter: Zili Chen Assignee: Zili Chen Fix For: 1.10.0 This is the umbrella issue to track all efforts toward {{JobClient}} proposed in [FLIP-74|https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14392) Introduce JobClient API(FLIP-74)
[ https://issues.apache.org/jira/browse/FLINK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-14392: -- Issue Type: New Feature (was: Task) > Introduce JobClient API(FLIP-74) > > > Key: FLINK-14392 > URL: https://issues.apache.org/jira/browse/FLINK-14392 > Project: Flink > Issue Type: New Feature > Components: Client / Job Submission, Command Line Client >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Fix For: 1.10.0 > > > This is the umbrella issue to track all efforts toward {{JobClient}} proposed > in > [FLIP-74|https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14392) Introduce JobClient API(FLIP-74)
[ https://issues.apache.org/jira/browse/FLINK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-14392: -- Issue Type: Task (was: New Feature) > Introduce JobClient API(FLIP-74) > > > Key: FLINK-14392 > URL: https://issues.apache.org/jira/browse/FLINK-14392 > Project: Flink > Issue Type: Task > Components: Client / Job Submission, Command Line Client >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Fix For: 1.10.0 > > > This is the umbrella issue to track all efforts toward {{JobClient}} proposed > in > [FLIP-74|https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14392) Introduce JobClient API(FLIP-74)
[ https://issues.apache.org/jira/browse/FLINK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-14392: -- Component/s: (was: Command Line Client) > Introduce JobClient API(FLIP-74) > > > Key: FLINK-14392 > URL: https://issues.apache.org/jira/browse/FLINK-14392 > Project: Flink > Issue Type: New Feature > Components: Client / Job Submission >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Fix For: 1.10.0 > > > This is the umbrella issue to track all efforts toward {{JobClient}} proposed > in > [FLIP-74|https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-13856) Reduce the delete file api when the checkpoint is completed
[ https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-13856: - Assignee: andrew.D.lin > Reduce the delete file api when the checkpoint is completed > --- > > Key: FLINK-13856 > URL: https://issues.apache.org/jira/browse/FLINK-13856 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / State Backends >Affects Versions: 1.8.1, 1.9.0 >Reporter: andrew.D.lin >Assignee: andrew.D.lin >Priority: Major > Labels: pull-request-available > Attachments: after.png, before.png, > f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png > > Original Estimate: 48h > Time Spent: 10m > Remaining Estimate: 47h 50m > > When the new checkpoint is completed, an old checkpoint will be deleted by > calling CompletedCheckpoint.discardOnSubsume(). > When deleting old checkpoints, follow these steps: > 1, drop the metadata > 2, discard private state objects > 3, discard location as a whole > In some cases, is it possible to delete the checkpoint folder recursively by > one call? > As far as I know the full amount of checkpoint, it should be possible to > delete the folder directly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-33357) add Apache Software License 2
[ https://issues.apache.org/jira/browse/FLINK-33357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-33357: - Assignee: 蔡灿材 > add Apache Software License 2 > - > > Key: FLINK-33357 > URL: https://issues.apache.org/jira/browse/FLINK-33357 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.6.0 >Reporter: 蔡灿材 >Assignee: 蔡灿材 >Priority: Minor > Labels: pull-request-available > Fix For: kubernetes-operator-1.5.0 > > Attachments: 2023-10-25 12-08-58屏幕截图.png > > > Flinkdeployments.flink.apache.org - v1. Currently yml and > flinksessionjobs.flink.apache.org - v1. Yml don't > add add Apache Software License 2 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33602) Pulsar connector should be compatible with Flink 1.18
Zili Chen created FLINK-33602: - Summary: Pulsar connector should be compatible with Flink 1.18 Key: FLINK-33602 URL: https://issues.apache.org/jira/browse/FLINK-33602 Project: Flink Issue Type: Bug Components: Connectors / Pulsar Affects Versions: 1.18.0, pulsar-4.1.0 Reporter: Zili Chen Assignee: Zili Chen Currently, the build and test job always fails - https://github.com/apache/flink-connector-pulsar/actions/runs/6937440214 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-33400) Pulsar connector doesn't compile for Flink 1.18 due to Archunit update
[ https://issues.apache.org/jira/browse/FLINK-33400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen closed FLINK-33400. - Fix Version/s: pulsar-4.1.0 Assignee: Zili Chen (was: Martijn Visser) Resolution: Fixed master via 707e49472d557bafa58013c17e3194b64fb4b3ef > Pulsar connector doesn't compile for Flink 1.18 due to Archunit update > -- > > Key: FLINK-33400 > URL: https://issues.apache.org/jira/browse/FLINK-33400 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Affects Versions: pulsar-4.0.1 >Reporter: Martijn Visser >Assignee: Zili Chen >Priority: Blocker > Labels: pull-request-available > Fix For: pulsar-4.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33970) Add necessary checks for connector document
[ https://issues.apache.org/jira/browse/FLINK-33970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-33970: -- Fix Version/s: pulsar-4.2.0 pulsar-4.1.1 > Add necessary checks for connector document > --- > > Key: FLINK-33970 > URL: https://issues.apache.org/jira/browse/FLINK-33970 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: Leonard Xu >Assignee: Zhongqiang Gong >Priority: Major > Labels: pull-request-available > Fix For: pulsar-4.2.0, pulsar-4.1.1 > > > In FLINK-33964, we found the documentation files in independent connector > repos lacks basic checks like broken url, this ticket aims to add necessary > checks and avoid similar issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-33970) Add necessary checks for connector document
[ https://issues.apache.org/jira/browse/FLINK-33970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen closed FLINK-33970. - Resolution: Fixed master via 5b70da8f88e21057a5c590d139eab558f87e5dca Thanks a lot [~gongzhongqiang]! > Add necessary checks for connector document > --- > > Key: FLINK-33970 > URL: https://issues.apache.org/jira/browse/FLINK-33970 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: Leonard Xu >Assignee: Zhongqiang Gong >Priority: Major > Labels: pull-request-available > > In FLINK-33964, we found the documentation files in independent connector > repos lacks basic checks like broken url, this ticket aims to add necessary > checks and avoid similar issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-34629) Pulsar source lost topic subscribe
[ https://issues.apache.org/jira/browse/FLINK-34629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved FLINK-34629. --- Fix Version/s: pulsar-4.2.0 Resolution: Fixed Master via https://github.com/apache/flink-connector-pulsar/commit/7a5eef268cb3f598589ad9cc32648ac92fbbee1d > Pulsar source lost topic subscribe > -- > > Key: FLINK-34629 > URL: https://issues.apache.org/jira/browse/FLINK-34629 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Affects Versions: pulsar-3.0.1 >Reporter: WangMinChao >Assignee: WangMinChao >Priority: Major > Labels: pull-request-available > Fix For: pulsar-4.2.0 > > > The non-partition pulsar topic partition id is `-1`, using multiples of the > non-partition topics > in Pulsar source maybe lose topic subscribe. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-33884) Update Pulsar dependency to 3.0.2 in Pulsar Connector
[ https://issues.apache.org/jira/browse/FLINK-33884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved FLINK-33884. --- Fix Version/s: pulsar-4.2.0 Resolution: Fixed Master via https://github.com/apache/flink-connector-pulsar/commit/9f4b902c2a478d0105eec1e32bac3ea40f318d00 > Update Pulsar dependency to 3.0.2 in Pulsar Connector > - > > Key: FLINK-33884 > URL: https://issues.apache.org/jira/browse/FLINK-33884 > Project: Flink > Issue Type: Improvement > Components: Connectors / Pulsar >Affects Versions: pulsar-4.0.1 >Reporter: David Christle >Assignee: David Christle >Priority: Major > Labels: pull-request-available > Fix For: pulsar-4.2.0 > > > The [3.0.2 > patch|https://pulsar.apache.org/release-notes/versioned/pulsar-3.0.2/] > includes various bug fixes, including a few for the Pulsar client (e.g. > [link]([https://github.com/apache/pulsar/pull/21144)). Upgrading the > dependency in the connector will pick up these fixes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-34962) flink-connector-pulsa starts failed due to incorrect use of Pulsar API: LookupService. getPartitionedTopicMetadata
[ https://issues.apache.org/jira/browse/FLINK-34962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved FLINK-34962. --- Fix Version/s: pulsar-4.2.0 Resolution: Fixed master via https://github.com/apache/flink-connector-pulsar/commit/7340f713422b1734e84ec0602f154441b8da7fab > flink-connector-pulsa starts failed due to incorrect use of Pulsar API: > LookupService. getPartitionedTopicMetadata > -- > > Key: FLINK-34962 > URL: https://issues.apache.org/jira/browse/FLINK-34962 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Affects Versions: pulsar-4.2.0, pulsar-4.1.1 > Environment: * flink 1.17 > * pulsar client 3.0.0 > * org.apache.flink:flink-connector-pulsar:4.1.0-1.17 (connector) >Reporter: Yubiao Feng >Priority: Major > Labels: easyfix, pull-request-available > Fix For: pulsar-4.2.0 > > > - The unnecessary codes calls > `pulsarClient.getLookup().getPartitionedTopicMetadata()` to create the > partitioned topic metadata(in fact, this behavior of is not correct) > - Why it is unnecessary: the [following > code]([https://github.com/apache/flink-connector-pulsar/blob/main/flink-connector-pulsar/src/main/java/org/apache/flink/connector/pulsar/sink/writer/topic/ProducerRegister.java#L245]) > that is creating a producer will also trigger partitioned topic metadata to > create. > - The method `pulsarClient.getLookup().getPartitionedTopicMetadata()` will > not retry if the connection is closed so that users will get an error. The > following code creates a producer that will retry if the connection is > closed, reducing the probability of an error occurring. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34962) flink-connector-pulsa starts failed due to incorrect use of Pulsar API: LookupService. getPartitionedTopicMetadata
[ https://issues.apache.org/jira/browse/FLINK-34962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-34962: -- Affects Version/s: (was: pulsar-4.1.1) > flink-connector-pulsa starts failed due to incorrect use of Pulsar API: > LookupService. getPartitionedTopicMetadata > -- > > Key: FLINK-34962 > URL: https://issues.apache.org/jira/browse/FLINK-34962 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar > Environment: * flink 1.17 > * pulsar client 3.0.0 > * org.apache.flink:flink-connector-pulsar:4.1.0-1.17 (connector) >Reporter: Yubiao Feng >Priority: Major > Labels: easyfix, pull-request-available > Fix For: pulsar-4.2.0 > > > - The unnecessary codes calls > `pulsarClient.getLookup().getPartitionedTopicMetadata()` to create the > partitioned topic metadata(in fact, this behavior of is not correct) > - Why it is unnecessary: the [following > code]([https://github.com/apache/flink-connector-pulsar/blob/main/flink-connector-pulsar/src/main/java/org/apache/flink/connector/pulsar/sink/writer/topic/ProducerRegister.java#L245]) > that is creating a producer will also trigger partitioned topic metadata to > create. > - The method `pulsarClient.getLookup().getPartitionedTopicMetadata()` will > not retry if the connection is closed so that users will get an error. The > following code creates a producer that will retry if the connection is > closed, reducing the probability of an error occurring. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34962) flink-connector-pulsa starts failed due to incorrect use of Pulsar API: LookupService. getPartitionedTopicMetadata
[ https://issues.apache.org/jira/browse/FLINK-34962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-34962: -- Affects Version/s: (was: pulsar-4.2.0) > flink-connector-pulsa starts failed due to incorrect use of Pulsar API: > LookupService. getPartitionedTopicMetadata > -- > > Key: FLINK-34962 > URL: https://issues.apache.org/jira/browse/FLINK-34962 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Affects Versions: pulsar-4.1.1 > Environment: * flink 1.17 > * pulsar client 3.0.0 > * org.apache.flink:flink-connector-pulsar:4.1.0-1.17 (connector) >Reporter: Yubiao Feng >Priority: Major > Labels: easyfix, pull-request-available > Fix For: pulsar-4.2.0 > > > - The unnecessary codes calls > `pulsarClient.getLookup().getPartitionedTopicMetadata()` to create the > partitioned topic metadata(in fact, this behavior of is not correct) > - Why it is unnecessary: the [following > code]([https://github.com/apache/flink-connector-pulsar/blob/main/flink-connector-pulsar/src/main/java/org/apache/flink/connector/pulsar/sink/writer/topic/ProducerRegister.java#L245]) > that is creating a producer will also trigger partitioned topic metadata to > create. > - The method `pulsarClient.getLookup().getPartitionedTopicMetadata()` will > not retry if the connection is closed so that users will get an error. The > following code creates a producer that will retry if the connection is > closed, reducing the probability of an error occurring. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-32645) Flink pulsar sink is having poor performance
[ https://issues.apache.org/jira/browse/FLINK-32645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved FLINK-32645. --- Resolution: Fixed > Flink pulsar sink is having poor performance > > > Key: FLINK-32645 > URL: https://issues.apache.org/jira/browse/FLINK-32645 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Affects Versions: 1.16.2 > Environment: !Screenshot 2023-07-22 at 1.59.42 PM.png!!Screenshot > 2023-07-22 at 2.03.53 PM.png! > >Reporter: Vijaya Bhaskar V >Assignee: Zili Chen >Priority: Major > Fix For: pulsar-3.0.2 > > Attachments: Screenshot 2023-07-22 at 2.03.53 PM.png, Screenshot > 2023-07-22 at 2.56.55 PM.png, Screenshot 2023-07-22 at 3.45.21 PM-1.png, > Screenshot 2023-07-22 at 3.45.21 PM.png, pom.xml > > > Found following issue with flink pulsar sink: > > Flink pulsar sink is always waiting while enqueueing the message and making > the task slot busy no matter how many free slots we provide. Attached the > screen shot of the same > Just sending messages of less rate 8k msg/sec and stand alone flink job with > discarding sink is able to receive full rate if 8K msg/sec > Where as pulsar sink was consuming only upto 2K msg/sec and the sink is > always busy waiting. Snapshot of thread dump attached. > Also snap shot of flink stream graph attached > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-35182) Bump org.apache.commons:commons-compress from 1.24.0 to 1.26.1 for Flink Pulsar connector
[ https://issues.apache.org/jira/browse/FLINK-35182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved FLINK-35182. --- Fix Version/s: pulsar-4.2.0 Assignee: Zhongqiang Gong Resolution: Fixed master via https://github.com/apache/flink-connector-pulsar/commit/b37a8b32f30683664ff25888d403c4de414043e1 > Bump org.apache.commons:commons-compress from 1.24.0 to 1.26.1 for Flink > Pulsar connector > - > > Key: FLINK-35182 > URL: https://issues.apache.org/jira/browse/FLINK-35182 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / Pulsar >Reporter: Zhongqiang Gong >Assignee: Zhongqiang Gong >Priority: Minor > Labels: pull-request-available > Fix For: pulsar-4.2.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33019) Pulsar tests hangs during nightly builds
[ https://issues.apache.org/jira/browse/FLINK-33019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762412#comment-17762412 ] Zili Chen commented on FLINK-33019: --- [~martijnvisser] It seems all for the SNAPSHOT version and with JDK 11. Is there anything that can be a (internal) breaking change with this property? > Pulsar tests hangs during nightly builds > > > Key: FLINK-33019 > URL: https://issues.apache.org/jira/browse/FLINK-33019 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Reporter: Martijn Visser >Priority: Blocker > > https://github.com/apache/flink-connector-pulsar/actions/runs/6067569890/job/16459404675#step:13:25195 > The thread dump shows multiple parked/sleeping threads. No clear indicator of > what's wrong -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode
[ https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762599#comment-17762599 ] Zili Chen commented on FLINK-33053: --- The recipe in use is {{TreeCache}}, which doesn't change from 5.0.0. And it also closes watches on {{close}}. Do you have a bisect which version introduced this regression? > Watcher leak in Zookeeper HA mode > - > > Key: FLINK-33053 > URL: https://issues.apache.org/jira/browse/FLINK-33053 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.0, 1.17.1 >Reporter: Yangze Guo >Priority: Critical > > We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA > mode. TM's watches on the leader of JobMaster has not been stopped after job > finished. > Here is how we re-produce this issue: > - Start a session cluster and enable Zookeeper HA mode. > - Continuously and concurrently submit short queries, e.g. WordCount to the > cluster. > - echo -n wchp | nc \{zk host} \{zk port} to get current watches. > We can see a lot of watches on > /flink/\{cluster_name}/leader/\{job_id}/connection_info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode
[ https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762657#comment-17762657 ] Zili Chen commented on FLINK-33053: --- Perhaps you can enable debug logs and check "Removing watcher for path: " from Curator to see if the related watchers are issued removing. > Watcher leak in Zookeeper HA mode > - > > Key: FLINK-33053 > URL: https://issues.apache.org/jira/browse/FLINK-33053 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.0, 1.17.1 >Reporter: Yangze Guo >Priority: Critical > > We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA > mode. TM's watches on the leader of JobMaster has not been stopped after job > finished. > Here is how we re-produce this issue: > - Start a session cluster and enable Zookeeper HA mode. > - Continuously and concurrently submit short queries, e.g. WordCount to the > cluster. > - echo -n wchp | nc \{zk host} \{zk port} to get current watches. > We can see a lot of watches on > /flink/\{cluster_name}/leader/\{job_id}/connection_info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode
[ https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762965#comment-17762965 ] Zili Chen commented on FLINK-33053: --- The log seems trimed. I saw: 2023-09-08 11:09:03,738 DEBUG org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.WatcherRemovalManager [] - Removing watcher for path: /flink/flink-native-test-117/leader/7db5c7316828f598234677e2169e7b0f/connection_info So the TM has issued watcher removal request. > Watcher leak in Zookeeper HA mode > - > > Key: FLINK-33053 > URL: https://issues.apache.org/jira/browse/FLINK-33053 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.0, 1.17.1 >Reporter: Yangze Guo >Priority: Critical > Attachments: 26.dump.zip, 26.log > > > We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA > mode. TM's watches on the leader of JobMaster has not been stopped after job > finished. > Here is how we re-produce this issue: > - Start a session cluster and enable Zookeeper HA mode. > - Continuously and concurrently submit short queries, e.g. WordCount to the > cluster. > - echo -n wchp | nc \{zk host} \{zk port} to get current watches. > We can see a lot of watches on > /flink/\{cluster_name}/leader/\{job_id}/connection_info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33019) Pulsar tests hangs during nightly builds
[ https://issues.apache.org/jira/browse/FLINK-33019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763587#comment-17763587 ] Zili Chen commented on FLINK-33019: --- It seems we can sometimes pass the test https://github.com/apache/flink-connector-pulsar/actions/runs/6133935359 So perhaps it's because we add the new SQL connector tests and it takes more time to complete and then cause a trivial timeout? > Pulsar tests hangs during nightly builds > > > Key: FLINK-33019 > URL: https://issues.apache.org/jira/browse/FLINK-33019 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Reporter: Martijn Visser >Priority: Blocker > > https://github.com/apache/flink-connector-pulsar/actions/runs/6067569890/job/16459404675#step:13:25195 > The thread dump shows multiple parked/sleeping threads. No clear indicator of > what's wrong -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-33019) Pulsar tests hangs during nightly builds
[ https://issues.apache.org/jira/browse/FLINK-33019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763587#comment-17763587 ] Zili Chen edited comment on FLINK-33019 at 9/11/23 7:57 AM: It seems we can sometimes pass the test https://github.com/apache/flink-connector-pulsar/actions/runs/6133935359 So perhaps it's because we add the new SQL connector tests whose takes more time to complete and then cause a trivial timeout? was (Author: tison): It seems we can sometimes pass the test https://github.com/apache/flink-connector-pulsar/actions/runs/6133935359 So perhaps it's because we add the new SQL connector tests and it takes more time to complete and then cause a trivial timeout? > Pulsar tests hangs during nightly builds > > > Key: FLINK-33019 > URL: https://issues.apache.org/jira/browse/FLINK-33019 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Reporter: Martijn Visser >Priority: Blocker > > https://github.com/apache/flink-connector-pulsar/actions/runs/6067569890/job/16459404675#step:13:25195 > The thread dump shows multiple parked/sleeping threads. No clear indicator of > what's wrong -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-33019) Pulsar tests hangs during nightly builds
[ https://issues.apache.org/jira/browse/FLINK-33019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763587#comment-17763587 ] Zili Chen edited comment on FLINK-33019 at 9/11/23 7:57 AM: It seems we can sometimes pass the test https://github.com/apache/flink-connector-pulsar/actions/runs/6133935359 So perhaps it's because we add the new SQL connector whose tests takes more time to complete and then cause a trivial timeout? was (Author: tison): It seems we can sometimes pass the test https://github.com/apache/flink-connector-pulsar/actions/runs/6133935359 So perhaps it's because we add the new SQL connector tests whose takes more time to complete and then cause a trivial timeout? > Pulsar tests hangs during nightly builds > > > Key: FLINK-33019 > URL: https://issues.apache.org/jira/browse/FLINK-33019 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Reporter: Martijn Visser >Priority: Blocker > > https://github.com/apache/flink-connector-pulsar/actions/runs/6067569890/job/16459404675#step:13:25195 > The thread dump shows multiple parked/sleeping threads. No clear indicator of > what's wrong -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode
[ https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764702#comment-17764702 ] Zili Chen commented on FLINK-33053: --- I noticed that the {{TreeCache}}'s close call {{removeWatches}} instead of {{removeAllWatches}} called by your scripts above. {{removeWatches}} only remove the watcher in client side so remain the server side watcher as is. > Watcher leak in Zookeeper HA mode > - > > Key: FLINK-33053 > URL: https://issues.apache.org/jira/browse/FLINK-33053 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.0, 1.18.0, 1.17.1 >Reporter: Yangze Guo >Priority: Blocker > Attachments: 26.dump.zip, 26.log, > taskmanager_flink-native-test-117-taskmanager-1-9_thread_dump (1).json > > > We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA > mode. TM's watches on the leader of JobMaster has not been stopped after job > finished. > Here is how we re-produce this issue: > - Start a session cluster and enable Zookeeper HA mode. > - Continuously and concurrently submit short queries, e.g. WordCount to the > cluster. > - echo -n wchp | nc \{zk host} \{zk port} to get current watches. > We can see a lot of watches on > /flink/\{cluster_name}/leader/\{job_id}/connection_info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode
[ https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764721#comment-17764721 ] Zili Chen commented on FLINK-33053: --- See https://lists.apache.org/thread/3b9hn9j4c05yfztlr2zcctbg7sqwdh58. This seems to be a ZK issue that I met one year ago.. > Watcher leak in Zookeeper HA mode > - > > Key: FLINK-33053 > URL: https://issues.apache.org/jira/browse/FLINK-33053 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.0, 1.18.0, 1.17.1 >Reporter: Yangze Guo >Priority: Blocker > Attachments: 26.dump.zip, 26.log, > taskmanager_flink-native-test-117-taskmanager-1-9_thread_dump (1).json > > > We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA > mode. TM's watches on the leader of JobMaster has not been stopped after job > finished. > Here is how we re-produce this issue: > - Start a session cluster and enable Zookeeper HA mode. > - Continuously and concurrently submit short queries, e.g. WordCount to the > cluster. > - echo -n wchp | nc \{zk host} \{zk port} to get current watches. > We can see a lot of watches on > /flink/\{cluster_name}/leader/\{job_id}/connection_info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode
[ https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764723#comment-17764723 ] Zili Chen commented on FLINK-33053: --- But we don't have other shared watchers so we can force remove watches as above. > Watcher leak in Zookeeper HA mode > - > > Key: FLINK-33053 > URL: https://issues.apache.org/jira/browse/FLINK-33053 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.0, 1.18.0, 1.17.1 >Reporter: Yangze Guo >Priority: Blocker > Attachments: 26.dump.zip, 26.log, > taskmanager_flink-native-test-117-taskmanager-1-9_thread_dump (1).json > > > We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA > mode. TM's watches on the leader of JobMaster has not been stopped after job > finished. > Here is how we re-produce this issue: > - Start a session cluster and enable Zookeeper HA mode. > - Continuously and concurrently submit short queries, e.g. WordCount to the > cluster. > - echo -n wchp | nc \{zk host} \{zk port} to get current watches. > We can see a lot of watches on > /flink/\{cluster_name}/leader/\{job_id}/connection_info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode
[ https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764948#comment-17764948 ] Zili Chen commented on FLINK-33053: --- No. Both {{CuratorCache}} and {{TreeCache}} doesn't "own" the path so it's unclear if other recipes share the same client (connection) set up watches also. This is different from {{LeaderLatch}} which owns the path so it can ensure that no one else (should) access the related nodes. > Watcher leak in Zookeeper HA mode > - > > Key: FLINK-33053 > URL: https://issues.apache.org/jira/browse/FLINK-33053 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.0, 1.18.0, 1.17.1 >Reporter: Yangze Guo >Assignee: Yangze Guo >Priority: Blocker > Attachments: 26.dump.zip, 26.log, > taskmanager_flink-native-test-117-taskmanager-1-9_thread_dump (1).json > > > We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA > mode. TM's watches on the leader of JobMaster has not been stopped after job > finished. > Here is how we re-produce this issue: > - Start a session cluster and enable Zookeeper HA mode. > - Continuously and concurrently submit short queries, e.g. WordCount to the > cluster. > - echo -n wchp | nc \{zk host} \{zk port} to get current watches. > We can see a lot of watches on > /flink/\{cluster_name}/leader/\{job_id}/connection_info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33053) Watcher leak in Zookeeper HA mode
[ https://issues.apache.org/jira/browse/FLINK-33053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764950#comment-17764950 ] Zili Chen commented on FLINK-33053: --- But it's possible to add an option to explicitly identify the ownership. You can open an issue on the Curator JIRA project and let me with the other maintainers to figure it out. > Watcher leak in Zookeeper HA mode > - > > Key: FLINK-33053 > URL: https://issues.apache.org/jira/browse/FLINK-33053 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.0, 1.18.0, 1.17.1 >Reporter: Yangze Guo >Assignee: Yangze Guo >Priority: Blocker > Attachments: 26.dump.zip, 26.log, > taskmanager_flink-native-test-117-taskmanager-1-9_thread_dump (1).json > > > We observe a watcher leak in our OLAP stress test when enabling Zookeeper HA > mode. TM's watches on the leader of JobMaster has not been stopped after job > finished. > Here is how we re-produce this issue: > - Start a session cluster and enable Zookeeper HA mode. > - Continuously and concurrently submit short queries, e.g. WordCount to the > cluster. > - echo -n wchp | nc \{zk host} \{zk port} to get current watches. > We can see a lot of watches on > /flink/\{cluster_name}/leader/\{job_id}/connection_info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33111) Flink Pulsar Connector to Pulsar Client Version Mismatch
[ https://issues.apache.org/jira/browse/FLINK-33111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767131#comment-17767131 ] Zili Chen commented on FLINK-33111: --- We upgrade the Pulsar client version in https://github.com/apache/flink-connector-pulsar/pull/25 which brings benefits for the new version. Perhaps we should update the document and people who use Pulsar 2.10.x can use 3.x connector. > Flink Pulsar Connector to Pulsar Client Version Mismatch > > > Key: FLINK-33111 > URL: https://issues.apache.org/jira/browse/FLINK-33111 > Project: Flink > Issue Type: Bug > Components: Connectors / Pulsar >Affects Versions: 1.17.1 >Reporter: Jason Kania >Priority: Major > > In the documentation for the Flink Pulsar Connector, > ([https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/connectors/datastream/pulsar/]) > it indicates that 2.10.0 and above versions of the pulsar client are > supported "You can use the connector with the Pulsar 2.10.0 or higher" and > the pom file entry references the 4.0.0-1.17 version of the connector which > points to the 2.11.0 version of the Pulsar client. However, when using Pulsar > Client 2.10.4 or 2.10.5, the following error is generated: > > java.lang.NoSuchMethodError: 'org.apache.pulsar.client.api.ClientBuilder > org.apache.pulsar.client.api.ClientBuilder.connectionMaxIdleSeconds(int)' > at > org.apache.flink.connector.pulsar.common.config.PulsarClientFactory.createClient(PulsarClientFactory.java:127) > at > org.apache.flink.connector.pulsar.source.reader.PulsarSourceReader.create(PulsarSourceReader.java:266) > at > org.apache.flink.connector.pulsar.source.PulsarSource.createReader(PulsarSource.java:137) > at > org.apache.flink.streaming.api.operators.SourceOperator.initReader(SourceOperator.java:312) > at > org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask.init(SourceOperatorStreamTask.java:93) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:699) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:675) > at > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952) > at > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > at java.base/java.lang.Thread.run(Thread.java:829) > > The referenced method 'connectionMaxIdleSeconds' is only available in the > Pulsar 2.11 client when looking at the source code. I am not sure whether the > documentation is wrong and the Flink Pulsar Connector 2.11 is the intended > Pulsar version. However, my understanding is that Pulsar 2.11 is targeted > toward java 17. This would create the need for mixed Java 11 and Java 17 > deployment unless the Pulsar client code is compiled for 2.11. > > Documentation cleanup and a reference to the appropriate Java versions is > needed. A fix to the 1.17.1 Flink pulsar connector may alternatively be > required. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-32938) flink-connector-pulsar should remove all `PulsarAdmin` calls
[ https://issues.apache.org/jira/browse/FLINK-32938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-32938: - Assignee: Neng Lu > flink-connector-pulsar should remove all `PulsarAdmin` calls > > > Key: FLINK-32938 > URL: https://issues.apache.org/jira/browse/FLINK-32938 > Project: Flink > Issue Type: Improvement > Components: Connectors / Pulsar >Reporter: Neng Lu >Assignee: Neng Lu >Priority: Major > Labels: pull-request-available > > The flink-connector-pulsar should not access and interact with the admin > endpoint. This could introduce potential security issues. > In a production environment, a Pulsar cluster admin will not grant the > permissions for the flink application to conduct any admin operations. > Currently, the connector does various admin calls: > ```{{{}{}}}{{{}{}}} > PulsarAdmin.topics().getPartitionedTopicMetadata(topic) > PulsarAdmin.namespaces().getTopics(namespace) > PulsarAdmin.topics().getLastMessageId(topic) > PulsarAdmin.topics().getMessageIdByTimestamp(topic, timestamp) > PulsarAdmin.topics().getSubscriptions(topic) > PulsarAdmin.topics().createSubscription(topic, subscription, > MessageId.earliest) > PulsarAdmin.topics().resetCursor(topic, subscription, initial, !include) > ``` > We need to replace these calls with consumer or client calls. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-32938) flink-connector-pulsar should remove all `PulsarAdmin` calls
[ https://issues.apache.org/jira/browse/FLINK-32938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved FLINK-32938. --- Fix Version/s: pulsar-4.1.0 Resolution: Fixed master via 78d00ea9e3e278d4ce2fbb0c8a8d380abef7b858 > flink-connector-pulsar should remove all `PulsarAdmin` calls > > > Key: FLINK-32938 > URL: https://issues.apache.org/jira/browse/FLINK-32938 > Project: Flink > Issue Type: Improvement > Components: Connectors / Pulsar >Reporter: Neng Lu >Assignee: Neng Lu >Priority: Major > Labels: pull-request-available > Fix For: pulsar-4.1.0 > > > The flink-connector-pulsar should not access and interact with the admin > endpoint. This could introduce potential security issues. > In a production environment, a Pulsar cluster admin will not grant the > permissions for the flink application to conduct any admin operations. > Currently, the connector does various admin calls: > ```{{{}{}}}{{{}{}}} > PulsarAdmin.topics().getPartitionedTopicMetadata(topic) > PulsarAdmin.namespaces().getTopics(namespace) > PulsarAdmin.topics().getLastMessageId(topic) > PulsarAdmin.topics().getMessageIdByTimestamp(topic, timestamp) > PulsarAdmin.topics().getSubscriptions(topic) > PulsarAdmin.topics().createSubscription(topic, subscription, > MessageId.earliest) > PulsarAdmin.topics().resetCursor(topic, subscription, initial, !include) > ``` > We need to replace these calls with consumer or client calls. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-20424) The percent of acknowledged checkpoint seems incorrect
[ https://issues.apache.org/jira/browse/FLINK-20424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-20424: - Assignee: Andrew.D.lin > The percent of acknowledged checkpoint seems incorrect > -- > > Key: FLINK-20424 > URL: https://issues.apache.org/jira/browse/FLINK-20424 > Project: Flink > Issue Type: Improvement > Components: Runtime / Web Frontend >Reporter: zlzhang0122 >Assignee: Andrew.D.lin >Priority: Minor > Attachments: 2020-11-30 14-18-34 的屏幕截图.png > > > As the picture below, the percent of acknowledged checkpoint seems > incorrect.I think the number must not be 100% because one of the checkpoint > acknowledge was failed. > !2020-11-30 14-18-34 的屏幕截图.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20424) The percent of acknowledged checkpoint seems incorrect
[ https://issues.apache.org/jira/browse/FLINK-20424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242998#comment-17242998 ] Zili Chen commented on FLINK-20424: --- @andrew_lin Go ahead and remember update status when you start progress. > The percent of acknowledged checkpoint seems incorrect > -- > > Key: FLINK-20424 > URL: https://issues.apache.org/jira/browse/FLINK-20424 > Project: Flink > Issue Type: Improvement > Components: Runtime / Web Frontend >Reporter: zlzhang0122 >Assignee: Andrew.D.lin >Priority: Minor > Attachments: 2020-11-30 14-18-34 的屏幕截图.png > > > As the picture below, the percent of acknowledged checkpoint seems > incorrect.I think the number must not be 100% because one of the checkpoint > acknowledge was failed. > !2020-11-30 14-18-34 的屏幕截图.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20139) Enrich logs when MiniDispatcher shutting down
Zili Chen created FLINK-20139: - Summary: Enrich logs when MiniDispatcher shutting down Key: FLINK-20139 URL: https://issues.apache.org/jira/browse/FLINK-20139 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Reporter: Zili Chen Assignee: Zili Chen -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-20139) Enrich logs when MiniDispatcher shutting down
[ https://issues.apache.org/jira/browse/FLINK-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen closed FLINK-20139. - Fix Version/s: 1.13.0 Resolution: Fixed master via 3dc43e6fa66d1253a37e8a4bcf242b6865e340a5 > Enrich logs when MiniDispatcher shutting down > - > > Key: FLINK-20139 > URL: https://issues.apache.org/jira/browse/FLINK-20139 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Reporter: Zili Chen >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069163#comment-17069163 ] Zili Chen commented on FLINK-16626: --- Could you reproduce that issue and share DEBUG level JM log? > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > {quote} > Actually, the job was cancelled. But the server also pr
[jira] [Comment Edited] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069163#comment-17069163 ] Zili Chen edited comment on FLINK-16626 at 3/28/20, 1:19 AM: - Could you reproduce that issue and share TRACE level JM log? Some important trace of {{AbstractHandler}} is required to locate the root cause. was (Author: tison): Could you reproduce that issue and share DEBUG level JM log? > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.
[jira] [Updated] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-16626: -- Attachment: jobmanager.log > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > {quote} > Actually, the job was cancelled. But the server also prints some exception: > > {quote}2020-03-17 12:
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069184#comment-17069184 ] Zili Chen commented on FLINK-16626: --- I've submitted a local YARN testing log, which indicates that "finalizeRequestProcessing" happens after cluster shutdown. Will investigate further later. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069188#comment-17069188 ] Zili Chen commented on FLINK-16626: --- For coming developer: you can duplicate {{YARNITCase}}, insert a cancel command, and add log for locally debugging. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFron
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069221#comment-17069221 ] Zili Chen commented on FLINK-16626: --- I found that {{org.apache.flink.runtime.rest.handler.InFlightRequestTracker#awaitAsync()}} is called more than once which cause {{Phaser}} synchronization meaningless. Further debugging... > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:191
[jira] [Updated] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-16626: -- Attachment: jobmanager.log > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > {quote} > Actually, the job was cancelled. But the server also prints some exception: > > {quote}20
[jira] [Updated] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-16626: -- Attachment: patch.diff > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > {quote} > Actually, the job was cancelled. But the server also prints some exception: > > {quote}2020-0
[jira] [Updated] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-16626: -- Attachment: (was: jobmanager.log) > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > {quote} > Actually, the job was cancelled. But the server also prints some exception: >
[jira] [Commented] (FLINK-16637) Flink per job mode terminates before serving job cancellation result
[ https://issues.apache.org/jira/browse/FLINK-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069225#comment-17069225 ] Zili Chen commented on FLINK-16637: --- [~gjy] I checked the code and did some tests(see FLINK-16626). It seems we should be thread safe with {{InFlightRequestTracker}} but actually {{awaitAsync}} called twice so that we lost the synchronization. I verify a closed field in {{AtomicBoolean}} as the patch attach in FLINK-16626 solve the problem but it'd be better we dig out why {{awaitAsync}} called twice. cc [~trohrmann] > Flink per job mode terminates before serving job cancellation result > > > Key: FLINK-16637 > URL: https://issues.apache.org/jira/browse/FLINK-16637 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: Zili Chen >Priority: Major > Fix For: 1.10.1, 1.11.0 > > Attachments: yarn.log > > > The {{MiniDispatcher}} no longer waits until the REST handler has served the > cancellation result before shutting down. This behaviour seems to be > introduced with FLINK-15116. > See also > [https://lists.apache.org/x/thread.html/rcadbd6ceede422bac8d4483fd0cdae58659fbff78533a399eb136743@%3Cdev.flink.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16637) Flink per job mode terminates before serving job cancellation result
[ https://issues.apache.org/jira/browse/FLINK-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069226#comment-17069226 ] Zili Chen commented on FLINK-16637: --- I think I find the problem, we register the cancel handler twice so that we close it twice org/apache/flink/runtime/webmonitor/WebMonitorEndpoint.java:589 handlers.add(Tuple2.of(jobCancelTerminationHandler.getMessageHeaders(), jobCancelTerminationHandler)); org/apache/flink/runtime/webmonitor/WebMonitorEndpoint.java:597 handlers.add(Tuple2.of(YarnCancelJobTerminationHeaders.getInstance(), jobCancelTerminationHandler)); > Flink per job mode terminates before serving job cancellation result > > > Key: FLINK-16637 > URL: https://issues.apache.org/jira/browse/FLINK-16637 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: Zili Chen >Priority: Major > Fix For: 1.10.1, 1.11.0 > > Attachments: yarn.log > > > The {{MiniDispatcher}} no longer waits until the REST handler has served the > cancellation result before shutting down. This behaviour seems to be > introduced with FLINK-15116. > See also > [https://lists.apache.org/x/thread.html/rcadbd6ceede422bac8d4483fd0cdae58659fbff78533a399eb136743@%3Cdev.flink.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-16637) Flink per job mode terminates before serving job cancellation result
[ https://issues.apache.org/jira/browse/FLINK-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069226#comment-17069226 ] Zili Chen edited comment on FLINK-16637 at 3/28/20, 4:54 AM: - I think I find the problem, we register the cancel handler twice so that we close it twice, while the phaser guardian is {{1}} we escape the guardian before outstanding respond finished. org/apache/flink/runtime/webmonitor/WebMonitorEndpoint.java:589 handlers.add(Tuple2.of(jobCancelTerminationHandler.getMessageHeaders(), jobCancelTerminationHandler)); org/apache/flink/runtime/webmonitor/WebMonitorEndpoint.java:597 handlers.add(Tuple2.of(YarnCancelJobTerminationHeaders.getInstance(), jobCancelTerminationHandler)); was (Author: tison): I think I find the problem, we register the cancel handler twice so that we close it twice org/apache/flink/runtime/webmonitor/WebMonitorEndpoint.java:589 handlers.add(Tuple2.of(jobCancelTerminationHandler.getMessageHeaders(), jobCancelTerminationHandler)); org/apache/flink/runtime/webmonitor/WebMonitorEndpoint.java:597 handlers.add(Tuple2.of(YarnCancelJobTerminationHeaders.getInstance(), jobCancelTerminationHandler)); > Flink per job mode terminates before serving job cancellation result > > > Key: FLINK-16637 > URL: https://issues.apache.org/jira/browse/FLINK-16637 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: Zili Chen >Priority: Major > Fix For: 1.10.1, 1.11.0 > > Attachments: yarn.log > > > The {{MiniDispatcher}} no longer waits until the REST handler has served the > cancellation result before shutting down. This behaviour seems to be > introduced with FLINK-15116. > See also > [https://lists.apache.org/x/thread.html/rcadbd6ceede422bac8d4483fd0cdae58659fbff78533a399eb136743@%3Cdev.flink.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16637) Flink per job mode terminates before serving job cancellation result
[ https://issues.apache.org/jira/browse/FLINK-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069227#comment-17069227 ] Zili Chen commented on FLINK-16637: --- The solution should be straightforward: we ensure that {{AbstractHandler#closeAsync}} called at most once. Further investigation could try to figure out whether or not we can remove org/apache/flink/runtime/webmonitor/WebMonitorEndpoint.java:597 handlers.add(Tuple2.of(YarnCancelJobTerminationHeaders.getInstance(), jobCancelTerminationHandler)); as there is a 2 years' TODO > Flink per job mode terminates before serving job cancellation result > > > Key: FLINK-16637 > URL: https://issues.apache.org/jira/browse/FLINK-16637 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: Zili Chen >Priority: Major > Fix For: 1.10.1, 1.11.0 > > Attachments: yarn.log > > > The {{MiniDispatcher}} no longer waits until the REST handler has served the > cancellation result before shutting down. This behaviour seems to be > introduced with FLINK-15116. > See also > [https://lists.apache.org/x/thread.html/rcadbd6ceede422bac8d4483fd0cdae58659fbff78533a399eb136743@%3Cdev.flink.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069232#comment-17069232 ] Zili Chen commented on FLINK-16626: --- link to the analysis https://issues.apache.org/jira/browse/FLINK-16637?focusedCommentId=17063164&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17063164 https://issues.apache.org/jira/browse/FLINK-16637?focusedCommentId=17069225&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17069225 https://issues.apache.org/jira/browse/FLINK-16637?focusedCommentId=17069226&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17069226 https://issues.apache.org/jira/browse/FLINK-16637?focusedCommentId=17069227&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17069227 and closed FLINK-16637 as duplication > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) >
[jira] [Closed] (FLINK-16637) Flink per job mode terminates before serving job cancellation result
[ https://issues.apache.org/jira/browse/FLINK-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen closed FLINK-16637. - Fix Version/s: (was: 1.10.1) (was: 1.11.0) Resolution: Duplicate > Flink per job mode terminates before serving job cancellation result > > > Key: FLINK-16637 > URL: https://issues.apache.org/jira/browse/FLINK-16637 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: Zili Chen >Priority: Major > Attachments: yarn.log > > > The {{MiniDispatcher}} no longer waits until the REST handler has served the > cancellation result before shutting down. This behaviour seems to be > introduced with FLINK-15116. > See also > [https://lists.apache.org/x/thread.html/rcadbd6ceede422bac8d4483fd0cdae58659fbff78533a399eb136743@%3Cdev.flink.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-16626: -- Fix Version/s: 1.11.0 1.10.1 > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > {quote} > Actually, the job was cancelle
[jira] [Commented] (FLINK-16637) Flink per job mode terminates before serving job cancellation result
[ https://issues.apache.org/jira/browse/FLINK-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072312#comment-17072312 ] Zili Chen commented on FLINK-16637: --- [~gjy] >The solution should be straightforward: we ensure that >AbstractHandler#closeAsync called at most once. Make sense. >I am afraid that this is not possible since we are also supporting old Hadoop >versions. As [~fly_in_gis] comment on FLINK-16626 I got the point. Let's keep discussion in FLINK-16626 since this one closed as duplicated. > Flink per job mode terminates before serving job cancellation result > > > Key: FLINK-16637 > URL: https://issues.apache.org/jira/browse/FLINK-16637 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: Zili Chen >Priority: Major > Attachments: yarn.log > > > The {{MiniDispatcher}} no longer waits until the REST handler has served the > cancellation result before shutting down. This behaviour seems to be > introduced with FLINK-15116. > See also > [https://lists.apache.org/x/thread.html/rcadbd6ceede422bac8d4483fd0cdae58659fbff78533a399eb136743@%3Cdev.flink.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen updated FLINK-16626: -- Priority: Blocker (was: Major) > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Assignee: Weike Dong >Priority: Blocker > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > {quote} > Actually,
[jira] [Assigned] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen reassigned FLINK-16626: - Assignee: Weike Dong > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Assignee: Weike Dong >Priority: Major > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > {quote} > Actually, the jo
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072315#comment-17072315 ] Zili Chen commented on FLINK-16626: --- [~kyledong] Thanks for your interest. I've assigned to you. [~liyu] I raise this issue as blocker for 1.10.1 since it broke user interface({{flink cancel}}) and it's hopeful we fix it in days. Comment if you have any concern. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Assignee: Weike Dong >Priority: Blocker > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.Comple
[jira] [Commented] (FLINK-16915) Cannot find compatible factory for specified execution.target (=local)
[ https://issues.apache.org/jira/browse/FLINK-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072628#comment-17072628 ] Zili Chen commented on FLINK-16915: --- I think it is fixed by FLINK-16834, please take a look. > Cannot find compatible factory for specified execution.target (=local) > -- > > Key: FLINK-16915 > URL: https://issues.apache.org/jira/browse/FLINK-16915 > Project: Flink > Issue Type: Bug > Components: API / Core >Affects Versions: 1.11.0 >Reporter: Jaryzhen >Priority: Minor > > It occurred when I run from newly version. > {code:java} > Exception in thread "main" java.lang.NullPointerException: Cannot find > compatible factory for specified execution.target (=local) > at > org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:104) > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1750) > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1658) > at > org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:74) > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1644) > at > org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:684) > at > org.apache.flink.streaming.scala.examples.wordcount.WordCount$.main(WordCount.scala:89) > at > org.apache.flink.streaming.scala.examples.wordcount.WordCount.main(WordCount.scala) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-16737) Remove KubernetesUtils#getContentFromFile
[ https://issues.apache.org/jira/browse/FLINK-16737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen closed FLINK-16737. - Resolution: Fixed master(1.11) via 0622ef7cac7d4e882e6a0f9df137e06400053e93 > Remove KubernetesUtils#getContentFromFile > - > > Key: FLINK-16737 > URL: https://issues.apache.org/jira/browse/FLINK-16737 > Project: Flink > Issue Type: Improvement > Components: Deployment / Kubernetes >Reporter: Canbin Zheng >Assignee: Canbin Zheng >Priority: Trivial > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Since {{org.apache.flink.util.FileUtils}} has already provided some utilities > such as {{readFile}} or {{readFileUtf8}} for reading file contents, we can > remove the {{KubernetesUtils#getContentFromFile}} and use the {{FileUtils}} > tool instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16737) Remove KubernetesUtils#getContentFromFile
[ https://issues.apache.org/jira/browse/FLINK-16737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073330#comment-17073330 ] Zili Chen commented on FLINK-16737: --- [~csbliss] the next time you're interested in a ticket, you can ping me to help with assignment. > Remove KubernetesUtils#getContentFromFile > - > > Key: FLINK-16737 > URL: https://issues.apache.org/jira/browse/FLINK-16737 > Project: Flink > Issue Type: Improvement > Components: Deployment / Kubernetes >Reporter: Canbin Zheng >Assignee: Canbin Zheng >Priority: Trivial > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Since {{org.apache.flink.util.FileUtils}} has already provided some utilities > such as {{readFile}} or {{readFileUtf8}} for reading file contents, we can > remove the {{KubernetesUtils#getContentFromFile}} and use the {{FileUtils}} > tool instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-16749) Support to set node selector for JM/TM pod
[ https://issues.apache.org/jira/browse/FLINK-16749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen closed FLINK-16749. - Resolution: Fixed master(1.11) via 5b54a2580172a7f3c518211c1bfc5f987ebbadba > Support to set node selector for JM/TM pod > --- > > Key: FLINK-16749 > URL: https://issues.apache.org/jira/browse/FLINK-16749 > Project: Flink > Issue Type: Sub-task > Components: Deployment / Kubernetes >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 20m > Remaining Estimate: 0h > > The > [node-selector|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/] > is a collection of key/value pairs to constrain a pod to only be able to run > on particular node(s). Since affinity and anti-affinity are uncommon use case > for Flink, so we leave the support in pod template. > > {code:java} > public static final ConfigOption> > JOB_MANAGER_NODE_SELECTOR = > key("kubernetes.jobmanager.node-selector") > .mapType() > .noDefaultValue() > .withDescription("The node selector to be set for JobManager pod. Specified > as key:value pairs separated by " + > "commas. For example, environment:production,disk:ssd."); > public static final ConfigOption> > TASK_MANAGER_NODE_SELECTOR = > key("kubernetes.taskmanager.node-selector") > .mapType() > .noDefaultValue() > .withDescription("The node selector to be set for TaskManager pods. > Specified as key:value pairs separated by " + > "commas. For example, environment:production,disk:ssd."); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)