[jira] [Comment Edited] (FLINK-19154) Application mode deletes HA data in case of suspended ZooKeeper connection

Till Rohrmann (Jira) Wed, 28 Oct 2020 01:57:25 -0700


    [ https://issues.apache.org/jira/browse/FLINK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222045#comment-17222045 ]


Till Rohrmann edited comment on FLINK-19154 at 10/28/20, 8:56 AM:
------------------------------------------------------------------

Thanks for trying this fix out [~casidiablo] and happy to hear that it solved 
your problem.

If you want to use Flink's snapshot artifacts, then you have to add

{code}
<repositories>
  <repository>
    <id>snapshot</id>
    <name>Apache Snapshot repository</name>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
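    <!-- Maven has no snapshotPolicy element; snapshot handling is configured under snapshots -->
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
    </snapshots>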
  </repository>
</repositories>
{code}

to your {{pom.xml}}.

If the user jar that you put into your image does not bundle Flink 
dependencies, then you should not need to recompile it (unless we introduced 
an incompatible change with Flink 1.12).




> Application mode deletes HA data in case of suspended ZooKeeper connection
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19154
>                 URL: https://issues.apache.org/jira/browse/FLINK-19154
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.12.0, 1.11.1
>         Environment: Run a stand-alone cluster that runs a single job (if you 
> are familiar with the way Ververica Platform runs Flink jobs, we use a very 
> similar approach). It runs Flink 1.11.1 straight from the official docker 
> image.
>            Reporter: Husky Zeng
>            Assignee: Kostas Kloudas
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.12.0, 1.11.3
>
>
> A user reported that Flink's application mode deletes HA data in case of a 
> suspended ZooKeeper connection [1]. 
> The problem seems to be that the {{ApplicationDispatcherBootstrap}} class 
> produces an exception (that the requested job can no longer be found because of 
> a lost ZooKeeper connection) which will be interpreted as a job failure. Due 
> to this interpretation, the cluster will be shut down with a terminal state 
> of FAILED which will cause the HA data to be cleaned up. The exact problem 
> occurs in the {{JobStatusPollingUtils.getJobResult}} which is called by 
> {{ApplicationDispatcherBootstrap.getJobResult()}}.
> The behaviour described above can be found in this log [2].
> [1] 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-metadata-deleted-by-Flink-after-ZK-connection-issues-td37937.html
> [2] https://pastebin.com/raw/uH9KDU2L
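
The failure mode described in the issue can be illustrated with a small, self-contained sketch. It does not use Flink's classes and is not the fix that was merged; {{JobLookupException}}, {{ClusterAction}}, and both decision methods are hypothetical names introduced only for illustration. The sketch assumes that a lookup error (for example, a job that cannot be found while the ZooKeeper connection is suspended) can be told apart from a genuine job failure.

```java
import java.util.concurrent.CompletionException;

public class TerminalStateSketch {

    /** Hypothetical outcomes for the cluster; not Flink's real types. */
    enum ClusterAction { SHUT_DOWN_AND_CLEAN_HA_DATA, KEEP_HA_DATA_AND_RETRY }

    /** Hypothetical marker: the job could not be looked up (e.g. ZooKeeper connection lost). */
    static class JobLookupException extends RuntimeException {
        JobLookupException(String message) { super(message); }
    }

    /** Hypothetical marker: the job ran and failed for real. */
    static class JobExecutionException extends RuntimeException {
        JobExecutionException(String message) { super(message); }
    }

    /** Behaviour described in the issue: any exception is treated as a job failure. */
    static ClusterAction naiveDecision(Throwable error) {
        return ClusterAction.SHUT_DOWN_AND_CLEAN_HA_DATA;
    }

    /** Alternative sketch: only a genuine job failure leads to HA data cleanup. */
    static ClusterAction guardedDecision(Throwable error) {
        Throwable cause = unwrap(error);
        if (cause instanceof JobLookupException) {
            return ClusterAction.KEEP_HA_DATA_AND_RETRY;
        }
        return ClusterAction.SHUT_DOWN_AND_CLEAN_HA_DATA;
    }

    /** Strips CompletionException wrappers added by CompletableFuture pipelines. */
    static Throwable unwrap(Throwable error) {
        Throwable current = error;
        while (current instanceof CompletionException && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        Throwable lostConnection = new CompletionException(
                new JobLookupException("job not found: ZooKeeper connection suspended"));
        Throwable realFailure = new CompletionException(
                new JobExecutionException("user code threw an exception"));

        System.out.println("naive, lost connection:   " + naiveDecision(lostConnection));
        System.out.println("guarded, lost connection: " + guardedDecision(lostConnection));
        System.out.println("guarded, real failure:    " + guardedDecision(realFailure));
    }
}
```

With the naive decision, a lost connection ends in the same cleanup path as a real failure, which matches the data loss reported above. The guarded variant keeps the HA data whenever the error only says that the job could not be found.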



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
