Great, thanks Klou!
Cheers,
Till
On Mon, Sep 28, 2020 at 5:07 PM Kostas Kloudas wrote:
> Hi all,
>
> I will have a look.
>
> Kostas
>
> On Mon, Sep 28, 2020 at 3:56 PM Till Rohrmann
> wrote:
> >
> > Hi Cristian,
> >
> > thanks for reporting this issue. It looks indeed like a very critical
> > problem.
Hi all,
I will have a look.
Kostas
On Mon, Sep 28, 2020 at 3:56 PM Till Rohrmann wrote:
>
> Hi Cristian,
>
> thanks for reporting this issue. It looks indeed like a very critical problem.
>
> The problem seems to be that the ApplicationDispatcherBootstrap class
> produces an exception (that th
Hi Cristian,
thanks for reporting this issue. It looks indeed like a very critical
problem.
The problem seems to be that the ApplicationDispatcherBootstrap class
produces an exception (that the requested job can no longer be found because
of a lost ZooKeeper connection) which will be interpreted as
> The job sub directory will be cleaned up when the job
finished/canceled/failed.
Since we could submit multiple jobs into a Flink session, what I mean is that
when a job reaches a terminal state, the sub node (e.g.
/flink/application_/running_job_registry/4d255397c7aeb5327adb567238c983c1)
on th
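A minimal sketch, using the plain ZooKeeper Java client, of how such a registry node can be inspected; the quorum address and the cluster id "application_0000" are placeholders, and the path layout assumes the default high-availability.zookeeper.path.root of /flink.

```java
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

import java.util.List;
import java.util.concurrent.CountDownLatch;

public class RunningJobRegistryCheck {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder quorum address; wait until the session is established.
        ZooKeeper zk = new ZooKeeper("zk-1:2181", 30_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Assumed layout: <path.root>/<cluster-id>/running_job_registry/<job-id>
        String registry = "/flink/application_0000/running_job_registry";
        List<String> jobIds = zk.getChildren(registry, false);
        for (String jobId : jobIds) {
            // Each child node tracks one job; it is removed once the job
            // reaches a globally terminal state and HA cleanup runs.
            System.out.println("registered job: " + jobId);
        }
        zk.close();
    }
}
```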
> The job sub directory will be cleaned up when the job
> finished/canceled/failed.
What does this mean?
Also, to clarify: I'm a very sloppy developer. My jobs crash ALL the time...
and yet, the jobs would ALWAYS resume from the last checkpoint.
The only cases where I expect Flink to clean u
AFAIK, the HA data, including ZooKeeper metadata and real data on DFS, will
only be cleaned up when the Flink cluster has reached a terminated state.
So if you are using a session cluster, the root cluster node on ZK will be
cleaned up after you manually stop the session cluster. The job sub directory
will be cleaned up when the job finished/canceled/failed.
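A minimal sketch of the HA options that determine where these ZooKeeper nodes and the DFS data end up; the quorum address, cluster id, and storage path are placeholder values, set programmatically here only for illustration.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HighAvailabilityOptions;

public class HaLayoutSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
        // Placeholder quorum address.
        conf.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, "zk-1:2181,zk-2:2181,zk-3:2181");
        // Root node shared by all clusters (default is /flink).
        conf.setString(HighAvailabilityOptions.HA_ZOOKEEPER_ROOT, "/flink");
        // Per-cluster sub node; running_job_registry lives underneath it.
        conf.setString(HighAvailabilityOptions.HA_CLUSTER_ID, "application_0000");
        // "Real data" (JobGraphs, checkpoint metadata) goes to DFS; ZooKeeper
        // only stores pointers to it.
        conf.setString(HighAvailabilityOptions.HA_STORAGE_PATH, "hdfs:///flink/ha");
        System.out.println(conf);
    }
}
```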
I'm using the standalone script to start the cluster.
As far as I can tell, it's not easy to reproduce. We found that zookeeper lost
a node around the time this happened, but all of our other 75 Flink jobs which
use the same setup, version and zookeeper, didn't have any issues. They didn't
eve
Thanks a lot for reporting this problem here Cristian!
I am not super familiar with the involved components, but the behavior you
are describing doesn't sound right to me.
Which entrypoint are you using? This is logged at the beginning, like this:
"2020-09-08 14:45:32,807 INFO
org.apache.flink.ru
Hi Cristian,
I don't know if it was deliberately designed to be like this.
So I have already submitted an issue, and am waiting for somebody to respond.
https://issues.apache.org/jira/browse/FLINK-19154
--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
That's an excellent question. I can't explain that. All I know is this:
- The job was upgraded and resumed from a savepoint
- After hours of working fine, it failed (like it shows in the logs)
- The metadata was cleaned up, again as shown in the logs
- Because I run this in Kubernetes, the conta
I mean that checkpoints are usually dropped after the job is terminated by
the user (except if explicitly configured as retained checkpoints). You
could use "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION" to keep
your checkpoint when it comes to a failure.
When your zookeeper lost connect
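A minimal sketch of enabling retained (externalized) checkpoints with the Flink 1.11-era API; the checkpoint interval is a placeholder and the actual job topology is omitted.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointsSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // placeholder: checkpoint every 60 seconds
        // Keep the checkpoint data on cancellation/failure so the job can be
        // restarted from it manually even if the HA metadata is gone.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // ... define sources/operators/sinks and call env.execute() here ...
    }
}
```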
> If you want to save your checkpoint, you could refer to this document
What do you mean? We already persist our savepoints, and we do not delete them
explicitly ever.
The problem is that Flink deleted the data from zookeeper when it shouldn't
have. Is it possible to start a job from a checkpo
Hi Cristian,
From this code, we can see that the Exception or Error was ignored in
dispatcher.shutDownCluster(applicationStatus):
```java
org.apache.flink.runtime.dispatcher.DispatcherGateway#shutDownCluster
return applicationCompletionFuture
    .handle((r, t) -> {
```
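A generic sketch, not the actual Flink source, of why a handle() callback that ignores its Throwable argument turns a failed future into a normally completed one, so the downstream "clean" shutdown path runs anyway.

```java
import java.util.concurrent.CompletableFuture;

public class SwallowedExceptionSketch {
    public static void main(String[] args) {
        CompletableFuture<Void> applicationCompletionFuture = new CompletableFuture<>();
        // Simulate the failure described in the thread: the job result cannot
        // be retrieved because the ZooKeeper connection was lost.
        applicationCompletionFuture.completeExceptionally(
                new IllegalStateException("job not found: ZooKeeper connection lost"));

        applicationCompletionFuture
                // handle() runs for success AND failure; if 't' is dropped here,
                // downstream stages only ever see a normally completed future.
                .handle((r, t) -> {
                    if (t != null) {
                        System.out.println("swallowed: " + t.getMessage());
                    }
                    return null;
                })
                // So the "job finished normally" shutdown path (which cleans up
                // the HA/ZooKeeper metadata) is taken even though the job failed.
                .thenRun(() -> System.out.println("running normal shutdown + HA cleanup"));
    }
}
```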
My suspicion is that somewhere in the path where it fails to connect to
ZooKeeper, the exception is swallowed, so instead of running the shutdown path
for when the job fails, the general shutdown path is taken.
This was fortunately a job for which we had a savepoint from yesterday.
Otherwise
Hi Cristian,
In the log, we can see it went to the method
shutDownAsync(applicationStatus, null, true):
```
2020-09-04 17:32:07,950 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting
StandaloneApplicationClusterEntryPoint down w
```
Hello guys.
We run a stand-alone cluster that runs a single job (if you are familiar with
the way Ververica Platform runs Flink jobs, we use a very similar approach). It
runs Flink 1.11.1 straight from the official docker image.
Usually, when our jobs crash for any reason, they will resume from