OK, this happened again and it is bizarre (and is definitely not what I think should happen).
The job failed and I see these logs; in essence it is keeping the last 5 externalized checkpoints but deleting the ZK checkpoints directory:
2019-06-29 00:33:13,73
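For reference, this is roughly how the checkpoint retention being described is normally configured. This is only a minimal sketch with an illustrative checkpoint interval and an illustrative checkpoint directory, not the actual code or settings of the job above:

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointRetentionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (illustrative interval, not taken from the job above).
        env.enableCheckpointing(60_000L);

        // Keep the externalized checkpoint around when the job is cancelled,
        // so it can be used to manually restore the job later.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // How many completed checkpoints are kept (the "last 5" above) is a
        // cluster setting in flink-conf.yaml, e.g.:
        //   state.checkpoints.num-retained: 5
        //   state.checkpoints.dir: hdfs:///flink/checkpoints   (illustrative path)

        // ... the actual pipeline and env.execute(...) would go here.
    }
}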
Hi Vishal, can this be reproduced on a bare metal instance as well? Also
is this a job or a session cluster?
Thanks
Tim
I have not tried on bare metal. We have no option but k8s.
And this is a job cluster.
Another point: the JM had terminated. The policy on K8s for the Job Cluster is:
spec:
  restartPolicy: OnFailure
2019-06-29 00:33:14,308 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1443.
So there are 2 scenarios:
1. If the JM goes down (exits) and k8s relaunches the Job Cluster (the JM), it uses the exact command the JM was brought up with in the first place.
2. If the pipe is restarted for any other reason by the JM (the JM has not exited but *Job Kafka-to-HDFS (
This is slightly off topic, so I'm changing the subject to not conflate the
original issue you brought up. But do we know why JM crashed in the first
place?
We are also thinking of moving to K8s, but to be honest we had tons of
stability issues in our first rodeo. That could just be our lack of
We are investigating that. But is the above theory plausible (Flink gurus), even though this, as in forcing restartPolicy: Never, pretty much nullifies HA on the JM if it is a Job cluster (at least on k8s)?
As for the reason, we are investigating that.
One thing we are looking at is the QoS (
https://kub
This is strange, the retry strategy was 20 times with a 4 minute delay. This job tried once (we had a Hadoop Name Node hiccup) but I think it could not even get to the NN and gave up (as in, it did not retry the next 19 times)
2019-06-29 00:33:13,680 INFO  org.apache.flink.runtime.executiongraph.E
even though
Max. number of execution retries: Restart with fixed delay (240000 ms). #20 restart attempts.
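For context, a fixed-delay restart strategy like the one described (20 attempts, 4 minute delay) is normally declared along these lines. This is only a sketch of the setting itself, not the code of the failing job:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Fixed-delay restarts: up to 20 attempts, waiting 4 minutes between them.
        // 4 minutes is 240000 ms, which is what the web UI reports as
        // "Restart with fixed delay (240000 ms). #20 restart attempts."
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(20, Time.minutes(4)));

        // The equivalent flink-conf.yaml settings would be:
        //   restart-strategy: fixed-delay
        //   restart-strategy.fixed-delay.attempts: 20
        //   restart-strategy.fixed-delay.delay: 4 min

        // ... the actual pipeline and env.execute(...) would go here.
    }
}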
Hi Flink experts,
I am prototyping a real-time system that reads from a Kafka source with Flink and calls out to an external system as part of the event processing. One of the most important requirements is that reads from Kafka should NEVER stall, even in the face of slowness in some async external calls while
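One common way to keep the Kafka source from blocking on slow external calls is Flink's async I/O operator. Below is a minimal sketch under assumed types (a String-typed stream and a hypothetical callExternalSystem helper); note that if the in-flight capacity fills up, backpressure still reaches the source, so the timeout and capacity need to be sized for the worst case:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncLookupSketch {

    // Hypothetical enrichment step: the external call is completed on another
    // thread, so the operator's task thread (and the Kafka source upstream)
    // is not blocked while the call is in flight.
    public static class ExternalLookupFunction extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String event, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> callExternalSystem(event))
                    .thenAccept(response ->
                            resultFuture.complete(Collections.singleton(response)));
        }

        private String callExternalSystem(String event) {
            // Placeholder for the real external call (HTTP, RPC, ...).
            return event + "-enriched";
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the Kafka source from the original question.
        DataStream<String> events = env.fromElements("a", "b", "c");

        // unorderedWait emits results as the calls complete; up to 100 calls may
        // be in flight at once. If that capacity fills up, backpressure still
        // propagates to the source.
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
                events,
                new ExternalLookupFunction(),
                30, TimeUnit.SECONDS,   // per-request timeout (illustrative)
                100);                   // max in-flight requests (illustrative)

        enriched.print();
        env.execute("async-lookup-sketch");
    }
}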