Hi Yang,
Thanks a lot for the information!
Eleanore
On Thu, Aug 6, 2020 at 4:20 AM Yang Wang wrote:
> Hi Eleanore,
>
> From my experience, collecting the Flink metrics into Prometheus via a
> metrics collector is the better approach. It also makes it easier to
> configure alerts.
> Maybe you could use "
Hi Eleanore,
From my experience, collecting the Flink metrics into Prometheus via a
metrics collector is the better approach. It also makes it easier to
configure alerts.
Maybe you could use "fullRestarts" or "numRestarts" to monitor the job
restarting. More metrics can be found here[2].
[1].
https://ci.
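For reference, a minimal flink-conf.yaml sketch of the Prometheus reporter setup Yang describes; the reporter name "prom" is arbitrary and the port is the reporter's documented default, so adjust both to your environment (on Flink 1.8 the flink-metrics-prometheus jar also has to be copied from opt/ to lib/):

    metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
    metrics.reporter.prom.port: 9249

With this in place, each jobmanager and taskmanager exposes a scrape endpoint on that port, and job-scoped restart metrics such as fullRestarts typically show up in Prometheus as flink_jobmanager_job_fullRestarts (the exact exported name depends on your scope formats).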
Hi Yang and Till,
Thanks a lot for the help! I have a similar question to the one Till
mentioned: if we do not fail Flink pods when the restart strategy is
exhausted, it might be hard to monitor such failures. Today I get alerts if
the k8s pods are restarted or in a crash loop, but if this will no longer
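One way to keep alerting even if the pods themselves no longer crash is to alert on the Flink restart metric instead of on pod status. A sketch of a Prometheus alerting rule, assuming the metric is exported as flink_jobmanager_job_fullRestarts and that the reporter attaches a job_name label (both depend on your reporter and scope configuration):

    groups:
      - name: flink
        rules:
          - alert: FlinkJobRestarting
            expr: increase(flink_jobmanager_job_fullRestarts[5m]) > 0
            labels:
              severity: warning
            annotations:
              summary: "Flink job {{ $labels.job_name }} restarted in the last 5 minutes"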
You are right, Yang Wang.
Thanks for creating this issue.
Cheers,
Till
On Wed, Aug 5, 2020 at 1:33 PM Yang Wang wrote:
> Actually, the application status shown in the YARN web UI is not determined
> by the jobmanager process exit code.
> Instead, we use "resourceManagerClient.unregisterApplicationMaster"
Actually, the application status shown in the YARN web UI is not determined
by the jobmanager process exit code.
Instead, we use "resourceManagerClient.unregisterApplicationMaster" to
control the final status of the YARN application.
So although the jobmanager exits with a zero code, it could still show a
failed status
Yes, for the other deployments it is not a problem. A reason why people
preferred non-zero exit codes in case of FAILED jobs is that this is easier
to monitor than having to take a look at the actual job result. Moreover,
in the YARN web UI the application shows as failed, if I am not mistaken.
However
Hi Eleanore,
Yes, I suggest using a Job to replace the Deployment. It can be used to run
the jobmanager once and finish after a successful/failed completion.
However, using a Job still cannot solve your problem completely. As Till
said, when a job exhausts the restart strategy, the jobmanager
Hi Yang & Till,
Thanks for your prompt reply!
Yang, regarding your question, I am actually not using a k8s Job, as I put
my app.jar and its dependencies under Flink's lib directory. I have one k8s
Deployment for the job manager, one k8s Deployment for the task manager,
and one k8s Service for the job manager.
A
@Till Rohrmann In native mode, when a Flink application terminates in the
FAILED state, all the resources will be cleaned up.
However, in standalone mode, I agree with you that we need to rethink the
exit code of Flink. When a job exhausts the restart strategy, we should
terminate the pod and do no
@Yang Wang I believe that we should rethink the
exit codes of Flink. In general you want K8s to restart a failed Flink
process. Hence, an application which terminates in state FAILED should not
return a non-zero exit code because it is a valid termination state.
Cheers,
Till
On Tue, Aug 4, 2020
Hi Eleanore,
I think you are using the K8s resource "Job" to deploy the jobmanager.
Please set .spec.template.spec.restartPolicy = "Never" and
spec.backoffLimit = 0. Refer to [1] for more information.
Then, when the jobmanager fails for any reason, the K8s Job will be marked
as failed. And K8s
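For concreteness, a minimal sketch of the Job manifest Yang describes; the name, image, and args below are placeholders, not from this thread:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: flink-jobmanager
    spec:
      backoffLimit: 0            # do not let K8s retry a failed jobmanager pod
      template:
        spec:
          restartPolicy: Never   # a failed pod marks the whole Job as failed
          containers:
            - name: jobmanager
              image: my-flink-job:1.8.2   # hypothetical custom image with the job jar under /flink/lib
              args: ["job-cluster", "--job-classname", "com.example.MyJob"]   # assumption: job-cluster entrypoint of the flink-container image

With restartPolicy: Never and backoffLimit: 0, a jobmanager process that exits with a non-zero code immediately moves the Job to the Failed state, which is easy to alert on; whether the process actually exits non-zero when the Flink job ends in FAILED is exactly the exit-code question discussed above.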
Hi Till,
Thanks for the reply!
I manually deploy in per-job mode [1] and I am using Flink 1.8.2.
Specifically, I build a custom Docker image into which I copy the app jar
(not an uber jar) and all its dependencies under /flink/lib.
So my question is more like: in this case, if the job is marked as FAILED
Hi Eleanore,
how are you deploying Flink exactly? Are you using the application mode
with native K8s support to deploy a cluster [1], or are you manually
deploying in per-job mode [2]?
I believe the problem might be that we terminate the Flink process with a
non-zero exit code if the job reaches th
Hi Experts,
I have a Flink cluster (per-job mode) running on Kubernetes. The job is
configured with the restart strategy
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
So after 3 retries, the job will be marked as FAILED, and hence the pods
are not running. However