Hi Vishal,
you should not need to configure anything else.
Cheers,
Till
On Sat, Jun 30, 2018 at 7:23 PM Vishal Santoshi wrote:
A clarification: in 1.5, with the custom heartbeats, are there additional
configurations we should be concerned about?
On Fri, May 25, 2018 at 10:17 AM, Steven Wu wrote:
Till, thanks for the follow-up. Looking forward to 1.5 :)
On Fri, May 25, 2018 at 2:11 AM, Till Rohrmann wrote:
Hi Steven,
we don't have `jobmanager.exit-on-fatal-akka-error` because then the JM
would also be killed if a single TM gets quarantined. This is not a
desired behaviour either.
With Flink 1.5 the problem with quarantining should be gone, since we no
longer rely on Akka's death watch and instead use Flink's own heartbeat
mechanism.
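For reference, the heartbeats can be tuned via flink-conf.yaml; if I
remember the 1.5 defaults correctly, they are:

    # Flink 1.5+ heartbeat settings (defaults shown; verify against the docs)
    heartbeat.interval: 10000   # ms between heartbeat requests
    heartbeat.timeout: 50000    # ms without a heartbeat before the peer is considered lost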
Till,
thanks for the clarification. Yes, that situation is undesirable as well.
In our case, restarting the jobmanager could also recover the job from
the akka association lock-out. It was actually an issue (high GC pause)
on the jobmanager side that caused the akka failure.
Do we have something like `jobmanager.exit-on-fatal-akka-error`?
Hi Steven,
the reason why we did not turn this feature on by default was that in
case of a true JM failure, all of the TMs will think that they got
quarantined, which triggers their shutdown. Depending on how many
container restarts you have left on Yarn, for example, this can lead to
a situation where you exhaust your remaining restarts.
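If you still want that behaviour, you should be able to opt in
explicitly; as far as I know the switch is:

    # flink-conf.yaml: let a quarantined TaskManager terminate its own process
    # (disabled by default for the reasons above)
    taskmanager.exit-on-fatal-akka-error: true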
Till,
We ran into the same issue. It started with a high GC pause that caused
the jobmanager to lose its ZooKeeper connection and leadership, which in
turn caused the jobmanager to quarantine the taskmanager in akka. Once
quarantined, the akka association between jobmanager and taskmanager is
locked forever.
Your suggestion of `taskmanager.exit-on-fatal-akka-error`: is there a
reason it is not on by default?
@Jelmer, this is Till's last response on the issue.
-- Ashish
Hi Till,
Thanks for the detailed response. I will try to gather some of this information
during the week and follow up.
— Ashish
On Feb 5, 2018, at 5:55 AM, Till Rohrmann wrote:
Hi,
this sounds like a serious regression wrt Flink 1.3.2 and we should
definitely find out what's causing this problem. From what I see in the
logs, the following happens:
For some time the JobManager seems to no longer receive heartbeats from
the TaskManager. This could be, for example, due to long GC pauses or
heavy load on the TaskManager side.
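Until then, one mitigation worth trying is to make the death watch more
tolerant of GC pauses. If I recall the pre-1.5 option names correctly,
that would be:

    # flink-conf.yaml: legacy Akka death-watch tuning (Flink 1.3/1.4)
    akka.watch.heartbeat.interval: 10 s   # how often watch heartbeats are sent
    akka.watch.heartbeat.pause: 60 s      # max tolerated pause before quarantining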
I've seen a similar issue while running successive Flink SQL batches on
1.4. In my case, the Job Manager would fail with log output about
unreachability (with an additional statement about something going
"horribly wrong"). Under workload pressure, I reverted to 1.3.2, where
everything works perfectly.
I haven't gotten much further with this. It doesn't look GC related; at
least the GC counters were not that atrocious. However, my main concern
is: once the load subsides, why don't the TM and JM connect again? That
doesn't look normal. I could definitely tell the JM was listening on the
port.
Hi.
Did you find a reason for the detaching?
I sometimes see the same on our system running Flink 1.4 on DC/OS. I have
enabled `taskmanager.debug.memory.startLogThread` for debugging.
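In case anyone wants to try the same, I believe the relevant settings
are (names taken from the 1.4 docs, please double-check):

    # flink-conf.yaml: periodic TaskManager memory-usage logging
    taskmanager.debug.memory.startLogThread: true
    taskmanager.debug.memory.logIntervalMs: 5000   # log every 5 seconds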
Med venlig hilsen / Best regards
Lasse Nedergaard
On 20 Jan 2018 at 12:57, Kien Truong wrote:
Hi,
You should enable and check your garbage collection logs.
We've encountered cases where the Task Manager disassociated due to long
GC pauses.
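For example, something like this in flink-conf.yaml should do it (Java 8
HotSpot flags; the log path is just an example):

    env.java.opts: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/flink-gc.log"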
Regards,
Kien
Thanks for this message. We also experience a very similar issue under
heavy load. In the job manager logs we see AskTimeoutExceptions. This
typically correlates with almost 100% CPU in the task manager. Even if
the job is stopped, the task manager stays busy for minutes or even
hours, acting as if in `saturation`.
On 1/20/2018 1:27 AM, ashish pok wrote:
Hi All,
We have hit some load related issues and I was wondering if anyone has
some suggestions. We are noticing task managers and job managers being
detached from each other under load and never really syncing up again.
As a result, the Flink session shows 0 slots available for processing,
even though the task manager processes are still up and running.