Hello guys. Happy new year!

Context: we started having trouble with the UI after bumping our Flink 
version from 1.4 to 1.6.3. The UI could not render the Job details page, so 
inspecting jobs became impossible for us with the new version.

It looks like we have a workaround for our UI issue.
After some investigation we realized that, starting from Flink 1.5, we were 
hitting a timeout on the actor call restfulGateway.requestJob(jobId, timeout) 
in ExecutionGraphCache. We increased the web.timeout parameter and the 
timeout exceptions on the JobManager side stopped.

Also, in SingleJobController on the AngularJS side we needed to tweak 
web.refresh-interval to ensure that the front end waits for the back-end 
request to finish. Otherwise the AngularJS side can issue another request in 
SingleJobController and, for reasons we do not yet understand, the UI is not 
updated when the older request finishes. We will take a closer look at this 
behavior.
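
As a rough sketch, both settings go in flink-conf.yaml. The values below are 
illustrative assumptions rather than tuned recommendations; both are given in 
milliseconds.

```yaml
# flink-conf.yaml
# Illustrative values only; adjust to how long the job details request takes.

# Timeout for asynchronous operations of the web monitor (default: 10000).
web.timeout: 60000

# How often the web front end polls for updates (default: 3000).
web.refresh-interval: 30000
```

The idea is to keep web.refresh-interval large enough that the front end does 
not fire a new request while the previous one is still pending.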

Does this ring a bell for anyone?

Thank you

Kind Regards
Oleksandr

From: Till Rohrmann <trohrm...@apache.org>
Date: Wednesday 19 December 2018 at 16:52
To: Juan Gentile <j.gent...@criteo.com>
Cc: "dwysakow...@apache.org" <dwysakow...@apache.org>, Jeff Bean 
<j...@data-artisans.com>, Oleksandr Nitavskyi <o.nitavs...@criteo.com>
Subject: Re: 1.6 UI issues

Hi Juan,

thanks for the log. The log file does not contain anything suspicious. Are you 
sure that you sent me the right file? The timestamps don't seem to match. In 
the attached log, the job seems to run without problems.

Cheers,
Till

On Wed, Dec 19, 2018 at 10:26 AM Juan Gentile <j.gent...@criteo.com> wrote:

Hello Till, Dawid
Sorry for the late response on this issue and thank you Jeff for helping us 
with this.
Yes, we are using 1.6.2.
I am attaching the logs from the Job Master.
We also noticed these exceptions:
2018-12-19 08:50:10,497 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   - Implementation error: Unhandled exception.
java.util.concurrent.CancellationException
    at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2263)
    at org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.getExecutionGraph(ExecutionGraphCache.java:124)
    at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.handleRequest(AbstractExecutionGraphHandler.java:76)
    at org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:78)
    at org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:154)
    at org.apache.flink.runtime.rest.handler.RedirectHandler.lambda$null$0(RedirectHandler.java:142)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
2018-12-19 08:50:17,977 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Implementation error: Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)

To address this we tested with the parameter -Dakka.ask.timeout=60s, but the 
issue remains.

Thank you
Juan

From: Till Rohrmann <trohrm...@apache.org>
Date: Thursday, 8 November 2018 at 16:06
To: "dwysakow...@apache.org" <dwysakow...@apache.org>
Cc: Juan Gentile <j.gent...@criteo.com>, "myas...@live.com" 
<myas...@live.com>, user <user@flink.apache.org>
Subject: Re: 1.6 UI issues

Hi Juan,

could you share the cluster entrypoint logs with us? They should contain more 
information about the internal server error.

Just to make sure, you are using Flink 1.6.2, right?

Cheers,
Till

On Thu, Nov 8, 2018 at 3:29 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

Hi Juan,

It doesn't look similar to the issue linked to me. What cluster setup are you 
using? Are you running HA mode?

I am adding Till to cc, who might be able to help you more.

Best,

Dawid
On 02/11/2018 17:26, Juan Gentile wrote:
Hello Yun,

We haven’t seen the error you mentioned in the log. We also checked the GC 
and it seems to be okay. Inspecting the UI, we found the following error:

{"errors":["Could not retrieve the redirect address of the current leader. 
Please try to refresh."]}

We suspect we are running into the same issue as described here 
(http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/akka-timeout-td14996.html) 
but we are not so sure.

Have you encountered this issue before?

Thank you,

From: Yun Tang <myas...@live.com>
Date: Thursday, 1 November 2018 at 12:31
To: Juan Gentile <j.gent...@criteo.com>, "user@flink.apache.org" 
<user@flink.apache.org>
Subject: Re: 1.6 UI issues

Hi Juan

From our experience, you could first check the jobmanager.log to see whether 
it contains log lines similar to the one below:

max allowed size 128000 bytes, actual size of encoded class 
akka.actor.Status$Success was xxx bytes



If you see such log lines, you should increase akka.framesize to a larger 
value (the default value is '10485760b') [1].
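
If the frame size does need raising, it is set in flink-conf.yaml. A minimal 
sketch follows; the value is an illustrative assumption (twice the default 
quoted above), not a recommendation.

```yaml
# flink-conf.yaml
# Illustrative value only: 20971520b is twice the default of 10485760b.
akka.framesize: 20971520b
```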



Otherwise, you could check the GC log of the job manager to see whether the 
GC overhead is too heavy; if so, consider increasing the memory for your job 
manager.
[1] 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka

Best
Yun Tang

________________________________
From: Juan Gentile <j.gent...@criteo.com>
Sent: Wednesday, October 31, 2018 22:05
To: user@flink.apache.org
Subject: 1.6 UI issues


Hello!



We are migrating to the latest 1.6 version and all the jobs seem to work fine, 
but when we check individual jobs through the web interface, clicking on a job 
either takes too long to load the job's information or never loads it at all.



Has anyone had this issue? Any clues as to why?



Thank you,

Juan
