I think i found the issue. I'd just like to verify that my reasoning is correct
We had the following keys in our flink-conf.yaml jobmanager.web.address: localhost jobmanager.web.port: 8081 This worked on flink 1.3.2 But on flink 1.4.0 this check https://github.com/apache/flink/blob/32770103253e01cd61c8634378cfa1b26707e19a/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/util/HandlerRedirectUtils.java#L62 Will make make it so that both master and standby think that they don't need to perform a redirect. Which means that the standby node will serve web traffic. I am assuming that it is intended that this never happens. (because if will call remote actor systems) so this class not being serializable is not a bug On 16 January 2018 at 14:51, Till Rohrmann <trohrm...@apache.org> wrote: > Hi, > > this indeed indicates that a REST handler is requesting the ExecutionGraph > from a JobManager which does not run in the same ActorSystem. Could you > please tell us the exact HA setup. Are your running Flink on Yarn with HA > or do you use standalone HA with standby JobManagers? > > It would be really helpful if you could also share the logs with us. > > Cheers, > Till > > On Tue, Jan 16, 2018 at 10:20 AM, Nico Kruber <n...@data-artisans.com> > wrote: > >> IMHO, this looks like a bug and it makes sense that you only see this >> with an HA setup: >> >> The JobFound message contains the ExecutionGraph which, however, does >> not implement the Serializable interface. Without HA, when browsing the >> web interface, this message is (probably) not serialized since it is >> only served to you via HTML. For HA, this may come from another >> JobManager than the Web interface you are browsing. >> I'm including Till (cc'd) as he might know more. >> >> >> Nico >> >> On 16/01/18 09:22, jelmer wrote: >> > HI, >> > >> > We recently upgraded our test environment to from flink 1.3.2 to flink >> > 1.4.0. >> > >> > We are using a high availability setup on the job manager. And now often >> > when I go to the job details in the web ui the call will timeout and the >> > following error will pop up in the job manager log >> > >> > >> > akka.remote.MessageSerializer$SerializationException: Failed to >> > serialize remote message [class >> > org.apache.flink.runtime.messages.JobManagerMessages$JobFound] using >> > serializer [class akka.serialization.JavaSerializer]. >> > at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:61) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at >> > akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply >> (Endpoint.scala:889) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at >> > akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply >> (Endpoint.scala:889) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:888) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.remote.EndpointWriter.writeSend(Endpoint.scala:780) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.remote.EndpointWriter$$anonfun$4.applyOrElse(Endpoint. >> scala:755) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:502) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) >> > [flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.actor.ActorCell.invoke(ActorCell.scala:495) >> > [flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) >> > [flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.dispatch.Mailbox.run(Mailbox.scala:224) >> > [flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.dispatch.Mailbox.exec(Mailbox.scala:234) >> > [flink-dist_2.11-1.4.0.jar:1.4.0] >> > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> > [flink-dist_2.11-1.4.0.jar:1.4.0] >> > at >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask( >> ForkJoinPool.java:1339) >> > [flink-dist_2.11-1.4.0.jar:1.4.0] >> > at >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPoo >> l.java:1979) >> > [flink-dist_2.11-1.4.0.jar:1.4.0] >> > at >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinW >> orkerThread.java:107) >> > [flink-dist_2.11-1.4.0.jar:1.4.0] >> > Caused by: java.io.NotSerializableException: >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream. >> java:1184) >> > ~[na:1.8.0_131] >> > at >> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputSt >> ream.java:1548) >> > ~[na:1.8.0_131] >> > at >> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStrea >> m.java:1509) >> > ~[na:1.8.0_131] >> > at >> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputS >> tream.java:1432) >> > ~[na:1.8.0_131] >> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream. >> java:1178) >> > ~[na:1.8.0_131] >> > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) >> > ~[na:1.8.0_131] >> > at >> > akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$ >> mcV$sp(Serializer.scala:321) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at >> > akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply( >> Serializer.scala:321) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at >> > akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply( >> Serializer.scala:321) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.serialization.JavaSerializer.toBinary(Serializer.scala:321) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:47) >> > ~[flink-dist_2.11-1.4.0.jar:1.4.0] >> > ... 17 common frames omitted >> > >> > >> > >> > I isolated it further, and it seems to be triggered by this call >> > >> > https://hostname/jobs/28076fffbcf7eab3f17900a54cc7c41d >> > >> > I cannot reproduce it on my local lapop without HA setup. >> > Before I dig any deeper, has anyone already come across this ? >> >> >