Hi Gyula,

Is there any news on this?

@Nico or @Gary, you have both worked with YARN recently. Do you have any idea
what could be going on?

Best,
Aljoscha

> On 21. Nov 2017, at 06:42, Gyula Fóra <gyula.f...@gmail.com> wrote:
> 
> Hi all!
> 
> Today we started noticing that deploying our jobs takes over 3 minutes when
> deployed from some machines, but the normal few seconds when deployed from
> the others.
> 
> Looking at the logs, it seems that the client can't find the job ID for a
> few minutes in this case:
> 
> ...
> 2017-11-21 15:23:00,880 DEBUG org.apache.flink.yarn.YarnJobManager - Job with ID 179d67bfab7c4c0b9f00ea772f6e4f0c not found in JobManager
> 2017-11-21 15:23:04,528 DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x25eb8e005b7971b after 0ms
> 2017-11-21 15:23:04,636 DEBUG org.apache.hadoop.ipc.Client - IPC Client (937277082) connection to splat13.sto.midasplayer.com/172.26.87.155:8030 from splat sending #38
> 2017-11-21 15:23:04,636 DEBUG org.apache.hadoop.ipc.Client - IPC Client (937277082) connection to splat13.sto.midasplayer.com/172.26.87.155:8030 from splat got value #38
> 2017-11-21 15:23:04,651 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: allocate took 16ms
> 2017-11-21 15:23:05,880 DEBUG org.apache.flink.yarn.YarnJobManager - Job with ID 179d67bfab7c4c0b9f00ea772f6e4f0c not found in JobManager
> 2017-11-21 15:23:06,409 DEBUG akka.remote.RemoteWatcher - Sending Heartbeat to [akka.tcp://fl...@splat33.sto.midasplayer.com:56045]
> 2017-11-21 15:23:06,413 DEBUG akka.remote.RemoteWatcher - Received heartbeat rsp from [akka.tcp://fl...@splat33.sto.midasplayer.com:56045]
> 2017-11-21 15:23:07,665 DEBUG akka.serialization.Serialization(akka://flink) - Using serializer[akka.serialization.JavaSerializer] for message [org.apache.flink.runtime.clusterframework.messages.GetClusterStatusResponse]
> 2017-11-21 15:23:07,824 INFO  org.apache.flink.yarn.YarnJobManager - Submitting job 179d67bfab7c4c0b9f00ea772f6e4f0c (event-bifrost-log).
> 2017
> 
> Interestingly enough, nothing like this shows up when deploying from the
> other servers.
> We suspect there might be some strange network issue (which doesn't seem to
> affect jar upload times) that interferes with Akka in some way.
> 
> Any idea how to debug this?
> Thank you!
> 
> Gyula
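
One way to test the network theory could be to time name resolution and a
plain TCP connect from a slow machine and from a fast one, then compare.
The following is only a rough sketch: the helper names (time_call, probe)
are made up for illustration, and it assumes that slow forward or reverse
DNS, or a slow connect, is what delays the client. That may well not be the
cause here.

```python
import socket
import time


def time_call(fn, *args):
    """Return (elapsed_seconds, result_or_exception) for fn(*args)."""
    start = time.perf_counter()
    try:
        result = fn(*args)
    except OSError as exc:  # covers gaierror, herror and timeouts
        result = exc
    return time.perf_counter() - start, result


def probe(host, port, timeout=5.0):
    """Time forward DNS, reverse DNS and a TCP connect to host:port.

    Hypothetical helper: it only measures latency from this machine,
    it does not talk to Flink or Akka in any way.
    """
    timings = {}
    timings["forward_dns"], addr = time_call(socket.gethostbyname, host)
    if isinstance(addr, OSError):
        return timings  # name did not resolve, nothing more to measure

    timings["reverse_dns"], _ = time_call(socket.gethostbyaddr, addr)

    def connect():
        # Open and immediately close a TCP connection.
        with socket.create_connection((addr, port), timeout=timeout):
            pass

    timings["tcp_connect"], _ = time_call(connect)
    return timings
```

Calling probe(host, port) with the JobManager host and its RPC port on both
kinds of machines and printing the returned dictionary should show whether
any of the three steps takes seconds on one machine and milliseconds on the
other.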
