Re: Standalone cluster instability

Alexander Smirnov Wed, 21 Mar 2018 08:01:24 -0700

One more question - I see a lot of lines like the following in the logs:

[2018-03-21 00:30:35,975] ERROR Association to [akka.tcp://
fl...@qafdsflinkw811.nn.five9lab.com:35320] with UID [1500204560]
irrecoverably failed. Quarantining address. (akka.remote.Remoting)
[2018-03-21 00:34:15,208] WARN Association to [akka.tcp://
fl...@qafdsflinkw811.nn.five9lab.com:41068] with unknown UID is
irrecoverably failed. Address cannot be quarantined without knowing the
UID, gating instead for 5000 ms. (akka.remote.Remoting)
[2018-03-21 00:34:15,235] WARN Association to [akka.tcp://
fl...@qafdsflinkw811.nn.five9lab.com:40677] with unknown UID is
irrecoverably failed. Address cannot be quarantined without knowing the
UID, gating instead for 5000 ms. (akka.remote.Remoting)
[2018-03-21 00:34:15,256] WARN Association to [akka.tcp://
fl...@qafdsflinkw811.nn.five9lab.com:40382] with unknown UID is
irrecoverably failed. Address cannot be quarantined without knowing the
UID, gating instead for 5000 ms. (akka.remote.Remoting)
[2018-03-21 00:34:15,256] WARN Association to [akka.tcp://
fl...@qafdsflinkw811.nn.five9lab.com:44744] with unknown UID is
irrecoverably failed. Address cannot be quarantined without knowing the
UID, gating instead for 5000 ms. (akka.remote.Remoting)
[2018-03-21 00:34:15,266] WARN Association to [akka.tcp://
fl...@qafdsflinkw811.nn.five9lab.com:42413] with unknown UID is
irrecoverably failed. Address cannot be quarantined without knowing the
UID, gating instead for 5000 ms. (akka.remote.Remoting)



The host is available, but I don't understand where these port numbers come
from. The Task Manager uses a different port (the one printed in its log on
startup). Could you please help me understand why this happens?
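
To see which remote ports are involved and how often each one shows up, a
small script along these lines could tally the association messages quoted
above. This is only a sketch: the function name count_ports and the regular
expression are my own assumptions, and the pattern assumes each log entry is
on a single line (the entries above are wrapped by the mail client).

```python
import re
from collections import Counter

# Matches the remote address in Akka association messages, i.e. text of the
# form "akka.tcp://<system>@<host>:<port>". The actor system name is matched
# generically, so it does not need to be known in advance.
ADDRESS = re.compile(r"akka\.tcp://[^@\]\s]+@([^:\]\s]+):(\d+)")


def count_ports(lines):
    """Count the log lines that mention each (host, port) pair.

    A line that repeats the same address (as the "Association failed with"
    messages do) is counted only once for that address.
    """
    counts = Counter()
    for line in lines:
        for host, port in set(ADDRESS.findall(line)):
            counts[(host, int(port))] += 1
    return counts
```

Passing an open log file to count_ports and printing the result of
most_common() lists the addresses by frequency, which can then be compared
with the port each Task Manager prints on startup.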

Thank you,
Alex


On Wed, Mar 21, 2018 at 4:19 PM Alexander Smirnov <
alexander.smirn...@gmail.com> wrote:

> Hello,
>
> I've assembled a standalone cluster of 3 task managers and 3 job
> managers (and 3 ZooKeeper nodes) following the instructions at
>
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/cluster_setup.html
>  and
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/jobmanager_high_availability.html
>
> It works OK, but task managers randomly become unavailable. The JobManager
> has exceptions like the ones below in its logs:
>
>
> [2018-03-19 00:33:10,211] WARN Association with remote system [akka.tcp://
> fl...@qafdsflinkw811.nn.five9lab.com:42413] has failed, address is now
> gated for [5000] ms. Reason: [Association failed with [akka.tcp://
> fl...@qafdsflinkw811.nn.five9lab.com:42413]] Caused by: [Connection
> refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:42413]
> (akka.remote.ReliableDeliverySupervisor)
> [2018-03-21 00:30:35,975] ERROR Association to [akka.tcp://
> fl...@qafdsflinkw811.nn.five9lab.com:35320] with UID [1500204560]
> irrecoverably failed. Quarantining address. (akka.remote.Remoting)
> java.util.concurrent.TimeoutException: Remote system has been silent for
> too long. (more than 48.0 hours)
>         at
> akka.remote.ReliableDeliverySupervisor$$anonfun$idle$1.applyOrElse(Endpoint.scala:375)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>         at
> akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> I can't find a reason for this exception, any ideas?
>
> Thank you,
> Alex
>
