Re: Standalone cluster instability

Piotr Nowojski Wed, 21 Mar 2018 08:23:22 -0700

Hi,

Does the issue really happen only after 48 hours?
Is there any indication of a failure in the TaskManager log?


If you are still unable to solve the problem, please provide the full
TaskManager and JobManager logs.
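
In case it helps while going through the logs, below is a minimal sketch (not
anything shipped with Flink) that counts the host/port pairs appearing in
failed-association messages, so they can be compared with the port the
TaskManager prints on startup. The function name and the assumption that each
log entry sits on a single line are mine; adjust as needed.

```python
import re
from collections import Counter

# Matches Akka remote addresses of the form akka.tcp://<system>@<host>:<port>.
ADDRESS = re.compile(r"akka\.tcp://[^@\s\]]+@([^:\s\]]+):(\d+)")


def failed_association_ports(log_text):
    """Count (host, port) pairs on lines reporting a failed association.

    Assumes one log entry per line (hypothetical helper, for illustration).
    """
    counts = Counter()
    for line in log_text.splitlines():
        # Only look at the Akka association failure messages.
        if "Association" not in line or "failed" not in line:
            continue
        # The same address can appear twice in one entry; count it once.
        for host, port in set(ADDRESS.findall(line)):
            counts[(host, int(port))] += 1
    return counts
```

Calling `failed_association_ports(open(path).read())` on a JobManager log
(where `path` is whatever your log file is called) gives a quick view of
which remote ports keep failing.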

Piotrek

> On 21 Mar 2018, at 16:00, Alexander Smirnov <alexander.smirn...@gmail.com> 
> wrote:
> 
> One more question - I see a lot of lines like the following in the logs:
> 
> [2018-03-21 00:30:35,975] ERROR Association to 
> [akka.tcp://fl...@qafdsflinkw811.nn.five9lab.com:35320] with UID [1500204560] 
> irrecoverably failed. Quarantining address. (akka.remote.Remoting)
> [2018-03-21 00:34:15,208] WARN Association to 
> [akka.tcp://fl...@qafdsflinkw811.nn.five9lab.com:41068] with unknown UID is 
> irrecoverably failed. Address cannot be quarantined without knowing the UID, 
> gating instead for 5000 ms. (akka.remote.Remoting)
> [2018-03-21 00:34:15,235] WARN Association to 
> [akka.tcp://fl...@qafdsflinkw811.nn.five9lab.com:40677] with unknown UID is 
> irrecoverably failed. Address cannot be quarantined without knowing the UID, 
> gating instead for 5000 ms. (akka.remote.Remoting)
> [2018-03-21 00:34:15,256] WARN Association to 
> [akka.tcp://fl...@qafdsflinkw811.nn.five9lab.com:40382] with unknown UID is 
> irrecoverably failed. Address cannot be quarantined without knowing the UID, 
> gating instead for 5000 ms. (akka.remote.Remoting)
> [2018-03-21 00:34:15,256] WARN Association to 
> [akka.tcp://fl...@qafdsflinkw811.nn.five9lab.com:44744] with unknown UID is 
> irrecoverably failed. Address cannot be quarantined without knowing the UID, 
> gating instead for 5000 ms. (akka.remote.Remoting)
> [2018-03-21 00:34:15,266] WARN Association to 
> [akka.tcp://fl...@qafdsflinkw811.nn.five9lab.com:42413] with unknown UID is 
> irrecoverably failed. Address cannot be quarantined without knowing the UID, 
> gating instead for 5000 ms. (akka.remote.Remoting)
> 
> 
> The host is available, but I don't understand where the port number comes from. 
> The Task Manager uses another port (which is printed in the logs on startup).
> Could you please help me understand why this happens?
> 
> Thank you,
> Alex
> 
> 
> On Wed, Mar 21, 2018 at 4:19 PM Alexander Smirnov 
> <alexander.smirn...@gmail.com> wrote:
> Hello,
> 
> I've assembled a standalone cluster of 3 task managers and 3 job managers (and 
> 3 ZK) following the instructions at 
> 
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/cluster_setup.html
> and 
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/jobmanager_high_availability.html
> 
> It works OK, but task managers randomly become unavailable. The JobManager has 
> exceptions like the ones below in its logs:
> 
> 
> [2018-03-19 00:33:10,211] WARN Association with remote system 
> [akka.tcp://fl...@qafdsflinkw811.nn.five9lab.com:42413] has failed, address is 
> now gated for [5000] ms. Reason: [Association failed with 
> [akka.tcp://fl...@qafdsflinkw811.nn.five9lab.com:42413]] Caused by: [Connection 
> refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:42413] 
> (akka.remote.ReliableDeliverySupervisor)
> [2018-03-21 00:30:35,975] ERROR Association to 
> [akka.tcp://fl...@qafdsflinkw811.nn.five9lab.com:35320] with UID [1500204560] 
> irrecoverably failed. Quarantining address. (akka.remote.Remoting)
> java.util.concurrent.TimeoutException: Remote system has been silent for too 
> long. (more than 48.0 hours)
>         at akka.remote.ReliableDeliverySupervisor$$anonfun$idle$1.applyOrElse(Endpoint.scala:375)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>         at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 
> I can't find a reason for this exception, any ideas?
> 
> Thank you,
> Alex
