Hi Andra,

the problem seems to be that the deployment of some tasks takes longer than
100s, which is the default ask timeout. From the stack trace it looks as if
you're not running the latest master.

We had problems with previous versions where the deployment call waited for
the TM to completely download the user-code jars. For large setups the
BlobServer became a bottleneck and some of the deployment calls timed out.
We updated the deployment logic so that the TM sends an immediate ACK back
to the JM when it receives a new task.
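The change is easiest to see as a sketch. This is hypothetical code, not the actual Flink implementation; the class and method names are made up, and the jar download is simulated with a sleep:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the TaskManager acknowledges a submitted task right
// away and fetches the user-code jars asynchronously, so the JobManager's
// ask future completes long before the 100 s timeout, even when the
// BlobServer is slow.
public class SubmitTaskSketch {

    // Stands in for downloading user-code jars from the BlobServer (slow).
    static void downloadJars() {
        try {
            TimeUnit.MILLISECONDS.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Old behaviour: the ACK is only sent after the download has finished,
    // so the JM's ask can time out when many TMs hit the BlobServer at once.
    static String submitBlocking() {
        downloadJars();
        return "ACK";
    }

    // New behaviour: ACK immediately; the download continues in the
    // background on a separate thread.
    static String submitWithImmediateAck() {
        CompletableFuture.runAsync(SubmitTaskSketch::downloadJars);
        return "ACK";
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        String ack = submitWithImmediateAck();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(ack + " after " + elapsedMs + " ms");
    }
}
```

The point is only that the acknowledgement is decoupled from the slow jar transfer, so the ask timeout no longer scales with BlobServer load.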

Could you verify which version of Flink you're running and, in case it's not
the latest master, try running your example with the latest code?
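In case upgrading isn't an option right away, you could also try raising the ask timeout in conf/flink-conf.yaml as a stop-gap; the "[100000 ms]" in your exception corresponds to the default of 100 s (treat the exact value below as an example, not a recommendation):

```yaml
# Raise the actor ask timeout from the default of 100 s.
# The value is a duration string.
akka.ask.timeout: 300 s
```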

Cheers,
Till

On Fri, Jun 19, 2015 at 1:42 PM Andra Lungu <lungu.an...@gmail.com> wrote:

> Hi everyone,
>
> I ran a job this morning on 30 wally nodes. DOP 224. Worked like a charm.
>
> Then, I ran a similar job, on the exact same configuration, on the same
> input data set. The only difference between them is that the second job
> computes the degrees per vertex and, for vertices with degree higher than a
> user-defined threshold, it does a bit of magic (roughly a bunch of
> coGroups). The problem is that, even before the extra functions get called,
> I get the following type of exception:
>
> 06/19/2015 12:06:43     CHAIN FlatMap (FlatMap at fromDataSet(Graph.java:171)) -> Combine(Distinct at fromDataSet(Graph.java:171))(222/224) switched to FAILED
> java.lang.IllegalStateException: Update task on instance 29073fb0b0957198a2b67569b042d56b @ wally004 - 8 slots - URL: akka.tcp://flink@130.149.249.14:44528/user/taskmanager failed due to:
>         at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:860)
>         at akka.dispatch.OnFailure.internal(Future.scala:228)
>         at akka.dispatch.OnFailure.internal(Future.scala:227)
>         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
>         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>         at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
>         at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
>         at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>         at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@130.149.249.14:44528/user/taskmanager#82700874]] after [100000 ms]
>         at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
>         at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>         at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>         at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
>         at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
>         at java.lang.Thread.run(Thread.java:722)
>
>
>  At first I thought, okay, maybe wally004 is down; then I ssh'd into it.
> It works fine.
>
> The full output can be found here:
> https://gist.github.com/andralungu/d222b75cb33aea57955d
>
> Does anyone have any idea about what may have triggered this? :(
>
> Thanks!
> Andra
>
