woops, sorry! Whenever I read the word "deadlock" I get a bit nervous and distracted ;-)
2015-06-19 15:21 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:

> I think Andra wrote that there is *no deadlock*.
>
> On Fri, Jun 19, 2015 at 3:18 PM Fabian Hueske <fhue...@gmail.com> wrote:
>
> > Hi Andra,
> >
> > The system should never deadlock. There is a bug somewhere if that
> > happens.
> >
> > Can you check if the program is really stuck?
> >
> > Cheers, Fabian
> >
> > 2015-06-19 15:08 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:
> >
> > > What does forever mean? Usually you see a steep decline in
> > > performance once the system starts spilling data to disk, because
> > > disk I/O becomes the bottleneck.
> > >
> > > The system always starts spilling to disk when it has no more memory
> > > left for its operations. For example, if you want to sort data which
> > > cannot be kept completely in memory, the system has to employ
> > > external sorting. If you can give Flink more memory, you can avoid
> > > this problem, depending on the actual data size.
> > >
> > > You can observe disk I/O with the `iotop` command, for example. If
> > > you see Flink having high I/O usage, this might be an indicator of
> > > spilling.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Jun 19, 2015 at 2:54 PM Andra Lungu <lungu.an...@gmail.com> wrote:
> > >
> > > > Another problem that I encountered during the same set of
> > > > experiments (sorry if I am asking too many questions, I am eager to
> > > > get things fixed):
> > > > - for the same configuration, a piece of code runs perfectly on
> > > > 10GB of input, then for 38GB it runs forever (no deadlock).
> > > >
> > > > I believe that may occur because Flink spills information to disk
> > > > every time it runs out of memory... Is this fixable by increasing
> > > > the number of buffers?
> > > >
> > > > That's the last question for today, promise :)
> > > >
> > > > Thanks!
> > > >
> > > > On Fri, Jun 19, 2015 at 2:40 PM, Till Rohrmann <trohrm...@apache.org> wrote:
> > > >
> > > > > Yes, it was an issue for the milestone release.
> > > > >
> > > > > On Fri, Jun 19, 2015 at 2:18 PM Andra Lungu <lungu.an...@gmail.com> wrote:
> > > > >
> > > > > > Yes, so I am using flink-0.9.0-milestone-1. Was it a problem
> > > > > > for this version? I'll just fetch the latest master if this is
> > > > > > the case.
> > > > > >
> > > > > > On Fri, Jun 19, 2015 at 2:12 PM, Till Rohrmann <trohrm...@apache.org> wrote:
> > > > > >
> > > > > > > Hi Andra,
> > > > > > >
> > > > > > > the problem seems to be that the deployment of some tasks
> > > > > > > takes longer than 100s. From the stack trace it looks as if
> > > > > > > you're not using the latest master.
> > > > > > >
> > > > > > > We had problems with previous versions where the deployment
> > > > > > > call waited for the TM to completely download the user code
> > > > > > > jars. For large setups the BlobServer became a bottleneck and
> > > > > > > some of the deployment calls timed out. We updated the
> > > > > > > deployment logic so that the TM sends an immediate ACK back
> > > > > > > to the JM when it receives a new task.
> > > > > > >
> > > > > > > Could you verify which version of Flink you're running and,
> > > > > > > in case it's not the latest master, could you please try to
> > > > > > > run your example with the latest code?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Fri, Jun 19, 2015 at 1:42 PM Andra Lungu <lungu.an...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I ran a job this morning on 30 wally nodes. DOP 224. Worked
> > > > > > > > like a charm.
> > > > > > > >
> > > > > > > > Then, I ran a similar job, on the exact same configuration,
> > > > > > > > on the same input data set. The only difference between
> > > > > > > > them is that the second job computes the degrees per vertex
> > > > > > > > and, for vertices with degree higher than a user-defined
> > > > > > > > threshold, it does a bit of magic (roughly a bunch of
> > > > > > > > coGroups). The problem is that, even before the extra
> > > > > > > > functions get called, I get the following type of
> > > > > > > > exception:
> > > > > > > >
> > > > > > > > 06/19/2015 12:06:43 CHAIN FlatMap (FlatMap at fromDataSet(Graph.java:171)) -> Combine(Distinct at fromDataSet(Graph.java:171))(222/224) switched to FAILED
> > > > > > > > java.lang.IllegalStateException: Update task on instance 29073fb0b0957198a2b67569b042d56b @ wally004 - 8 slots - URL: akka.tcp://flink@130.149.249.14:44528/user/taskmanager failed due to:
> > > > > > > >     at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:860)
> > > > > > > >     at akka.dispatch.OnFailure.internal(Future.scala:228)
> > > > > > > >     at akka.dispatch.OnFailure.internal(Future.scala:227)
> > > > > > > >     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
> > > > > > > >     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
> > > > > > > >     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> > > > > > > >     at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
> > > > > > > >     at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
> > > > > > > >     at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
> > > > > > > >     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> > > > > > > >     at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
> > > > > > > >     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > > > > > >     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > > > > > > >     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > > > > > >     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > > > > > > Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@130.149.249.14:44528/user/taskmanager#82700874]] after [100000 ms]
> > > > > > > >     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
> > > > > > > >     at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
> > > > > > > >     at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
> > > > > > > >     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
> > > > > > > >     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
> > > > > > > >     at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
> > > > > > > >     at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
> > > > > > > >     at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
> > > > > > > >     at java.lang.Thread.run(Thread.java:722)
> > > > > > > >
> > > > > > > > At first I thought, okay, maybe wally004 is down; then I
> > > > > > > > ssh'd into it. It works fine.
> > > > > > > >
> > > > > > > > The full output can be found here:
> > > > > > > > https://gist.github.com/andralungu/d222b75cb33aea57955d
> > > > > > > >
> > > > > > > > Does anyone have any idea about what may have triggered
> > > > > > > > this? :(
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Andra
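
A note on the spilling discussion above: Till's suggestion to give Flink more memory and Andra's question about increasing the number of buffers both map to flink-conf.yaml settings. A minimal sketch, assuming a 0.9-era configuration; the values below are illustrative, not recommendations for this particular cluster:

    # flink-conf.yaml (illustrative values only)
    # Heap per TaskManager JVM; more heap means more managed memory
    # for sorting/hashing before operators have to spill to disk.
    taskmanager.heap.mb: 4096
    # Fraction of the heap handed to Flink's managed memory.
    taskmanager.memory.fraction: 0.7
    # Network buffers for shuffling data between TaskManagers;
    # high degrees of parallelism (e.g. DOP 224) need more of them.
    taskmanager.network.numberOfBuffers: 4096

Note that more network buffers speed up the data exchange but do not prevent spilling by themselves; spilling stops only once the managed memory grows beyond the working set of the operator.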
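The "after [100000 ms]" in the trace is the 100 s ask timeout Till refers to. The real fix is the updated deployment logic on the latest master, but on an older version one possible stop-gap, sketched under the assumption that the 0.9-era `akka.ask.timeout` key is in use, is to raise the actor timeout:

    # flink-conf.yaml (stop-gap sketch only; it hides the symptom,
    # it does not remove the BlobServer bottleneck)
    akka.ask.timeout: 300 s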
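For readers trying to picture the failing job: "degrees per vertex, then coGroups for vertices above a user-defined threshold" has roughly the shape below in the DataSet API. This is a hypothetical minimal sketch of that pattern, not Andra's actual code; the threshold, the toy edges, and the body of the coGroup are placeholders.

    import org.apache.flink.api.common.functions.CoGroupFunction;
    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    public class DegreeThresholdSketch {

        public static void main(String[] args) throws Exception {
            final long threshold = 2; // hypothetical user-defined threshold
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Toy (source, target) edges; the real job reads the 10/38 GB input.
            DataSet<Tuple2<Long, Long>> edges = env.fromElements(
                    new Tuple2<Long, Long>(1L, 2L), new Tuple2<Long, Long>(1L, 3L),
                    new Tuple2<Long, Long>(1L, 4L), new Tuple2<Long, Long>(2L, 3L));

            // Degree per vertex: emit (vertex, 1) for both endpoints, sum the ones.
            DataSet<Tuple2<Long, Long>> degrees = edges
                    .flatMap(new FlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                        @Override
                        public void flatMap(Tuple2<Long, Long> edge,
                                            Collector<Tuple2<Long, Long>> out) {
                            out.collect(new Tuple2<Long, Long>(edge.f0, 1L));
                            out.collect(new Tuple2<Long, Long>(edge.f1, 1L));
                        }
                    })
                    .groupBy(0).sum(1);

            // Keep only the vertices whose degree exceeds the threshold.
            DataSet<Tuple2<Long, Long>> highDegree = degrees
                    .filter(new FilterFunction<Tuple2<Long, Long>>() {
                        @Override
                        public boolean filter(Tuple2<Long, Long> vertexDegree) {
                            return vertexDegree.f1 > threshold;
                        }
                    });

            // CoGroup the high-degree vertices with their edges; the real job
            // would do its "bit of magic" inside this function.
            DataSet<Tuple2<Long, Long>> result = highDegree
                    .coGroup(edges).where(0).equalTo(0)
                    .with(new CoGroupFunction<Tuple2<Long, Long>, Tuple2<Long, Long>,
                            Tuple2<Long, Long>>() {
                        @Override
                        public void coGroup(Iterable<Tuple2<Long, Long>> vertices,
                                            Iterable<Tuple2<Long, Long>> vertexEdges,
                                            Collector<Tuple2<Long, Long>> out) {
                            // Placeholder: forward the edges of high-degree vertices.
                            if (vertices.iterator().hasNext()) {
                                for (Tuple2<Long, Long> edge : vertexEdges) {
                                    out.collect(edge);
                                }
                            }
                        }
                    });

            result.print();
        }
    }

The coGroup itself runs before any of the "extra functions" would matter here, which matches Andra's observation that the failure occurs before the threshold logic is even reached.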