Spark should effectively turn Akka's failure detector off, because we historically had problems with GCs and other issues causing disassociations. The only thing that should cause these messages nowadays is if the TCP connection (which Akka sustains between Actor Systems on different machines) actually drops. TCP connections are pretty resilient, so one common cause of this is actual Executor failure -- recently, I have experienced a similar-sounding problem due to my machine's OOM killer terminating my Executors, such that they didn't produce any error output.
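If you do want to experiment with the Akka timeouts yourself, the relevant knobs in Spark 1.x look roughly like the fragment below. This is a sketch from memory, not a definitive recipe -- check the property names and defaults against your Spark version's configuration docs before relying on them (Spark already sets the heartbeat pauses very high by default, which is what "turns the failure detector off"):

```
# spark-defaults.conf (Spark 1.x property names -- verify against your version)
spark.akka.timeout             300      # seconds; Akka communication timeout
spark.akka.heartbeat.pauses    6000     # seconds; acceptable heartbeat pause
spark.akka.heartbeat.interval  1000     # seconds; heartbeat interval
spark.akka.logLifecycleEvents  true     # log Akka remoting lifecycle events,
                                        # including disassociations
```

The last property is probably what you want for your second question: it makes the Akka remoting lifecycle events (associations and disassociations) show up in the driver/executor logs, which should point at the root cause.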
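To check whether the OOM killer is behind silently dying Executors, grep the kernel log on the worker nodes. A minimal sketch (the exact log-line format varies by kernel version; the process ID and line below are made up for illustration):

```shell
# On a worker node, check the kernel log for OOM-killer activity;
# Executors killed this way die without writing any error output:
#   dmesg | grep -iE 'killed process|out of memory'

# A simulated log line showing what the grep matches:
echo 'Out of memory: Killed process 12345 (java) score 901' \
  | grep -iE 'killed process|out of memory'
# prints the matching line
```

If you see your Executors' JVM PIDs in that output, the fix is usually to lower the Executor heap or add memory/swap, not to touch the Akka settings.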
On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
> Hi all,
>
> On an ARM cluster, I have been testing a wordcount program with JRE 7
> and everything is OK. But when changing to the embedded version of
> Java SE (Oracle's eJRE), the same program cannot complete all
> computing stages.
>
> It fails with many Akka disassociations.
>
> - I've been trying to increase Akka's timeouts but am still stuck. I'm
> not sure what the right way to do so is. (I suspect that GC pausing
> the world is causing this.)
>
> - Another question: how can I properly turn on Akka's logging to see
> the root cause of this disassociation problem? (In case my guess
> about GC is wrong.)
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit