Aaron: I see this in the Master's logs:
14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038

There was an executor that launched that did fail, such as:

14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2 on worker worker-20140519155427-hdn3.int.meetup.com-50038
14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2 because it is FAILED

... but other executors on other machines also failed without permanently disassociating.

There are these messages which I don't know if they are related:

14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Unfortunately, those errors are actually due to an Executor that exited,
> such that the connection between the Worker and Executor failed. This is
> not a fatal issue, unless there are analogous messages from the Worker to
> the Master (which should be present, if they exist, at around the same
> point in time).
>
> Do you happen to have the logs from the Master that indicate that the
> Worker terminated? Is it just an Akka disassociation, or some exception?
>
>
> On Tue, May 20, 2014 at 12:53 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> This isn't helpful of me to say, but, I see the same sorts of problem
>> and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
>> into when it happens, but usually after heavy use and after running
>> for a long time. I had figured I'd see if the changes since 0.9.0
>> addressed it and revisit later.
>>
>> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <jmar...@meetup.com> wrote:
>> > So, for example, I have two disassociated worker machines at the moment.
>> > The last messages in the spark logs are akka association error messages,
>> > like the following:
>> >
>> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
>> > [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] ->
>> > [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association
>> > failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
>> > akka.remote.EndpointAssociationException: Association failed with
>> > [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
>> > Caused by:
>> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> > Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
>> > ]
>> >
>> > On the master side, there are lots and lots of messages of the form:
>> >
>> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
>> > worker-20140520011737-hdn3.int.meetup.com-50038
>> >
>> > --j
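
For anyone trying to check whether the Master-side messages line up in time, as suggested above, here is a rough sketch of one way to tally them. It is not part of Spark: the function name `summarize_master_log` and the regular expressions are assumptions, based only on the log line shapes quoted in this thread, and would likely need adjusting for other Spark versions or log4j layouts.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts in this thread:
#   "14/05/20 01:17:37 WARN Master: <message>"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) Master: (.*)$")
UNREGISTERED_RE = re.compile(r"Got heartbeat from unregistered worker (\S+)")
REREGISTER_RE = re.compile(r"Attempted to re-register worker at same address: (\S+)")


def summarize_master_log(lines):
    """Count unregistered-worker heartbeats per worker ID and collect
    (timestamp, address) pairs for re-registration attempts.

    `lines` is any iterable of log lines (hypothetical helper, not a Spark API).
    """
    heartbeats = Counter()
    reregistrations = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a Master log line in the assumed format
        timestamp, _level, message = match.groups()
        heartbeat = UNREGISTERED_RE.search(message)
        if heartbeat:
            heartbeats[heartbeat.group(1)] += 1
            continue
        reregister = REREGISTER_RE.search(message)
        if reregister:
            reregistrations.append((timestamp, reregister.group(1)))
    return heartbeats, reregistrations
```

Passing it an open Master log file (a file object iterates over its lines) gives the per-worker heartbeat counts and the times of any re-registration attempts, which can then be compared against the timestamps of the Worker-side association errors.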