Re: advice on maintaining a production spark cluster?

Josh Marcus Tue, 20 May 2014 20:22:17 -0700

Aaron:

I see this in the Master's logs:

14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same
address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker
worker-20140520011737-hdn3.int.meetup.com-50038

One executor that launched did fail, for example:
14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2
on worker worker-20140519155427-hdn3.int.meetup.com-50038
14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2
because it is FAILED

... but other executors on other machines also failed without permanently
disassociating.

There are also these messages, though I don't know whether they are related:
14/05/20 01:17:38 INFO LocalActorRef: Message
[akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
was not delivered. [3] dead letters encountered. This logging can be
turned off or adjusted with configuration settings
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/05/20 01:17:38 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
was not delivered. [4] dead letters encountered. This logging can be
turned off or adjusted with configuration settings
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
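
A minimal sketch for lining up Master log events like the ones above by worker host and port, so that a worker re-registering under a new ID at the same address stands out. It assumes each log entry sits on a single line (as in the log file itself, not as wrapped in this email) and that worker IDs have the form worker-<timestamp>-<host>-<port> seen in the excerpts. The function names are hypothetical, not part of Spark.

```python
import re
from collections import defaultdict

# Timestamp, level, and message of a Master log line in the format quoted
# above: "yy/MM/dd HH:mm:ss LEVEL Master: message".
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) Master: (.*)$")
# Worker IDs as they appear in the excerpts: worker-<14 digits>-<host>-<port>.
WORKER_RE = re.compile(r"worker-(\d{14})-([\w.\-]+)-(\d+)")


def group_by_worker_address(lines):
    """Group Master log events by the (host, port) of the worker ID they mention.

    Lines that do not come from the Master logger, or that mention no
    worker ID (for example the akka.tcp:// address form), are skipped.
    """
    events = defaultdict(list)
    for line in lines:
        line_match = LINE_RE.match(line.strip())
        if not line_match:
            continue
        timestamp, level, message = line_match.groups()
        worker_match = WORKER_RE.search(message)
        if not worker_match:
            continue
        started, host, port = worker_match.groups()
        events[(host, int(port))].append((timestamp, level, started, message))
    return dict(events)


def addresses_with_multiple_ids(events):
    """Return the addresses at which more than one worker ID was seen."""
    result = {}
    for address, address_events in events.items():
        ids = sorted({event[2] for event in address_events})
        if len(ids) > 1:
            result[address] = ids
    return result


if __name__ == "__main__":
    # Two entries from the excerpts above, joined onto single lines.
    sample = [
        "14/05/20 01:16:05 INFO Master: Launching executor "
        "app-20140520011605-0001/2 on worker "
        "worker-20140519155427-hdn3.int.meetup.com-50038",
        "14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered "
        "worker worker-20140520011737-hdn3.int.meetup.com-50038",
    ]
    grouped = group_by_worker_address(sample)
    for address, ids in addresses_with_multiple_ids(grouped).items():
        print(address, ids)
```

Against a real log, the lines would come from the Master's log file (for example `open(path)`), and the per-address event lists can be sorted by timestamp to see what happened around a re-registration.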

On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Unfortunately, those errors are actually due to an Executor that exited,
> such that the connection between the Worker and Executor failed. This is
> not a fatal issue, unless there are analogous messages from the Worker to
> the Master (which should be present, if they exist, at around the same
> point in time).
>
> Do you happen to have the logs from the Master that indicate that the
> Worker terminated? Is it just an Akka disassociation, or some exception?
>
>
> On Tue, May 20, 2014 at 12:53 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> This isn't helpful of me to say, but I see the same sorts of problems
>> and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
>> into exactly when it happens, but it is usually after heavy use and
>> after running for a long time. I had figured I'd see if the changes
>> since 0.9.0 addressed it and revisit later.
>>
>> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <jmar...@meetup.com> wrote:
>> > So, for example, I have two disassociated worker machines at the moment.
>> > The last messages in the spark logs are akka association error messages,
>> > like the following:
>> >
>> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
>> > [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] ->
>> > [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association
>> > failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
>> > akka.remote.EndpointAssociationException: Association failed with
>> > [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
>> > Caused by:
>> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> > Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
>> > ]
>> >
>> > On the master side, there are lots and lots of messages of the form:
>> >
>> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
>> > worker-20140520011737-hdn3.int.meetup.com-50038
>> >
>> > --j
>> >
>> >
>>
>
>
