Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

Stephen Boesch Wed, 27 May 2015 11:45:20 -0700

Thanks Yana,

   My current experience: after running some small spark-submit based
tests, the Master once again stopped being reachable, with no change in
the test setup.  I restarted the Master and Worker and it is still not
reachable.

What variables might cause association between the Master and Worker to
stop succeeding?

For reference, here are the Master and Worker processes, followed by the
driver log:

  501 34465     1   0 11:35AM ??         0:06.50
/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java
-cp <classpath..> -Xms512m -Xmx512m -XX:MaxPermSize=128m
org.apache.spark.deploy.worker.Worker spark://mellyrn.local:7077
  501 34361     1   0 11:35AM ttys018    0:07.08
/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java
-cp <classpath..>  -Xms512m -Xmx512m -XX:MaxPermSize=128m
org.apache.spark.deploy.master.Master --ip mellyrn.local --port 7077
--webui-port 8080

15/05/27 11:36:37 INFO SparkUI: Started SparkUI at http://25.101.19.24:4040
15/05/27 11:36:37 INFO SparkContext: Added JAR
file:/shared/spark-perf/mllib-tests/target/mllib-perf-tests-assembly.jar at
http://25.101.19.24:60329/jars/mllib-perf-tests-assembly.jar with timestamp
1432751797662
15/05/27 11:36:37 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/27 11:36:37 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation:
Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
15/05/27 11:36:37 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
15/05/27 11:36:57 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/27 11:36:57 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation:
Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
15/05/27 11:36:57 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
15/05/27 11:37:17 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/27 11:37:17 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation:
Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
15/05/27 11:37:17 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
15/05/27 11:37:37 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: All masters are unresponsive! Giving up.
15/05/27 11:37:37 WARN SparkDeploySchedulerBackend: Application ID is not
initialized yet.
1
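
The "Connection refused" lines above mean that nothing accepted a TCP
connection on the address that mellyrn.local resolved to at that moment.
The logs in this thread show the name resolving to 10.0.0.3 on 05/20 and
to 25.101.19.24 on 05/27, so one thing worth checking is which address the
name currently maps to and whether anything is listening there. A minimal
sketch in Python (not part of Spark; check_master is a hypothetical helper
name) that resolves the host and tries a plain TCP connect to each address:

```python
import socket


def check_master(host, port, timeout=2.0):
    """Resolve host and try a TCP connect to every distinct address.

    Returns a list of (address, status) tuples. This only tests that the
    port accepts connections; it does not speak the Spark/Akka protocol.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # The name did not resolve at all (check /etc/hosts or DNS).
        return [(host, "resolve failed: %s" % exc)]

    results = []
    seen = set()
    for family, socktype, proto, _canonname, sockaddr in infos:
        addr = sockaddr[0]
        if addr in seen:
            continue
        seen.add(addr)
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            status = "listening"
        except OSError as exc:
            # Covers "Connection refused" as well as timeouts.
            status = "not reachable: %s" % exc
        finally:
            sock.close()
        results.append((addr, status))
    return results
```

Calling check_master("mellyrn.local", 7077) on the affected machine would
list each address the name resolves to and whether port 7077 accepts
connections on it.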

Even when startup succeeds, the time for the Master to come up has a
surprisingly high variance. I am running on a single machine with plenty
of RAM. Tight RAM was one problem before the present series of failures:
when RAM is tight, the failure modes can be unpredictable. But RAM is not
an issue now: plenty is available for both Master and Worker.

Within the same hour, over maybe a dozen start/stop cycles, the startup
time for the Master ranged from a few seconds up to several minutes.
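
One way to put numbers on that variance, and to avoid starting the Worker
before the Master accepts connections, is to poll the Master port after
launching it. A rough sketch in Python, assuming the host and port from
the command lines above; wait_for_port is a hypothetical helper name and
the deadline and interval values are arbitrary:

```python
import socket
import time


def wait_for_port(host, port, deadline_s=300.0, interval_s=1.0):
    """Poll host:port until a TCP connect succeeds.

    Returns the number of seconds waited, or None if deadline_s elapses
    before anything accepts a connection.
    """
    start = time.monotonic()
    while True:
        try:
            # create_connection resolves the name on every attempt, so a
            # change in what the host name maps to is picked up as well.
            with socket.create_connection((host, port), timeout=interval_s):
                return time.monotonic() - start
        except OSError:
            pass  # refused, timed out, or name not resolvable yet
        if time.monotonic() - start >= deadline_s:
            return None
        time.sleep(interval_s)
```

Running wait_for_port("mellyrn.local", 7077) right after starting the
Master and logging the result over several restarts would show whether
the slow startups line up with the failed Worker connections.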

2015-05-20 7:39 GMT-07:00 Yana Kadiyska <yana.kadiy...@gmail.com>:

> But if I'm reading his email correctly he's saying that:
>
> 1. The master and slave are on the same box (so network hiccups are an
> unlikely culprit)
> 2. The failures are intermittent -- i.e. the program works for a while,
> then the worker gets disassociated...
>
> Is it possible that the master restarted? We used to have problems like
> this where we'd restart the master process, it won't be listening on 7077
> for some time, but the worker process is trying to connect and by the time
> the master is up the worker has given up...
>
>
> On Wed, May 20, 2015 at 5:16 AM, Evo Eftimov <evo.efti...@isecc.com>
> wrote:
>
>> Check whether the name can be resolved in the /etc/hosts file (or DNS) of
>> the worker
>>
>>
>>
>> (the same btw applies for the Node where you run the driver app – all
>> other nodes must be able to resolve its name)
>>
>>
>>
>> *From:* Stephen Boesch [mailto:java...@gmail.com]
>> *Sent:* Wednesday, May 20, 2015 10:07 AM
>> *To:* user
>> *Subject:* Intermittent difficulties for Worker to contact Master on
>> same machine in standalone
>>
>>
>>
>>
>>
>> What conditions would cause the following delays / failure for a
>> standalone machine/cluster to have the Worker contact the Master?
>>
>>
>>
>> 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at
>> http://10.0.0.3:8081
>>
>> 15/05/20 02:02:53 INFO Worker: Connecting to master
>> akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>
>> 15/05/20 02:02:53 WARN Remoting: Tried to associate with unreachable
>> remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
>> now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
>>
>> 15/05/20 02:03:04 INFO Worker: Retrying connection to master (attempt # 1)
>>
>> ..
>>
>> ..
>>
>> 15/05/20 02:03:26 INFO Worker: Retrying connection to master (attempt # 3)
>>
>> 15/05/20 02:03:26 INFO Worker: Connecting to master
>> akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>
>> 15/05/20 02:03:26 WARN Remoting: Tried to associate with unreachable
>> remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
>> now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
>>
>
>
