Thanks Yana,

My current experience is that, after running some small spark-submit based tests, the Master once again stopped being reachable. There was no change in the test setup. I restarted the Master and Worker, and the Master is still not reachable.
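One way to narrow this down might be to separate "the name does not resolve" from "nothing is listening on the port". The sketch below is only an illustration under assumptions: `probe` is a hypothetical helper, not part of Spark, and the 3 second timeout is an arbitrary choice. It also returns the addresses the name resolved to, which may be worth comparing across runs, since the logs in this thread show mellyrn.local as 10.0.0.3 on 05/20 and as 25.101.19.24 on 05/27.

```python
import socket


def probe(host, port, timeout=3.0):
    """Resolve host and try a TCP connect to host:port.

    Returns (status, addresses). status is "unresolved", "open",
    "refused", "timeout" or "error"; addresses lists the resolved IPs.
    If no address accepts, status describes the last address tried.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Name resolution failed (/etc/hosts or DNS has no entry).
        return "unresolved", []
    addresses = sorted({info[4][0] for info in infos})
    status = "error"
    for family, socktype, proto, _canon, sockaddr in infos:
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return "open", addresses
        except ConnectionRefusedError:
            # Host answered, but no process is listening on the port.
            status = "refused"
        except socket.timeout:
            status = "timeout"
        except OSError:
            status = "error"
        finally:
            sock.close()
    return status, addresses
```

For the setup in this thread the call would be `probe("mellyrn.local", 7077)`. A "refused" result together with an unexpected address would point at name resolution rather than at the Master process itself.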
What might be the variables here that cause association between the Master and Worker to stop succeeding?

For reference, here are the Master and Worker processes:

501 34465 1 0 11:35AM ?? 0:06.50 /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java -cp <classpath..> -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.worker.Worker spark://mellyrn.local:7077
501 34361 1 0 11:35AM ttys018 0:07.08 /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java -cp <classpath..> -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.master.Master --ip mellyrn.local --port 7077 --webui-port 8080

15/05/27 11:36:37 INFO SparkUI: Started SparkUI at http://25.101.19.24:4040
15/05/27 11:36:37 INFO SparkContext: Added JAR file:/shared/spark-perf/mllib-tests/target/mllib-perf-tests-assembly.jar at http://25.101.19.24:60329/jars/mllib-perf-tests-assembly.jar with timestamp 1432751797662
15/05/27 11:36:37 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/27 11:36:37 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
15/05/27 11:36:37 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
15/05/27 11:36:57 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/27 11:36:57 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
15/05/27 11:36:57 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
15/05/27 11:37:17 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
15/05/27 11:37:17 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
15/05/27 11:37:17 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
15/05/27 11:37:37 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/05/27 11:37:37 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.

Even when startup succeeds, the time for the Master to come up has a surprisingly high variance. I am running on a single machine with plenty of RAM. Note that RAM was one problem before the present series of failures: if RAM is tight, the failure modes can be unpredictable. But RAM is not an issue now: plenty is available for both Master and Worker. Within the same hour, starting and stopping maybe a dozen times, the startup time for the Master ranged from a few seconds to several minutes.

2015-05-20 7:39 GMT-07:00 Yana Kadiyska <yana.kadiy...@gmail.com>:

> But if I'm reading his email correctly he's saying that:
>
> 1. The master and slave are on the same box (so network hiccups are
> unlikely culprit)
> 2. The failures are intermittent -- i.e program works for a while then
> worker gets disassociated...
>
> Is it possible that the master restarted?
> We used to have problems like
> this where we'd restart the master process, it won't be listening on 7077
> for some time, but the worker process is trying to connect and by the time
> the master is up the worker has given up...
>
> On Wed, May 20, 2015 at 5:16 AM, Evo Eftimov <evo.efti...@isecc.com>
> wrote:
>
>> Check whether the name can be resolved in the /etc/hosts file (or DNS) of
>> the worker
>>
>> (the same btw applies for the Node where you run the driver app – all
>> other nodes must be able to resolve its name)
>>
>> *From:* Stephen Boesch [mailto:java...@gmail.com]
>> *Sent:* Wednesday, May 20, 2015 10:07 AM
>> *To:* user
>> *Subject:* Intermittent difficulties for Worker to contact Master on
>> same machine in standalone
>>
>> What conditions would cause the following delays / failure for a
>> standalone machine/cluster to have the Worker contact the Master?
>>
>> 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at
>> http://10.0.0.3:8081
>>
>> 15/05/20 02:02:53 INFO Worker: Connecting to master
>> akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>
>> 15/05/20 02:02:53 WARN Remoting: Tried to associate with unreachable
>> remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
>> now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
>>
>> 15/05/20 02:03:04 INFO Worker: Retrying connection to master (attempt # 1)
>>
>> ..
>>
>> ..
>>
>> 15/05/20 02:03:26 INFO Worker: Retrying connection to master (attempt # 3)
>>
>> 15/05/20 02:03:26 INFO Worker: Connecting to master
>> akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>
>> 15/05/20 02:03:26 WARN Remoting: Tried to associate with unreachable
>> remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
>> now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
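If the restart race described in the quoted reply is the cause (the Worker gives up before the Master is listening on 7077), one possible workaround is to poll the port and only start the Worker or spark-submit once it accepts connections. This is a minimal sketch under assumptions: `wait_for_port` is a hypothetical helper, not a Spark facility, and the 120 second deadline and 2 second interval are guesses that would need tuning against the startup times actually observed.

```python
import socket
import time


def wait_for_port(host, port, deadline_s=120.0, interval_s=2.0):
    """Poll until host:port accepts a TCP connection.

    Returns the seconds waited on success, or None if deadline_s
    elapsed without a successful connect.
    """
    start = time.monotonic()
    while True:
        try:
            # A successful connect means something is listening.
            with socket.create_connection((host, port), timeout=interval_s):
                return time.monotonic() - start
        except OSError:
            # Refused, timed out, or name not resolvable yet: retry.
            pass
        if time.monotonic() - start >= deadline_s:
            return None
        time.sleep(interval_s)
```

Calling `wait_for_port("mellyrn.local", 7077)` right after starting the Master and logging the returned value would also give a direct measurement of the startup-time variance, without relying on the Worker's own retry schedule.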