I put the settings as you specified in spark-env.sh for the master. When
I run start-all.sh, the web UI shows both the worker on the master
(machine1) and the slave worker (machine2) as ALIVE and ready, with the
master URL at spark://192.168.1.101. However, when I run spark-submit,
it immediately crashes with
py4j.protocol.Py4JJavaError
14/06/27 09:01:32 ERROR Remoting: Remoting error: [Startup failed]
akka.remote.RemoteTransportException: Startup failed
[...]
org.jboss.netty.channel.ChannelException: Failed to bind to /192.168.1.101:5060
[...]
java.net.BindException: Address already in use.
[...]
This seems entirely contrary to intuition; why would Spark be unable to
bind to the exact IP:port set for the master?
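For what it's worth, the BindException only says that some socket already holds that exact address and port; it does not say which process owns it. Here is a minimal Python sketch of that failure mode. It assumes nothing about Spark itself: it uses the loopback address and an OS-chosen port rather than 192.168.1.101:5060, and the helper name second_bind_fails is made up for illustration.

```python
import errno
import socket


def second_bind_fails(host="127.0.0.1"):
    """Bind a listener, then try to bind a second socket to the same address.

    Returns True if the second bind is rejected with EADDRINUSE.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))          # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same IP:port as the live listener
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        second.close()
        first.close()
```

If the same thing is happening here, a second process (possibly the one started by spark-submit) would be asking for 192.168.1.101:5060 while the master already holds it. That is a guess on my part, not something the trace confirms.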
On 6/27/14, 1:54 AM, Akhil Das wrote:
Hi Shannon,
How about a setting like the following? (just removed the quotes)
export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
#export SPARK_LOCAL_IP=127.0.0.1
Not sure what's happening in your case; it could be that your system is
not able to bind to the 192.168.1.101 address. What is the spark:// master
URL that you are seeing in the web UI? (It should be
spark://192.168.1.101:7077 in your case.)
Thanks
Best Regards
On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squ...@gatech.edu> wrote:
In the interest of completeness, this is how I invoke spark:
[on master]
> sbin/start-all.sh
> spark-submit --py-files extra.py main.py
iPhone'd
On Jun 26, 2014, at 17:29, Shannon Quinn <squ...@gatech.edu> wrote:
My *best guess* (please correct me if I'm wrong) is that the
master (machine1) is sending the command to the worker (machine2)
with the localhost argument as-is; that is, machine2 isn't doing
any weird address conversion on its end.
Consequently, I've been focusing on the settings of the
master/machine1. But I haven't found anything to indicate where
the localhost argument could be coming from. /etc/hosts lists
only 127.0.0.1 as localhost; spark-defaults.conf lists
spark.master as the full IP address (not 127.0.0.1); spark-env.sh
on the master also lists the full IP under SPARK_MASTER_IP. The
*only* place on the master where it's associated with localhost
is SPARK_LOCAL_IP.
In looking at the logs of the worker spawned on master, it's also
receiving a "spark://localhost:5060" argument, but since it
resides on the master that works fine. Is it possible that the
master is, for some reason, passing
"spark://{SPARK_LOCAL_IP}:5060" to the workers?
That was my motivation behind commenting out SPARK_LOCAL_IP;
however, that's when the master crashes immediately due to the
address already being in use.
Any ideas? Thanks!
Shannon
On 6/26/14, 10:14 AM, Akhil Das wrote:
Can you paste your spark-env.sh file?
Thanks
Best Regards
On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squ...@gatech.edu> wrote:
The /etc/hosts files on both machines list each other's IP addresses.
Telnetting from machine2 to machine1 on port 5060 works just
fine.
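The telnet test can also be scripted, which helps when the check has to be repeated after each config change. This is a rough Python equivalent using only the standard library. The helper name port_is_reachable is hypothetical, and a successful connect only shows that something is listening on that port, not that the listener is Spark.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run from machine2, port_is_reachable("machine1", 5060) would correspond to the telnet check described above.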
Here's the output of lsof:
user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
java    23985 user  30u   IPv6 11092354      0t0  TCP machine1:sip (LISTEN)
java    23985 user  40u   IPv6 11099560      0t0  TCP machine1:sip->machine1:48315 (ESTABLISHED)
java    23985 user  52u   IPv6 11100405      0t0  TCP machine1:sip->machine2:54476 (ESTABLISHED)
java    24157 user  40u   IPv6 11092413      0t0  TCP machine1:48315->machine1:sip (ESTABLISHED)
Ubuntu seems to recognize 5060 as the standard port for
"sip"; it's not actually running anything there besides
Spark, it just does a s/5060/sip/g.
Is there something to the fact that every time I comment out
SPARK_LOCAL_IP in spark-env, it crashes immediately upon
spark-submit due to the "address already being in use"? Or
am I barking up the wrong tree on that one?
Thanks again for all your help; I hope we can knock this one
out.
Shannon
On 6/26/14, 9:13 AM, Akhil Das wrote:
Do you have <ip> machine1 in your worker's
/etc/hosts also? If so, try telnetting from machine2 to
machine1 on port 5060. Also make sure nothing other than
Spark is running on port 5060 (lsof -i:5060).
Thanks
Best Regards
On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squ...@gatech.edu> wrote:
Still running into the same problem. /etc/hosts on the
master says
127.0.0.1 localhost
<ip> machine1
<ip> is the same address set in spark-env.sh for
SPARK_MASTER_IP. Any other ideas?
On 6/26/14, 3:11 AM, Akhil Das wrote:
Hi Shannon,
It looks like a configuration issue. Check your
/etc/hosts and make sure localhost is not associated
with the SPARK_MASTER_IP you provided.
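That /etc/hosts check can be done mechanically. A small sketch, assuming the usual "IP name [alias ...]" line format; the function names are hypothetical, and the sample contents use 192.168.1.101 as a stand-in for the real address.

```python
def hosts_names_by_ip(hosts_text):
    """Map each IP in /etc/hosts-style text to the names listed for it."""
    mapping = {}
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        mapping.setdefault(ip, []).extend(names)
    return mapping


def localhost_shares_ip(hosts_text, master_ip):
    """Return True if 'localhost' is listed on a line for master_ip."""
    return "localhost" in hosts_names_by_ip(hosts_text).get(master_ip, [])


# Hypothetical file contents, not copied from either machine.
sample = """
127.0.0.1      localhost
192.168.1.101  machine1
"""
print(localhost_shares_ip(sample, "192.168.1.101"))
```

To run it against the real file, pass open("/etc/hosts").read() and the value of SPARK_MASTER_IP instead of the sample.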
Thanks
Best Regards
On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squ...@gatech.edu> wrote:
Hi all,
I have a 2-machine Spark network I've set up: a
master and worker on machine1, and worker on
machine2. When I run 'sbin/start-all.sh',
everything starts up as it should. I see both
workers listed on the UI page. The logs of both
workers indicate successful registration with the
Spark master.
The problems begin when I attempt to submit a job:
I get an "address already in use" exception that
crashes the program. It says "Failed to bind to"
and lists the exact address and port of the master.
At this point, the only items I have set in my
spark-env.sh are SPARK_MASTER_IP and
SPARK_MASTER_PORT (non-standard, set to 5060).
The next step I took, then, was to explicitly set
SPARK_LOCAL_IP on the master to 127.0.0.1. This
allows the master to successfully send out the
jobs; however, it ends up canceling the stage
after logging the following sequence several times:
14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added: app-20140625210032-0000/8 on worker-20140625205623-machine2-53597 (machine2:53597) with 8 cores
14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB RAM
14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now RUNNING
14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
The "/8" started at "/1", eventually becomes "/9",
and then "/10", at which point the program
crashes. The worker on machine2 shows similar
messages in its logs. Here are the last bunch:
14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9 finished with state FAILED message Command exited with code 1 exitStatus 1
14/06/25 21:00:31 INFO Worker: Asked to launch executor app-20140625210032-0000/10 for app_name
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp" "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar" "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10" "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker" "app-20140625210032-0000"
14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10 finished with state FAILED message Command exited with code 1 exitStatus 1
I highlighted the part that seemed strange to me:
that's the master port number (I set it to 5060),
and yet it's referencing localhost. Is this the
reason why machine2 can't seem to give
a confirmation to the master once the job is
submitted? (The logs from the worker on the master
node indicate that it's running just fine.)
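To make the concern concrete: the highlighted argument appears to be the address the executor is told to connect back to, and on machine2 the name localhost resolves to machine2 itself. A short Python sketch that pulls the host and port out of such a URL and flags loopback hosts follows. The function names are hypothetical, and treating this argument as the connect-back address is my reading of the log, not something confirmed here.

```python
import ipaddress
from urllib.parse import urlsplit


def driver_endpoint(url):
    """Split an akka.tcp://name@host:port/path URL into (host, port)."""
    parts = urlsplit(url)
    return parts.hostname, parts.port


def is_loopback(host):
    """Return True for 'localhost' or any literal loopback IP address."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # not an IP literal, e.g. a hostname such as 'machine1'


# URL copied from the worker's launch command, without the highlight markers.
url = "akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler"
host, port = driver_endpoint(url)
print(host, port, is_loopback(host))
```

A loopback host in an argument handed to a remote machine is usually worth a second look, since the remote side can only reach itself through it.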
I appreciate any assistance you can offer!
Regards,
Shannon Quinn