Can you paste your spark-env.sh file?

Thanks
Best Regards
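For reference, a minimal spark-env.sh for a standalone two-node cluster
usually needs only something like this (the address below is a
placeholder, not taken from your setup):

    # conf/spark-env.sh -- minimal sketch, placeholder address
    export SPARK_MASTER_IP=192.168.1.10   # an address the workers can reach
    export SPARK_MASTER_PORT=5060         # non-standard; the default is 7077
    # SPARK_LOCAL_IP is only needed on multi-homed machines, and on the
    # master it should never point at 127.0.0.1

Anything beyond that in your file would be useful to see.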
On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squ...@gatech.edu> wrote:

> Both /etc/hosts have each other's IP addresses in them. Telnetting from
> machine2 to machine1 on port 5060 works just fine.
>
> Here's the output of lsof:
>
> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
> COMMAND   PID USER  FD  TYPE   DEVICE SIZE/OFF NODE NAME
> java    23985 user 30u IPv6 11092354 0t0 TCP machine1:sip (LISTEN)
> java    23985 user 40u IPv6 11099560 0t0 TCP machine1:sip->machine1:48315 (ESTABLISHED)
> java    23985 user 52u IPv6 11100405 0t0 TCP machine1:sip->machine2:54476 (ESTABLISHED)
> java    24157 user 40u IPv6 11092413 0t0 TCP machine1:48315->machine1:sip (ESTABLISHED)
>
> Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not
> actually running anything there besides Spark, it just does a s/5060/sip/g.
>
> Is there something to the fact that every time I comment out
> SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit
> with an "address already in use" error? Or am I barking up the wrong
> tree on that one?
>
> Thanks again for all your help; I hope we can knock this one out.
>
> Shannon
>
> On 6/26/14, 9:13 AM, Akhil Das wrote:
>
> Do you have <ip> machine1 in your workers' /etc/hosts also? If so, try
> telnetting from your machine2 to machine1 on port 5060. Also make sure
> nothing else is running on port 5060 other than Spark (lsof -i:5060).
>
> Thanks
> Best Regards
>
> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squ...@gatech.edu> wrote:
>
>> Still running into the same problem. /etc/hosts on the master says
>>
>> 127.0.0.1 localhost
>> <ip> machine1
>>
>> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
>> other ideas?
>>
>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>
>> Hi Shannon,
>>
>> It should be a configuration issue; check your /etc/hosts and make
>> sure localhost is not associated with the SPARK_MASTER_IP you provided.
>>
>> Thanks
>> Best Regards
>>
>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squ...@gatech.edu> wrote:
>>
>>> Hi all,
>>>
>>> I have a 2-machine Spark network I've set up: a master and worker on
>>> machine1, and a worker on machine2. When I run 'sbin/start-all.sh',
>>> everything starts up as it should. I see both workers listed on the
>>> UI page. The logs of both workers indicate successful registration
>>> with the Spark master.
>>>
>>> The problems begin when I attempt to submit a job: I get an "address
>>> already in use" exception that crashes the program. It says "Failed
>>> to bind to " and lists the exact port and address of the master.
>>>
>>> At this point, the only items I have set in my spark-env.sh are
>>> SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>>
>>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on
>>> the master to 127.0.0.1.
>>> This allows the master to successfully send out the jobs; however,
>>> it ends up canceling the stage after running this command several
>>> times:
>>>
>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>>> app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>>> (machine2:53597) with 8 cores
>>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor
>>> ID app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores,
>>> 8.0 GB RAM
>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>>> app-20140625210032-0000/8 is now RUNNING
>>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>>> app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>>
>>> The "/8" started at "/1", eventually becomes "/9", and then "/10", at
>>> which point the program crashes. The worker on machine2 shows similar
>>> messages in its logs. Here's the last bunch:
>>>
>>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>> app-20140625210032-0000/10 for app_name
>>> Spark assembly has been built with Hive, including Datanucleus jars on
>>> classpath
>>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
>>> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>> "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10"
>>> "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>> "app-20140625210032-0000"
>>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>>
>>> I highlighted the part that seemed strange to me: that's the master
>>> port number (I set it to 5060), and yet it's referencing localhost?
>>> Is this the reason why machine2 apparently can't give a confirmation
>>> to the master once the job is submitted? (The logs from the worker on
>>> the master node indicate that it's running just fine.)
>>>
>>> I appreciate any assistance you can offer!
>>>
>>> Regards,
>>> Shannon Quinn
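The highlighted akka.tcp://spark@localhost:5060 in that launch command is
the telling part: with SPARK_LOCAL_IP set to 127.0.0.1, the master
advertises itself to the workers as localhost, so the executors on
machine2 try to connect back to themselves on port 5060 and exit with
code 1. A sketch of the likely fix; the 192.168.x.x addresses are
placeholders for the real ones:

    # /etc/hosts on machine1 -- sketch with placeholder addresses
    127.0.0.1      localhost
    192.168.1.10   machine1    # machine1's routable address, not 127.0.0.1
    192.168.1.11   machine2

    # conf/spark-env.sh on machine1
    export SPARK_MASTER_IP=machine1   # now resolves to 192.168.1.10
    export SPARK_MASTER_PORT=5060
    # SPARK_LOCAL_IP left unset, so Spark binds to the routable address

With that in place, the workers' launch commands should reference
akka.tcp://spark@machine1:5060/user/CoarseGrainedScheduler instead of
spark@localhost.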