I got this working by having our sysadmin update our security group to
allow incoming traffic from the local subnet on ports 10000-65535. I'm
not sure if there's a more specific range I could have used, but so far,
everything is running!

Thanks for all the responses Marcelo and Andrew!!

Matt
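For anyone scripting the same change rather than asking a sysadmin: the
rule can also be added with boto, the library spark_ec2.py itself uses.
This is a minimal sketch, assuming one security group shared by the
master, slaves, and driver; the region, group name, and subnet CIDR
below are placeholders to substitute with your own values.

    # Open TCP 10000-65535 to the local subnet on an existing EC2
    # security group, mirroring the manual change described above.
    from boto import ec2

    conn = ec2.connect_to_region("us-east-1")  # placeholder region

    # The one group shared by the master, slaves, and driver machine.
    group = conn.get_all_security_groups(groupnames=["spark-cluster"])[0]

    # Scope the rule to the local subnet rather than 0.0.0.0/0, since
    # these ports carry unauthenticated Spark/Akka traffic.
    group.authorize(ip_protocol="tcp",
                    from_port=10000,
                    to_port=65535,
                    cidr_ip="10.202.0.0/16")  # placeholder subnet CIDR

An alternative to a CIDR rule is authorizing the group to itself via the
src_group argument, which is how spark_ec2.py permits intra-cluster
traffic without naming a subnet.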
On Thu, Jul 17, 2014 at 9:10 PM, Andrew Or <and...@databricks.com> wrote:

> Hi Matt,
>
> The security group shouldn't be an issue; the ports listed in
> `spark_ec2.py` are only for communication with the outside world.
>
> How did you launch your application? I notice you did not launch your
> driver from your master node. What happens if you do? Another thing is
> that there seem to be some inconsistencies or missing pieces in the
> logs you posted. After an executor says "driver disassociated," what
> happens in the driver logs? Is an exception thrown or something?
>
> It would be useful if you could also post your conf/spark-env.sh.
>
> Andrew
>
>
> 2014-07-17 14:11 GMT-07:00 Marcelo Vanzin <van...@cloudera.com>:
>
>> Hi Matt,
>>
>> I'm not very familiar with setup on EC2; the closest I can point you
>> at is the "launch_cluster" function in ec2/spark_ec2.py, where the
>> ports seem to be configured.
>>
>>
>> On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
>> <mattcoarr.w...@gmail.com> wrote:
>> > Thanks Marcelo! This is a huge help!!
>> >
>> > Looking at the executor logs (in a vanilla Spark install, I'm
>> > finding them in $SPARK_HOME/work/*)...
>> >
>> > It launches the executor, but it looks like the
>> > CoarseGrainedExecutorBackend is having trouble talking to the driver
>> > (exactly what you said!!!).
>> >
>> > Do you know the range of random ports used for executor-to-driver
>> > communication? Is that range adjustable? Is there a config setting
>> > or environment variable?
>> >
>> > I manually set up my EC2 security group to include all the ports
>> > that the Spark EC2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in
>> > its security groups. They included (for those listed above 10000):
>> > 19999
>> > 50060
>> > 50070
>> > 50075
>> > 60060
>> > 60070
>> > 60075
>> >
>> > Obviously I'll need to make some adjustments to my EC2 security
>> > group! I just need to figure out exactly what should be in there. To
>> > keep things simple, I have one security group for the master,
>> > slaves, and the driver machine.
>> >
>> > In listing the port ranges in my current security group, I looked at
>> > the ports that spark_ec2.py sets up as well as the ports listed in
>> > the "Spark standalone mode" documentation page under "Configuring
>> > ports for network security":
>> >
>> > http://spark.apache.org/docs/latest/spark-standalone.html
>> >
>> >
>> > Here are the relevant fragments from the executor log:
>> >
>> > Spark Executor Command: "/cask/jdk/bin/java" "-cp"
>> > "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
>> > "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
>> > "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
>> > "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>> > "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
>> > "0" "ip-10-202-8-45.ec2.internal" "8"
>> > "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
>> > "app-20140717195146-0000"
>> >
>> > ========================================
>> >
>> > ...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the
>> > custom-built native-hadoop library...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load
>> > native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop
>> > in java.library.path
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader:
>> > java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>> >
>> > 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load
>> > native-hadoop library for your platform... using builtin-java
>> > classes where applicable
>> >
>> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback:
>> > Falling back to shell based
>> >
>> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group
>> > mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
>> >
>> > 14/07/17 19:51:48 DEBUG Groups: Group mapping
>> > impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
>> > cacheTimeout=300000
>> >
>> > 14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
>> >
>> > ...
>> >
>> > 14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to
>> > driver:
>> > akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
>> >
>> > 14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
>> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>> >
>> > 14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
>> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>> >
>> > 14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver
>> > Disassociated
>> > [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
>> > [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787]
>> > disassociated! Shutting down.
>> >
>> >
>> > Thanks a bunch!
>> > Matt
>> >
>> >
>> > On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin
>> > <van...@cloudera.com> wrote:
>> >>
>> >> When I said the executor log, I meant the log of the process
>> >> launched by the worker, not the worker's own log. In my CDH-based
>> >> Spark install, those end up in /var/run/spark/work.
>> >>
>> >> If you look at your worker log, you'll see it launching the
>> >> executor process, so there should be something there.
>> >>
>> >> Since you say it works when both run on the same node, that
>> >> probably points to a communication issue, since the executor needs
>> >> to connect back to the driver. Check that you don't have any
>> >> firewalls blocking the ports Spark tries to use. (That's one of the
>> >> non-resource-related cases that will cause that message.)
>> >>
>> >> --
>> >> Marcelo
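A closing note on the question Matt raised mid-thread, about whether the
random executor-to-driver port range is adjustable: the driver endpoint
that executors connect back to (port 46787 in the logs above) is chosen
randomly at startup unless it is pinned with the documented
spark.driver.port setting. Below is a minimal sketch in PySpark; the
master URL and port number are placeholders, and the same property can
be set in a Scala/Java SparkConf or in conf/spark-defaults.conf.

    # Pin the driver's Akka port so the firewall only needs one known
    # port open for executor registration. Master URL and port number
    # are placeholder values.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://ip-10-202-11-191.ec2.internal:7077")
            .setAppName("pinned-driver-port")
            .set("spark.driver.port", "46787"))  # any fixed open port

    sc = SparkContext(conf=conf)

This pins only the scheduler endpoint, though: in Spark 1.0.x the block
manager and the HTTP file/broadcast servers still bind random ephemeral
ports, which is why opening a contiguous range such as 10000-65535
within the security group, as described at the top of this thread,
remains the simplest workaround.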