I got this working by having our sysadmin update our security group to
allow incoming traffic from the local subnet on ports 10000-65535.  I'm not
sure if there's a more specific range I could have used, but so far,
everything is running!
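
For anyone who finds this thread later: the "configuring ports for network
security" section of the standalone docs mentions spark.driver.port, which
pins the driver's normally-random listening port, so it should be possible to
open a much narrower range than 10000-65535.  A minimal sketch of what I
think that looks like in PySpark (untested on my end; the port number is an
arbitrary example):

    from pyspark import SparkConf, SparkContext

    # Pin the driver's listening port so executors connect back on a known
    # port instead of a random ephemeral one.  51000 is an arbitrary choice;
    # whatever you pick must be open in the security group.
    conf = (SparkConf()
            .setAppName("port-pinning-sketch")
            .set("spark.driver.port", "51000"))
    sc = SparkContext(conf=conf)

Newer releases apparently add more settings in the same vein (the docs page
lists spark.fileserver.port, spark.broadcast.port, and
spark.blockManager.port), which would let you pin down everything the driver
and executors use, but I haven't tried those.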

Thanks for all the responses, Marcelo and Andrew!!

Matt
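
P.S. If anyone else hits the "Driver Disassociated" error shown below, one
quick sanity check is to test from a worker node whether the driver's
scheduler port is reachable at all.  Something like this (the host and port
are the ones from my executor logs; substitute your own):

    import socket

    # Try a plain TCP connection from a worker node to the port the driver
    # advertised (46787 in the executor command line below).  A timeout or
    # "connection refused" here points at the security group / firewall.
    try:
        socket.create_connection(
            ("ip-10-202-11-191.ec2.internal", 46787), timeout=5).close()
        print("driver port reachable")
    except socket.error as e:
        print("driver port blocked: %s" % e)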


On Thu, Jul 17, 2014 at 9:10 PM, Andrew Or <and...@databricks.com> wrote:

> Hi Matt,
>
> The security group shouldn't be an issue; the ports listed in
> `spark_ec2.py` are only for communication with the outside world.
>
> How did you launch your application? I notice you did not launch your
> driver from your Master node; what happens if you do? Another thing: there
> seem to be some inconsistencies or missing pieces in the logs you posted.
> After an executor says "driver disassociated," what happens in the driver
> logs? Is an exception thrown or something?
>
> It would be useful if you could also post your conf/spark-env.sh.
>
> Andrew
>
>
> 2014-07-17 14:11 GMT-07:00 Marcelo Vanzin <van...@cloudera.com>:
>
>> Hi Matt,
>>
>> I'm not very familiar with setup on EC2; the closest I can point you
>> to is the "launch_cluster" function in ec2/spark_ec2.py, where the
>> ports seem to be configured.
>>
>>
>> On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
>> <mattcoarr.w...@gmail.com> wrote:
>> > Thanks Marcelo!  This is a huge help!!
>> >
>> > Looking at the executor logs (in a vanilla spark install, I'm finding them
>> > in $SPARK_HOME/work/*)...
>> >
>> > It launches the executor, but it looks like the CoarseGrainedExecutorBackend
>> > is having trouble talking to the driver (exactly what you said!!!).
>> >
>> > Do you know the range of random ports used for the executor-to-driver
>> > connection?  Is that range adjustable?  Any config setting or
>> > environment variable?
>> >
>> > I manually set up my EC2 security group to include all the ports that the
>> > spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
>> > groups.  They included (for those listed above 10000):
>> > 19999
>> > 50060
>> > 50070
>> > 50075
>> > 60060
>> > 60070
>> > 60075
>> >
>> > Obviously I'll need to make some adjustments to my EC2 security group!
>> > Just need to figure out exactly what should be in there.  To keep things
>> > simple, I just have one security group for the master, slaves, and the
>> > driver machine.
>> >
>> > In listing the port ranges in my current security group I looked at the
>> > ports that spark_ec2.py sets up as well as the ports listed in the "spark
>> > standalone mode" documentation page under "configuring ports for network
>> > security":
>> >
>> > http://spark.apache.org/docs/latest/spark-standalone.html
>> >
>> >
>> > Here are the relevant fragments from the executor log:
>> >
>> > Spark Executor Command: "/cask/jdk/bin/java" "-cp"
>> > "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
>> > "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
>> > "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
>> > "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>> > "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
>> > "0" "ip-10-202-8-45.ec2.internal" "8"
>> > "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
>> > "app-20140717195146-0000"
>> >
>> > ========================================
>> >
>> > ...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
>> > native-hadoop library...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with
>> > error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader:
>> > java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>> >
>> > 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
>> > library for your platform... using builtin-java classes where applicable
>> >
>> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back
>> > to shell based
>> >
>> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping
>> > impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
>> >
>> > 14/07/17 19:51:48 DEBUG Groups: Group mapping
>> > impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
>> > cacheTimeout=300000
>> >
>> > 14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
>> >
>> > ...
>> >
>> >
>> > 14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
>> > akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
>> >
>> > 14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
>> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>> >
>> > 14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
>> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>> >
>> > 14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
>> > [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
>> > [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
>> > Shutting down.
>> >
>> >
>> > Thanks a bunch!
>> > Matt
>> >
>> >
>> > On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>> >>
>> >> When I said the executor log, I meant the log of the process launched
>> >> by the worker, not the worker's own log.  In my CDH-based Spark install,
>> >> those end up in /var/run/spark/work.
>> >>
>> >> If you look at your worker log, you'll see it's launching the executor
>> >> process. So there should be something there.
>> >>
>> >> Since you say it works when both are run on the same node, that
>> >> probably points to a communication issue, since the executor needs
>> >> to connect back to the driver.  Check that you don't have any
>> >> firewalls blocking the ports Spark tries to use.  (That's one of the
>> >> non-resource-related cases that will cause that message.)
>>
>>
>>
>> --
>> Marcelo
>>
>
>
