Mayur,

So, looking at the section on environment variables
here<http://spark.incubator.apache.org/docs/latest/configuration.html#environment-variables>,
are you suggesting I set these properties via SPARK_JAVA_OPTS with -D flags?
Separately, while looking around I came across
XMLStarlet<http://xmlstar.sourceforge.net/overview.php>, a command-line tool
for modifying XML files. Perhaps I should instead use it to set the S3 keys
directly in the right core-site.xml.
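
If XMLStarlet's edit mode works the way its overview page suggests, I'm
picturing something along these lines (untested sketch; the key value and the
target file are placeholders I'd still need to confirm):

# rough, untested sketch: append a <property> block to core-site.xml in place
xmlstarlet ed -L \
  -s /configuration -t elem -n property -v "" \
  -s "/configuration/property[last()]" -t elem -n name -v fs.s3.awsAccessKeyId \
  -s "/configuration/property[last()]" -t elem -n value -v MY_ACCESS_KEY_ID \
  /root/ephemeral-hdfs/conf/core-site.xml

with the same again for fs.s3.awsSecretAccessKey.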

Devs/Everyone,

On a related note, I discovered that Spark (on EC2) reads Hadoop options
from /root/ephemeral-hdfs/conf/core-site.xml.

This is surprising given how many copies of core-site.xml exist on the EC2
cluster that spark-ec2 builds. A quick search turns up the following relevant
results (snipped):

find / -name core-site.xml 2> /dev/null

/root/mapreduce/conf/core-site.xml
/root/persistent-hdfs/conf/core-site.xml
/root/ephemeral-hdfs/conf/core-site.xml
/root/spark/conf/core-site.xml


It looks like both pyspark and ephemeral-hdfs/bin/hadoop read configs from
the ephemeral-hdfs core-site.xml file. The latter is expected; the former
is not. Is this intended behavior?
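
One quick way to see what pyspark actually picks up, for anyone who wants to
reproduce this, is to go through its private _jsc handle (a hack, not a
supported API):

sc._jsc.hadoopConfiguration().get("fs.s3.awsAccessKeyId")  # private handle; just for inspection

which returns whatever value pyspark's underlying Hadoop configuration resolved.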

I expected pyspark to read its configs from the spark core-site.xml file. As
soon as I remove my AWS credentials from the ephemeral-hdfs config file,
pyspark can no longer open files in S3 unless I provide the credentials
in-line.
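
(By providing the credentials in-line I mean embedding them in the S3 URL
itself from the pyspark shell, along these lines, with placeholder bucket,
path, and keys:

sc.textFile("s3n://MY_ACCESS_KEY_ID:MY_SECRET_ACCESS_KEY@my-bucket/path/to/file")  # placeholders throughout

which works, but is exactly the repetition I'm trying to avoid.)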

I also guessed that the config file under /root/mapreduce might be a kind of
base config that both Spark and Hadoop read first and then override with the
other files. The path suggests as much, but that doesn't appear to be the
case: adding my AWS keys to that file seemed to affect neither Spark nor
ephemeral-hdfs/bin/hadoop.
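
(Throughout, by adding my AWS keys I mean the standard Hadoop property
entries, with placeholder values:

<!-- placeholder values -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>MY_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>MY_SECRET_ACCESS_KEY</value>
</property>

in case the exact form matters.)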

Nick


On Fri, Mar 7, 2014 at 2:07 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> Set them as environment variables at boot & configure both stacks to call
> on that.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Fri, Mar 7, 2014 at 9:32 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> On spinning up a Spark cluster in EC2, I'd like to set a few configs that
>> will allow me to access files in S3 without having to specify my AWS access
>> and secret keys over and over, as described
>> here<http://stackoverflow.com/a/3033403/877069>.
>>
>> The properties are fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey.
>>
>> Is there a way to set these properties programmatically so that Spark
>> (via the shell) and Hadoop (via distcp) are both aware of and use the
>> values?
>>
>> I don't think SparkConf does what I need because I want Hadoop to also be
>> aware of my AWS keys. When I set those properties using conf.set() in
>> pyspark, distcp didn't appear to be aware of them.
>>
>> Nick
>>
>>
>> ------------------------------
>> View this message in context: Setting properties in core-site.xml for
>> Spark and Hadoop to access<http://apache-spark-user-list.1001560.n3.nabble.com/Setting-properties-in-core-site-xml-for-Spark-and-Hadoop-to-access-tp2402.html>
>> Sent from the Apache Spark User List mailing list archive<http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>
