Digging up this thread to ask a follow-up question: What is the intended use for /root/spark/conf/core-site.xml?
It seems that both /root/spark/bin/pyspark and /root/ephemeral-hdfs/bin/hadoop point to /root/ephemeral-hdfs/conf/core-site.xml. If I specify S3 access keys in spark/conf, Spark doesn't seem to pick them up.

Nick

On Fri, Mar 7, 2014 at 4:10 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Mayur,
>
> Looking at the section on environment variables here <http://spark.incubator.apache.org/docs/latest/configuration.html#environment-variables>, are you saying to set these options via SPARK_JAVA_OPTS -D? On a related note, while looking around I just discovered a command-line tool for modifying XML files called XMLStarlet <http://xmlstar.sourceforge.net/overview.php>. Perhaps I should instead set these S3 keys directly in the right core-site.xml using XMLStarlet.
>
> Devs/Everyone,
>
> On a related note, I discovered that Spark (on EC2) reads Hadoop options from /root/ephemeral-hdfs/conf/core-site.xml.
>
> This is surprising given the variety of copies of core-site.xml on the EC2 cluster that gets built by spark-ec2. A quick search yields the following relevant results (snipped):
>
> find / -name core-site.xml 2> /dev/null
>
> /root/mapreduce/conf/core-site.xml
> /root/persistent-hdfs/conf/core-site.xml
> /root/ephemeral-hdfs/conf/core-site.xml
> /root/spark/conf/core-site.xml
>
> It looks like both pyspark and ephemeral-hdfs/bin/hadoop read configs from the ephemeral-hdfs core-site.xml file. The latter is expected; the former is not. Is this intended behavior?
>
> I expected pyspark to read configs from the spark core-site.xml file. The moment I remove my AWS credentials from the ephemeral-hdfs config file, pyspark cannot open files in S3 without me providing the credentials in-line.
>
> I also guessed that the config file under /root/mapreduce might be a kind of base config file that both Spark and Hadoop would read from first and then override with configs from the other files. The path to the config suggests that, but it doesn't appear to be the case. Adding my AWS keys to that file seemed to affect neither Spark nor ephemeral-hdfs/bin/hadoop.
>
> Nick
>
>
> On Fri, Mar 7, 2014 at 2:07 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>> Set them as environment variables at boot & configure both stacks to call on that.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>> On Fri, Mar 7, 2014 at 9:32 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> On spinning up a Spark cluster in EC2, I'd like to set a few configs that will allow me to access files in S3 without having to specify my AWS access and secret keys over and over, as described here <http://stackoverflow.com/a/3033403/877069>.
>>>
>>> The properties are fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey.
>>>
>>> Is there a way to set these properties programmatically so that Spark (via the shell) and Hadoop (via distcp) are both aware of and use the values?
>>>
>>> I don't think SparkConf does what I need, because I want Hadoop to also be aware of my AWS keys. When I set those properties using conf.set() in pyspark, distcp didn't appear to be aware of them.
>>>
>>> Nick
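
For concreteness, the core-site.xml entries under discussion take this form (a sketch; the values below are placeholders for your own keys):

  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>

The open question above is which of the copies found by find (mapreduce, persistent-hdfs, ephemeral-hdfs, or spark) each tool actually reads these entries from.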