Re: SparkSQL integration issue with AWS S3a

Jerry Lam Wed, 30 Dec 2015 11:38:25 -0800

Hi Kostiantyn,

I'd first like to confirm that it works with hdfs-site.xml. If it does, you
could define a separate spark-{user-x}.conf per user and source the right one
during spark-submit. Let us know whether hdfs-site.xml works first. It should.
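
A minimal sketch of the per-user conf idea, under stated assumptions. The
object PerUserConf and the file name spark-user-x.conf are hypothetical and not
part of Spark. The sketch relies on Spark copying entries prefixed with
"spark.hadoop." into the Hadoop configuration, and on spark-submit accepting
--properties-file; check both against your Spark version before depending on
them.

```scala
// Hypothetical helper, not part of Spark: builds the contents of a per-user
// properties file that could be passed to spark-submit via --properties-file.
object PerUserConf {
  // Assumption: Spark copies every "spark.hadoop.<key>" entry into the Hadoop
  // Configuration as "<key>".
  private val HadoopPrefix = "spark.hadoop."

  // The same three S3a settings used in the original code, with the prefix added.
  def s3aProperties(accessKey: String, secretKey: String): Seq[(String, String)] =
    Seq(
      "fs.s3a.impl" -> "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.s3a.access.key" -> accessKey,
      "fs.s3a.secret.key" -> secretKey
    ).map { case (key, value) => (HadoopPrefix + key, value) }

  // One "key value" pair per line, the same layout as spark-defaults.conf.
  def render(props: Seq[(String, String)]): String =
    props.map { case (key, value) => s"$key $value" }.mkString("\n")
}
```

Each job would then be submitted with its own file, for example
spark-submit --properties-file spark-user-x.conf, so that two jobs on the same
instances can carry different AWS credentials.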


Best Regards,

Jerry

Sent from my iPhone

> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev 
> <[email protected]> wrote:
> 
> Hi Jerry,
> 
> I want to run different jobs against different S3 buckets, each with its own
> AWS credentials, on the same instances. Could you shed some light on whether
> that is possible with hdfs-site.xml?
> 
> Thank you,
> Konstantin Kudryavtsev
> 
>> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam <[email protected]> wrote:
>> Hi Kostiantyn,
>> 
>> Can you define those properties in hdfs-site.xml and make sure the file is
>> visible on the classpath when you run spark-submit? It looks like a conf
>> sourcing issue to me.
>> 
>> Cheers,
>> 
>> Sent from my iPhone
>> 
>>> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
>>> <[email protected]> wrote:
>>> 
>>> Chris,
>>> 
>>> thanks for the hint about IAM roles, but in my case I need to run different
>>> jobs with different S3 permissions on the same cluster, so this approach
>>> doesn't work for me, as far as I understand it.
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
>>>> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly <[email protected]> wrote:
>>>> a couple of things:
>>>> 
>>>> 1) switch to IAM roles if at all possible - explicitly passing AWS 
>>>> credentials is a long and lonely road in the end
>>>> 
>>>> 2) one really bad workaround/hack is to run a job that hits every worker 
>>>> and writes the credentials to the proper location (~/.awscredentials or 
>>>> whatever)
>>>> 
>>>> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
>>>> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
>>>> 
>>>> if you switch to IAM roles, things become a lot easier as you can
>>>> authorize all of the EC2 instances in the cluster - this handles
>>>> autoscaling very well - and at some point, you will want to autoscale.
>>>> 
>>>>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
>>>>> <[email protected]> wrote:
>>>>> Chris,
>>>>> 
>>>>> good question. As you can see from the code, I set them up on the driver,
>>>>> so I expect they will be propagated to all nodes, won't they?
>>>>> 
>>>>> Thank you,
>>>>> Konstantin Kudryavtsev
>>>>> 
>>>>>> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly <[email protected]> wrote:
>>>>>> are the credentials visible from each Worker node to all the Executor 
>>>>>> JVMs on each Worker?
>>>>>> 
>>>>>>> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev 
>>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>> Dear Spark community,
>>>>>>> 
>>>>>>> I ran into the following issue while trying to access data on S3a. My
>>>>>>> code is the following:
>>>>>>> 
>>>>>>> import org.apache.spark.{SparkConf, SparkContext}
>>>>>>> import org.apache.spark.sql.SQLContext
>>>>>>> 
>>>>>>> val sparkConf = new SparkConf()
>>>>>>> 
>>>>>>> val sc = new SparkContext(sparkConf)
>>>>>>> sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>>>>>>> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
>>>>>>> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>>>>>>> val sqlContext = SQLContext.getOrCreate(sc)
>>>>>>> val df = sqlContext.read.parquet(...)
>>>>>>> df.count
>>>>>>> 
>>>>>>> It results in the following exception and log messages:
>>>>>>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials from BasicAWSCredentialsProvider: Access key or secret key is null
>>>>>>> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance metadata service at URL: http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>>>>>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials from InstanceProfileCredentialsProvider: The requested metadata is not found at http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>>>>>> 15/12/30 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
>>>>>>> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>>>>>>>         at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>>>>>>         at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>>>>>>>         at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>>>>>>>         at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>>>>>>>         at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>>>>>>> 
>>>>>>> I am running standalone Spark 1.5.2 with Hadoop 2.7.1.
>>>>>>> 
>>>>>>> any ideas/workarounds?
>>>>>>> 
>>>>>>> AWS credentials are correct for this bucket
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Konstantin Kudryavtsev
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> 
>>>> Chris Fregly
>>>> Principal Data Solutions Engineer
>>>> IBM Spark Technology Center, San Francisco, CA
>>>> http://spark.tc | http://advancedspark.com
>
