Hi Roberto,

I have written PySpark code that reads from private S3 buckets; it should
work much the same for public buckets. You need to set the AWS access key
and secret key on the SparkContext's Hadoop configuration, and then you can
access S3 folders and files with their s3n:// paths. Something like this:

from pyspark import SparkContext

sc = SparkContext()
# pass the AWS credentials down to the underlying Hadoop configuration
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_key)

# read from S3, transform, and write back to S3 (s3n:// on both ends)
sc.textFile("s3n://mybucket/my_input_folder") \
    .map(lambda x: do_something(x)) \
    .saveAsTextFile("s3n://mybucket/my_output_folder")
...

You can read and write sequence files as well - these are the only two
formats I have tried, but I'm sure others like JSON would work too.
Another approach is to embed the AWS access key and secret key directly
into the s3n:// path, as sketched below.
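
Here is a rough sketch of both ideas - untested as written, with
placeholder bucket/folder names; note that the embedded keys may need
URL-escaping if the secret key contains "/" characters:

# sequence files hold (key, value) pairs
pairs = sc.sequenceFile("s3n://mybucket/my_seq_input")
pairs.saveAsSequenceFile("s3n://mybucket/my_seq_output")

# credentials embedded in the path instead of the Hadoop configuration
mydata = sc.textFile("s3n://" + aws_access_key + ":" + aws_secret_key +
                     "@mybucket/my_input_folder")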

I wasn't able to use the s3:// protocol, but s3n:// works just as well for
access (I believe one is just an older version of the other, but I'm not
sure).

Hope this helps,
Sujit


On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto <rpagli...@appcomsci.com
> wrote:

> Is there an example about how to load data from a public S3 bucket in
> Python? I haven’t found any.
>
> Thank you,
