Use s3a://, especially on Hadoop 2.7+. It uses the Amazon libraries and is faster
for directory lookups than jets3t.
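A minimal sketch of what the switch involves, assuming a Spark build against Hadoop 2.7+ with the hadoop-aws and AWS SDK jars on the classpath (property names from the Hadoop s3a docs; the credential values and bucket name are placeholders):

```
# spark-defaults.conf (or pass via --conf on spark-submit)
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY

# then read with the s3a scheme instead of s3n:
#   sc.textFile("s3a://my-bucket/path/to/key")
```

In many deployments (e.g. EC2 with instance roles) the credential properties can be dropped and the provider chain picks them up automatically.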
> On 13 Jan 2016, at 11:42, Darin McBeath wrote:
>
> I'm looking for some suggestions based on others' experiences.
>
> I currently have a job that I need to run periodically where I
> what I'm doing already. Was just thinking there might be a better way.
>
> Darin.
> --
> *From:* Daniel Imberman
> *To:* Darin McBeath ; User
> *Sent:* Wednesday, January 13, 2016 2:48 PM
> *Subject:* Re: Best practice for retrievin
Hi Darin,
You should read this article; sc.textFile is very inefficient on S3.
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
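The gist of the approach that article argues for is to distribute the key list itself and fetch each object inside the workers, rather than handing a million paths to sc.textFile. A hedged sketch of that shape in plain Python, with the Spark and S3 pieces stubbed out (in real code `chunk`/`read_all` would be `sc.parallelize(keys, n).mapPartitions(...)` and `fetch` would wrap an S3 GET via the AWS SDK; all names here are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(keys, n):
    """Split the key list into n roughly equal partitions,
    mirroring what sc.parallelize(keys, n) would do."""
    return [keys[i::n] for i in range(n)]

def fetch_partition(keys, fetch):
    """Process one partition: in Spark this body would be the
    mapPartitions function, so each partition can reuse one S3
    client for many keys instead of one listing per path."""
    return [fetch(k) for k in keys]

def read_all(keys, fetch, partitions=4):
    # Stand-in for:
    #   sc.parallelize(keys, partitions).mapPartitions(...).collect()
    with ThreadPoolExecutor(max_workers=partitions) as ex:
        parts = ex.map(lambda p: fetch_partition(p, fetch),
                       chunk(keys, partitions))
        return [item for part in parts for item in part]
```

The win is that no per-path existence check or directory listing happens on the driver; the keys are just data that gets shipped to the executors.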
Cheers
On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath wrote:
I'm looking for some suggestions based on others' experiences.
I currently have a job that I need to run periodically where I need to read on
the order of 1+ million files from an S3 bucket. It is not the entire bucket
(nor does it match a pattern). Instead, I have a list of random keys that a