Hi Darin,

You should read this article; textFile is very inefficient with S3:
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219

Cheers

On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:

> I'm looking for some suggestions based on others' experiences.
>
> I currently have a job that I need to run periodically where I need to
> read on the order of 1+ million files from an S3 bucket. It is not the
> entire bucket (nor does it match a pattern). Instead, I have a list of
> random keys that are 'names' for the files in this S3 bucket. The bucket
> itself will contain upwards of 60M or more files.
>
> My current approach has been to get my list of keys, partition on the key,
> and then map this to an underlying class that uses the most recent AWS SDK
> to retrieve the file from S3 using this key, which then returns the file.
> So, in the end, I have an RDD<String>. This works, but I really wonder if
> this is the best way. I suspect there might be a better/faster way.
>
> One thing I've been considering is passing all of the keys (using s3n:
> urls) to sc.textFile or sc.wholeTextFiles (since some of my files can have
> embedded newlines). But, I wonder how either of these would behave if I
> passed literally a million (or more) 'filenames'.
>
> Before I spend time exploring, I wanted to seek some input.
>
> Any thoughts would be appreciated.
>
> Darin.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
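
For what it's worth, for a known list of keys the approach you already have (parallelize the keys and fetch each object with the AWS SDK) is generally the right one; handing a million s3n: paths to sc.textFile or sc.wholeTextFiles would likely mean a million input splits plus per-path listing overhead, which is the kind of behaviour the article above describes. A rough sketch of that pattern in Scala, assuming the 1.x aws-java-sdk; the bucket name, key source, partition count, and class/helper names below are made-up placeholders you'd swap for your own:

import scala.io.Source

import com.amazonaws.services.s3.AmazonS3Client
import org.apache.spark.{SparkConf, SparkContext}

object FetchByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fetch-by-key"))

    // Hypothetical placeholders: your bucket and however you obtain the ~1M keys.
    val bucket = "my-bucket"
    val keys: Seq[String] = loadKeyList()

    // Enough partitions to keep the cluster busy, but far fewer than one task per key.
    val numPartitions = 1000

    val contents = sc.parallelize(keys, numPartitions).mapPartitions { iter =>
      // One S3 client per partition, reused for every key in that partition
      // (uses the default credential provider chain).
      val s3 = new AmazonS3Client()
      iter.map { key =>
        val obj = s3.getObject(bucket, key)
        try Source.fromInputStream(obj.getObjectContent, "UTF-8").mkString
        finally obj.close()
      }
    }

    println(contents.count())
    sc.stop()
  }

  // Stand-in for wherever the key list really comes from (manifest file, DB query, ...).
  def loadKeyList(): Seq[String] = Seq("path/to/file1.xml", "path/to/file2.xml")
}

Each RDD element is the whole object read into one String, so embedded newlines aren't a problem. mapPartitions keeps the client (and its connection pool) per partition rather than per key, and tuning numPartitions so each task pulls a few thousand objects usually works better than either one task per key or a handful of huge tasks.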