Hi Darin,

You should read this article; textFile is very inefficient with S3:
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219

Cheers

On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:

> I'm looking for some suggestions based on others' experiences.
>
> I currently have a job that I need to run periodically where I need to
> read on the order of 1+ million files from an S3 bucket. It is not the
> entire bucket (nor does it match a pattern). Instead, I have a list of
> random keys that are 'names' for the files in this S3 bucket. The bucket
> itself will contain upwards of 60M or more files.
>
> My current approach has been to get my list of keys, partition on the key,
> and then map this to an underlying class that uses the most recent AWS SDK
> to retrieve the file from S3 using this key, which then returns the file.
> So, in the end, I have an RDD<String>. This works, but I really wonder if
> this is the best way. I suspect there might be a better/faster way.
>
> One thing I've been considering is passing all of the keys (using s3n:
> urls) to sc.textFile or sc.wholeTextFiles (since some of my files can have
> embedded newlines). But, I wonder how either of these would behave if I
> passed literally a million (or more) 'filenames'.
>
> Before I spend time exploring, I wanted to seek some input.
>
> Any thoughts would be appreciated.
>
> Darin.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
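
For what it's worth, for a known list of keys the approach you already have (parallelize the keys and fetch each object with the AWS SDK) is generally the right one; handing a million s3n: paths to sc.textFile or sc.wholeTextFiles would likely mean a million input splits plus per-path listing overhead, which is the kind of behaviour the article above describes. A rough sketch of that pattern in Scala, assuming the 1.x aws-java-sdk; the bucket name, key source, partition count, and class/helper names below are made-up placeholders you'd swap for your own:

import scala.io.Source

import com.amazonaws.services.s3.AmazonS3Client
import org.apache.spark.{SparkConf, SparkContext}

object FetchByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fetch-by-key"))

    // Hypothetical placeholders: your bucket and however you obtain the ~1M keys.
    val bucket = "my-bucket"
    val keys: Seq[String] = loadKeyList()

    // Enough partitions to keep the cluster busy, but far fewer than one task per key.
    val numPartitions = 1000

    val contents = sc.parallelize(keys, numPartitions).mapPartitions { iter =>
      // One S3 client per partition, reused for every key in that partition
      // (uses the default credential provider chain).
      val s3 = new AmazonS3Client()
      iter.map { key =>
        val obj = s3.getObject(bucket, key)
        try Source.fromInputStream(obj.getObjectContent, "UTF-8").mkString
        finally obj.close()
      }
    }

    println(contents.count())
    sc.stop()
  }

  // Stand-in for wherever the key list really comes from (manifest file, DB query, ...).
  def loadKeyList(): Seq[String] = Seq("path/to/file1.xml", "path/to/file2.xml")
}

Each RDD element is the whole object read into one String, so embedded newlines aren't a problem. mapPartitions keeps the client (and its connection pool) per partition rather than per key, and tuning numPartitions so each task pulls a few thousand objects usually works better than either one task per key or a handful of huge tasks.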