On 28 Apr 2017, at 16:10, Anubhav Agarwal <anubha...@gmail.com> wrote:
> Are you using Spark's textFiles method? If so, go through this blog:
> http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219

Old/dated blog post. If you get the Hadoop 2.8 binaries on your classpath, s3a does a full directory tree listing if you give it a simple path like "s3a://bucket/events". The example in that post was using a complex wildcard, which hasn't yet been sped up, as it's pretty hard to do in a way which works effectively everywhere. Having all your data in one directory works nicely.

> Anubhav
>
> On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia <bardia.afs...@capitalone.com> wrote:
>
>> Hi there,
>>
>> I have a process that downloads thousands of files from an S3 bucket, removes a set of columns from each, and uploads them back to S3. S3 is currently not the bottleneck; having a Single Master Node Spark instance is the bottleneck. One approach is to distribute the files on multiple Spark Master Node workers; that will make it faster.

Yes: more than one worker will make it faster, provided the work can be partitioned.

>> Question:
>>
>> 1. Is there a way to utilize master/slave nodes on Spark to distribute this downloading and processing of files – so it can, say, do 10 files at a time?

Yes, they are called RDDs, DataFrames & Datasets. If you are doing all the processing on the Spark driver, then you aren't really using Spark much, more just processing the files in Scala.

To get a dataframe:

  val df = spark.read.format("csv").load("s3a://bucket/data")

You now have a dataset over all the files in the directory /data in the bucket, which will be partitioned however Spark decides (which depends on: # of workers, the compression format used and its splittability).

Assuming you can configure the dataframe with the column structure, you can filter aggressively by selecting only those columns you want:

  val filteredDf = df.select("rental", "start_time")
  filteredDf.write.save("hdfs://final/processed")

Then, once you've got all the data done, copy it up to S3 via distcp.

I'd recommend you start with a small number of files locally, get the code working, then see if you can use it with S3 as the source/dest of the data (again locally if you want; it's just slow), then move to in-EC2 for the bandwidth. Bandwidth-wise, there are some pretty major performance issues with the s3n connector. S3a in Hadoop 2.7+ works, with Hadoop 2.8 having a lot more speedup, especially when using ORC and Parquet as a source, where there's a special "random access mode".

Further reading:
https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-spark/index.html
https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-performance/index.html

>> 2. Is there a way to scale workers with Spark downloading and processing files, even if they are all Single Master Node?

I think there may be some terminology confusion here. You are going to have one process which is the Spark driver: either on your client machine, deployed somewhere in the cluster via YARN/Mesos, or running at a static location within a Spark standalone cluster. Everything other than the driver process is a worker, which is what does the actual work.
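
To make that concrete, here's a rough end-to-end sketch of the kind of job I mean. The bucket/prefix, the "rental"/"start_time" columns and the HDFS output path are just the placeholders from the example above, and it assumes the CSV files carry a header row:

  // Minimal sketch of the pipeline discussed above: read CSVs from S3 via s3a,
  // drop the unwanted columns by selecting only the ones you keep, write to HDFS.
  import org.apache.spark.sql.SparkSession

  object TrimColumns {
    def main(args: Array[String]): Unit = {
      // This builds the driver's SparkSession; where the driver runs is decided
      // by how you submit the job (client machine, YARN/Mesos, or standalone).
      val spark = SparkSession.builder()
        .appName("trim-columns")
        .getOrCreate()

      // One dataframe over every file under the prefix; the reads and the
      // column pruning happen on the executors, not on the driver.
      val df = spark.read
        .option("header", "true")        // assumption: CSVs have a header row
        .csv("s3a://bucket/data")        // placeholder bucket/prefix

      df.select("rental", "start_time")  // placeholder column names
        .write
        .mode("overwrite")
        .csv("hdfs://final/processed")   // placeholder path, as in the example

      spark.stop()
    }
  }

Run it with spark-submit against whichever cluster manager you're using, and once the HDFS output is complete, a plain "hadoop distcp hdfs://final/processed s3a://bucket/processed" (paths again being placeholders) copies the results up to S3.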