On 28 Apr 2017, at 16:10, Anubhav Agarwal <anubha...@gmail.com> wrote:

Are you using Spark's textFiles method? If so, go through this blog:
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219


old/dated blog post.

If you get the Hadoop 2.8 binaries on your classpath, s3a does a full directory 
tree listing if you give it a simple path like "s3a://bucket/events". The 
example in that post was using a complex wildcard, which hasn't yet been sped 
up as it's pretty hard to do in a way which works effectively everywhere.

Having all your data in 1 dir works nicely.
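
For illustration only, a minimal sketch of the two kinds of path (the bucket
and layout here are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-listing").getOrCreate()

// Simple directory path: with the Hadoop 2.8 binaries, s3a does one full
// directory tree listing for this
val events = spark.sparkContext.textFile("s3a://bucket/events")

// Complex wildcard: the glob expansion hasn't been sped up yet, so this is
// much slower against S3
val filtered = spark.sparkContext.textFile("s3a://bucket/*/2017-*/events-*.csv")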


Anubhav

On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia <bardia.afs...@capitalone.com> wrote:
Hi there,

I have a process that downloads thousands of files from an S3 bucket, removes a 
set of columns from each, and uploads them back to S3.

S3 is currently not the bottleneck; having a Single Master Node Spark instance 
is the bottleneck. One approach is to distribute the files on multiple Spark 
Master Node workers, which will make it faster.

yes: > 1 worker, provided the work can be partitioned


Question:

1.       Is there a way to utilize master/slave nodes in Spark to distribute 
this downloading and processing of files, so it can, say, do 10 files at a time?


yes, they are called RDDs/Dataframes & Datasets


If you are doing all the processing on the Spark driver, then you aren't really 
using Spark much; you're just processing the files in Scala.

To get a dataframe (with spark being your SparkSession):

val df = spark.read.format("csv").load("s3a://bucket/data")

You now have a dataset covering all the files in the directory /data in the 
bucket, which will be partitioned however Spark decides (which depends on: # of 
workers, the compression format used and its splittability). Assuming you can 
configure the dataframe with the column structure, you can filter aggressively 
by selecting only those columns you want:

val filteredDf = df.select("rental", "start_time")
filteredDf.write.save("hdfs://final/processed")
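
A minimal sketch of putting those pieces together with an explicit schema, 
assuming Spark 2.x; the column names and types other than "rental" and 
"start_time" are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, TimestampType}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical column structure matching the CSV files
val schema = StructType(Seq(
  StructField("rental", StringType),
  StructField("start_time", TimestampType),
  StructField("end_time", TimestampType)   // an example column you'd drop
))

val df = spark.read
  .format("csv")
  .schema(schema)          // use this schema rather than inferring one
  .load("s3a://bucket/data")

val filteredDf = df.select("rental", "start_time")
filteredDf.write.format("csv").save("hdfs://final/processed")

Declaring the schema up front means Spark doesn't have to work one out itself, 
and the select prunes the columns before anything is written.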

Then, once you've got all the data done, copy it up to S3 via distcp.

I'd recommend you start doing this with a small number of files locally, 
getting the code working, then see if you can use it with s3 as the source/dest 
of data, again, locally if you want (it's just slow), then move to in-EC2 for 
the bandwidth.

Bandwidth-wise, there are some pretty major performance issues with the s3n 
connector. S3a in Hadoop 2.7+ works, with Hadoop 2.8 having a lot more 
speedup, especially when using ORC and Parquet as a source, where there's a 
special "random access mode".

Further reading:
https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-spark/index.html

https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-performance/index.html


2.       Is there a way to scale workers with Spark downloading and processing 
files, even if they are all Single Master Node?



I think there may be some terminology confusion here. You are going to have to 
have one process which is the Spark driver: either on your client machine, 
deployed somewhere in the cluster via YARN/Mesos, or running in a static 
location within a Spark standalone cluster. Every process other than the driver 
is a worker, which will do the work.


