everything works best if your sources are a few tens to hundreds of MB or
more

Are you referring to the size of the zip file or individual unzipped files?

Any issues with storing a 60 MB zip file containing heaps of text files
inside?

On 11 Apr. 2017 9:09 pm, "Steve Loughran" <ste...@hortonworks.com> wrote:

>
> > On 11 Apr 2017, at 11:07, Zeming Yu <zemin...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I'm a beginner with Spark, and I'm wondering if someone could provide
> > guidance on the following two questions.
> >
> > Background: I have a data set growing by 6 TB p.a. I plan to use Spark
> > to read in all the data, manipulate it, and build a predictive model on
> > it (say a GBM). I plan to store the data in S3 and use EMR to launch
> > Spark, reading the data in from S3.
> >
> > 1. Which option is best for storing the data on S3 for the purpose of
> > analysing it in Spark on EMR?
> > Option A: storing the 6 TB of data as 173 million individual text files
> > Option B: zipping up the above 173 million text files into 240,000 zip
> > files
> > Option C: appending the individual text files, so I'd have 240,000 text
> > files p.a.
> > Option D: combining the text files even further
> >
>
> everything works best if your sources are a few tens to hundreds of MB or
> more of your data; work can be partitioned up by file. If you use more
> structured formats (Avro compressed with Snappy, etc.), you can throw > 1
> executor at the work inside a file. Structure is handy all round, even if
> it's just adding timestamp and provenance columns to each data file.
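
For illustration, here is a minimal PySpark sketch of the structured-format-
plus-provenance-columns idea above. It writes Snappy-compressed Parquet
rather than Avro only because Parquet support is built in, and the bucket,
paths and column names are all made up:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("consolidate-small-files").getOrCreate()

    # Read the raw text files (hypothetical bucket/prefix) and record where
    # each row came from plus when it was ingested.
    raw = spark.read.text("s3a://my-bucket/raw/2016/")
    enriched = (raw
                .withColumn("source_file", F.input_file_name())
                .withColumn("ingest_ts", F.current_timestamp()))

    # Write back in a structured, splittable, Snappy-compressed format,
    # aiming for a modest number of files of tens to hundreds of MB each.
    (enriched
     .repartition(200)
     .write
     .option("compression", "snappy")
     .parquet("s3a://my-bucket/structured/2016/"))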
>
> there's the HAR file format from Hadoop, which can merge lots of small
> files into larger ones, allowing work to be scheduled per HAR file. It's
> recommended for HDFS, as HDFS hates small files; on S3 you still have
> limits on small files (including throttling of HTTP requests to shards of
> a bucket), but they are less significant.
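
If the HAR route is taken, the flow looks roughly like the sketch below. The
archive is built outside Spark with the Hadoop archive tool, and the paths
here are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The archive is created up front with the Hadoop archive tool, roughly:
    #   hadoop archive -archiveName texts-2016.har -p /data/raw/2016 /data/har
    # which bundles the files under /data/raw/2016 into /data/har/texts-2016.har.
    # Spark then reads the bundled files back through the har:// filesystem.
    df = spark.read.text("har:///data/har/texts-2016.har")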
>
> One thing to be aware of is that the S3 clients Spark uses are very
> inefficient at listing wide directory trees, and because of this Spark is
> not always the best at partitioning work. You can accidentally create a
> very inefficient tree structure like
> datasets/year=2017/month=5/day=10/hour=12/, with only one file per hour.
> Listing and partitioning suffer here, and while s3a on Hadoop 2.8 is
> better, Spark hasn't yet fully adapted to those changes (use of specific
> API calls). There's also a lot more to be done in S3A to handle wildcards
> in the directory tree much more efficiently (HADOOP-13204); it needs to
> address patterns like datasets/year=201?/month=*/day=10 without
> treewalking and without fetching too much data for wildcards near the top
> of the tree. We need to avoid implementing something which works well on
> *my* layouts but absolutely dies on other people's. As is usual in OSS,
> help is welcome; early testing here is as critical as coding, so as to
> ensure things will work with your file structures.
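
One way to keep the layout friendly to listing is to partition only as deeply
as queries actually need and to control the number of files per partition. A
rough sketch, with hypothetical paths and column names:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3a://my-bucket/structured/")

    # Partition by year and month only (not down to the hour), and repartition
    # by the same columns so each partition directory holds a small number of
    # larger files rather than one tiny file per hour.
    (df
     .withColumn("year", F.year("ingest_ts"))
     .withColumn("month", F.month("ingest_ts"))
     .repartition("year", "month")
     .write
     .partitionBy("year", "month")
     .option("compression", "snappy")
     .parquet("s3a://my-bucket/partitioned/"))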
>
> -Steve
>
>
> > 2. Any recommendations on the EMR setup to analyse the 6 TB of data all
> > at once and build a GBM, in terms of
> > 1) The type of EC2 instances I need?
> > 2) The number of such instances I need?
> > 3) Rough estimate of cost?
> >
>
> no opinion there
>
> >
> > Thanks so much,
> > Zeming
> >
>
>
