See: https://github.com/rdblue/s3committer and
https://www.youtube.com/watch?v=8F2Jqw5_OnI&feature=youtu.be


On Mon, Oct 2, 2017 at 11:31 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> You don't need to collect data in the driver to save it. The code in
> the original question doesn't use "collect()", so it's actually doing
> a distributed write.
>
>
> On Mon, Oct 2, 2017 at 11:26 AM, JG Perrin <jper...@lumeris.com> wrote:
> > Steve,
> >
> >
> >
> > If I refer to the collect() API, it says “Running collect requires moving
> > all the data into the application's driver process, and doing so on a very
> > large dataset can crash the driver process with OutOfMemoryError.” So why
> > would you need a distributed FS?
> >
> >
> >
> > jg
> >
> >
> >
> > From: Steve Loughran [mailto:ste...@hortonworks.com]
> > Sent: Saturday, September 30, 2017 6:10 AM
> > To: JG Perrin <jper...@lumeris.com>
> > Cc: Alexander Czech <alexander.cz...@googlemail.com>; user@spark.apache.org
> > Subject: Re: HDFS or NFS as a cache?
> >
> >
> >
> >
> >
> > On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.com> wrote:
> >
> >
> >
> > You will collect in the driver (often the master) and it will save the
> > data, so for saving, you will not have to set up HDFS.
> >
> >
> >
> > no, it doesn't work quite like that.
> >
> >
> >
> > 1. workers generate their data and save it somewhere
> >
> > 2. on "task commit" they move their data to some location where it will be
> > visible for "job commit" (rename, upload, whatever)
> >
> > 3. job commit (which is done in the driver) takes all the committed task
> > data and makes it visible in the destination directory.
> >
> > 4. Then the driver creates a _SUCCESS file to say "done!"
> >
> >
> >
> >
> >
> > This is done with Spark talking between workers and driver to guarantee
> > that only one task working on a specific part of the data commits its
> > work, and that the job is committed only once all tasks have finished.
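> >
> > As a rough illustration of steps 1-4, a minimal PySpark sketch (the paths
> > and example data below are made up for illustration, not from the original
> > mail):
> >
> > from pyspark.sql import SparkSession
> >
> > spark = SparkSession.builder.getOrCreate()
> > df = spark.range(1000)  # hypothetical example data
> >
> > # tasks write and commit their part files; the driver runs job commit
> > df.write.mode("overwrite").parquet("hdfs:///tmp/example_out")
> >
> > # after job commit the destination holds the part files plus the
> > # zero-byte marker from step 4:
> > #   hdfs:///tmp/example_out/_SUCCESS
> > #   hdfs:///tmp/example_out/part-00000-....snappy.parquet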
> >
> >
> >
> > The v1 mapreduce committer implements (2) by moving files under a job
> > attempt dir, and (3) by moving them from the job attempt dir to the
> > destination: one rename per task commit, another rename of every file on
> > job commit. In HDFS, Azure wasb and other stores with an O(1) atomic
> > rename, this isn't *too* expensive, though that final job commit rename
> > still takes time to list and move lots of files.
> >
> >
> >
> > The v2 committer implements (2) by renaming to the destination directory
> > and (3) as a no-op. Renames happen in the tasks then, but not that second,
> > serialized one at the end.
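> >
> > If you want to experiment, the algorithm version can be chosen per job; a
> > hedged sketch (the config key is the standard Hadoop one, the session
> > setup is just illustrative):
> >
> > from pyspark.sql import SparkSession
> >
> > # 1 = rename per task commit plus a second, serialized rename at job commit
> > # 2 = tasks rename straight into the destination; job commit is a no-op
> > spark = (SparkSession.builder
> >          .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
> >          .getOrCreate())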
> >
> >
> >
> > There's no copy of data from workers to the driver; instead you need a
> > shared output filesystem so that the job committer can do its work
> > alongside the tasks.
> >
> >
> >
> > There are alternative committer algorithms:
> >
> >
> >
> > 1. look at Ryan Blue's talk: https://www.youtube.com/watch?v=BgHrff5yAQo
> >
> > 2. IBM Stocator paper (https://arxiv.org/abs/1709.01812) and code
> > (https://github.com/SparkTC/stocator/)
> >
> > 3. Ongoing work in Hadoop itself for better committers. Goal: year end &
> > Hadoop 3.1. https://issues.apache.org/jira/browse/HADOOP-13786 . The code
> > is all there, Parquet is a trouble spot, and more testing is welcome from
> > anyone who wants to help.
> >
> > 4. Databricks have "something"; specifics aren't covered, but I assume it's
> > DynamoDB based.
> >
> >
> >
> >
> >
> > -Steve
> >
> >
> > From: Alexander Czech [mailto:alexander.cz...@googlemail.com]
> > Sent: Friday, September 29, 2017 8:15 AM
> > To: user@spark.apache.org
> > Subject: HDFS or NFS as a cache?
> >
> >
> >
> > I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write
> > parquet files to S3. But the S3 performance, for various reasons, is bad
> > when I access S3 through the parquet write method:
> >
> > df.write.parquet('s3a://bucket/parquet')
> >
> > Now I want to set up a small cache for the parquet output. One output is
> > about 12-15 GB in size. Would it be enough to set up an NFS directory on
> > the master, write the output to it and then move it to S3? Or should I set
> > up HDFS on the master? Or should I even opt for an additional cluster
> > running an HDFS solution on more than one node?
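> >
> > A rough sketch of the "stage on HDFS, then move to S3" idea, assuming an
> > HDFS service is reachable from all nodes (the staging path and bucket name
> > below are placeholders):
> >
> > df.write.parquet('hdfs:///tmp/parquet_staging')
> >
> > # then copy to S3 outside of Spark, e.g. with distcp:
> > #   hadoop distcp hdfs:///tmp/parquet_staging s3a://bucket/parquet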
> >
> > thanks!
> >
> >
>
>
>
> --
> Marcelo
>
>
