This renaming from _temporary to the final location is actually done by
executors, in parallel, for saveAsTextFile. It should be performed by each
task individually before it returns.

I have seen a similar issue with Hive code that did the renaming serially
on the driver. That is very slow on S3 (and possibly Google Cloud Storage
as well), because a rename there actually copies the data rather than
performing a metadata-only operation. However, this should not be an issue
in this case.
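
To make the serial-versus-parallel distinction concrete, here is a rough
local-filesystem sketch. It is not Spark's or Hadoop's actual commit code:
the function names (copy_then_delete, commit_serially, commit_in_parallel,
make_output) are hypothetical, and a copy followed by a delete only stands in
for what a "rename" costs on an object store.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def copy_then_delete(src, dst):
    """Stand-in for an object-store 'rename': copy the bytes, then delete the source."""
    shutil.copyfile(src, dst)
    os.remove(src)


def commit_serially(temp_dir, final_dir):
    """Move every part file one at a time, as a single driver process would."""
    names = sorted(os.listdir(temp_dir))
    for name in names:
        copy_then_delete(os.path.join(temp_dir, name), os.path.join(final_dir, name))
    return names


def commit_in_parallel(temp_dir, final_dir, workers=8):
    """Move part files concurrently, as if each task moved its own output."""
    names = sorted(os.listdir(temp_dir))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(
                copy_then_delete,
                os.path.join(temp_dir, name),
                os.path.join(final_dir, name),
            )
            for name in names
        ]
        for future in futures:
            future.result()  # re-raise any failure from a worker thread
    return names


def make_output(root, num_parts):
    """Create root/_temporary holding num_parts small part files (illustrative only)."""
    temp_dir = os.path.join(root, "_temporary")
    os.makedirs(temp_dir)
    for i in range(num_parts):
        with open(os.path.join(temp_dir, "part-%05d" % i), "w") as f:
            f.write("line %d\n" % i)
    return temp_dir


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    try:
        temp_dir = make_output(root, 100)
        moved = commit_in_parallel(temp_dir, root, workers=8)
        print("moved %d files" % len(moved))
    finally:
        shutil.rmtree(root)
```

On a local disk the two variants will finish in similar time. The difference
only becomes large when each move is a remote copy with high per-file latency,
which is the situation described above.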

Could you confirm how the moving is happening -- i.e., on the executors or
the driver?

On Tue, Jan 27, 2015 at 4:31 PM, jwalton <j...@openbookben.com> wrote:

> We are running Spark in Google Compute Engine using their One-Click Deploy.
> By doing so, we get their Google Cloud Storage connector for Hadoop for free,
> meaning we can specify gs:// paths for input and output.
>
> We have jobs that take a couple of hours, end up with ~9k partitions which
> means 9k output files. After the job is "complete" it then moves the output
> files from our $output_path/_temporary to $output_path. That process can
> take longer than the job itself depending on the circumstances. The job I
> mentioned previously outputs ~4 MB files, and so far has copied 1/3 of the
> files in 1.5 hours from _temporary to the final destination.
>
> Is there a solution to this besides reducing the number of partitions?
> Anyone else run into similar issues elsewhere? I don't remember this being
> an issue with MapReduce jobs and Hadoop; however, I probably wasn't
> tracking the transfer of the output files like I am with Spark.
