So upon completion of the 2-hour part of the run, the files did not yet
exist in the output directory? One thing that is done serially is deleting
any files remaining in _temporary, so perhaps there was still a lot of data
in _temporary even though the committed data had already been moved.

I am, unfortunately, not aware of other issues that would cause this to be
so slow.

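For reference, one quick way to check how much data is still sitting in
_temporary is the Hadoop FileSystem API. This is only a rough sketch, not
code from your job; the gs:// path is a placeholder for the real output
directory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TemporaryDirCheck {
      public static void main(String[] args) throws Exception {
        // Placeholder path -- substitute the job's real gs:// output directory.
        Path temp = new Path("gs://my-bucket/output/_temporary");
        // Resolves to the GCS connector as long as it is on the classpath.
        FileSystem fs = temp.getFileSystem(new Configuration());
        if (fs.exists(temp)) {
          ContentSummary summary = fs.getContentSummary(temp);
          System.out.println("files left in _temporary: " + summary.getFileCount());
          System.out.println("bytes left in _temporary: " + summary.getLength());
        } else {
          System.out.println("_temporary is already gone");
        }
      }
    }

Running that while the job appears to hang would show whether the remaining
time really is going into cleanup.
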
On Tue, Jan 27, 2015 at 6:54 PM, Josh Walton <j...@openbookben.com> wrote:

> I'm not sure how to confirm how the moving is happening; however, one of
> the jobs I was talking about, with 9k files of 4 MB each, just completed.
> The Spark UI showed the job as complete after ~2 hours, but the last four
> hours of the run were spent just moving the files from _temporary to their
> final destination. The tasks for the write were definitely shown as
> complete, and no logging was happening on the master or workers. The last
> line of my Java code logs, but the job just sits there while the files are
> moved.
>
> On Tue, Jan 27, 2015 at 7:24 PM, Aaron Davidson <ilike...@gmail.com>
> wrote:
>
>> For saveAsTextFile, the renaming from _temporary to the final location is
>> actually done by the executors, in parallel; it should be performed by
>> each task individually before it returns.
>>
>> I have seen an issue similar to what you mention with Hive code that did
>> the renaming serially on the driver. That is very slow for S3 (and
>> possibly Google Cloud Storage as well), since a rename there actually
>> copies the data rather than being a metadata-only operation. However,
>> that should not be an issue in this case.
>>
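>> For illustration only, the slow pattern looks roughly like the sketch
>> below when done with the Hadoop FileSystem API on the driver: one rename
>> per file, and on an object store each rename copies the whole object.
>> The paths and bucket are placeholders, not code from Hive or Spark:
>>
>>     import org.apache.hadoop.conf.Configuration;
>>     import org.apache.hadoop.fs.FileStatus;
>>     import org.apache.hadoop.fs.FileSystem;
>>     import org.apache.hadoop.fs.Path;
>>
>>     public class SerialDriverRename {
>>       public static void main(String[] args) throws Exception {
>>         Path tempDir = new Path("gs://my-bucket/output/_temporary"); // placeholder
>>         Path finalDir = new Path("gs://my-bucket/output");           // placeholder
>>         FileSystem fs = tempDir.getFileSystem(new Configuration());
>>         for (FileStatus status : fs.listStatus(tempDir)) {
>>           // On S3/GCS, rename() copies the bytes and deletes the original,
>>           // so this loop costs O(total data size), not O(number of files).
>>           fs.rename(status.getPath(), new Path(finalDir, status.getPath().getName()));
>>         }
>>       }
>>     }
>>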
>> Could you confirm how the moving is happening -- i.e., on the executors
>> or the driver?
>>
>> On Tue, Jan 27, 2015 at 4:31 PM, jwalton <j...@openbookben.com> wrote:
>>
>>> We are running Spark on Google Compute Engine using their One-Click
>>> Deploy. By doing so, we get their Google Cloud Storage connector for
>>> Hadoop for free, meaning we can specify gs:// paths for input and output.
>>>
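>>> As a minimal sketch of what that looks like (bucket names and app name
>>> are placeholders, not our actual job), the code just uses gs:// URIs
>>> directly:
>>>
>>>     import org.apache.spark.SparkConf;
>>>     import org.apache.spark.api.java.JavaRDD;
>>>     import org.apache.spark.api.java.JavaSparkContext;
>>>
>>>     public class GcsPathsSketch {
>>>       public static void main(String[] args) {
>>>         SparkConf conf = new SparkConf().setAppName("gcs-paths-sketch");
>>>         JavaSparkContext sc = new JavaSparkContext(conf);
>>>         // With the GCS connector on the classpath, gs:// behaves like
>>>         // any other Hadoop-compatible filesystem.
>>>         JavaRDD<String> lines = sc.textFile("gs://my-input-bucket/input/*");
>>>         lines.saveAsTextFile("gs://my-output-bucket/output");
>>>         sc.stop();
>>>       }
>>>     }
>>>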
>>> We have jobs that take a couple of hours and end up with ~9k partitions,
>>> which means 9k output files. After the job is "complete," it then moves
>>> the output files from our $output_path/_temporary to $output_path. That
>>> process can take longer than the job itself, depending on the
>>> circumstances. The job I mentioned previously outputs ~4 MB files and so
>>> far has copied 1/3 of them from _temporary to the final destination in
>>> 1.5 hours.
>>>
>>> Is there a solution to this besides reducing the number of partitions?
>>> Has anyone else run into similar issues elsewhere? I don't remember this
>>> being an issue with MapReduce jobs on Hadoop; however, I probably wasn't
>>> tracking the transfer of the output files the way I am with Spark.
>>>
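>>> (For what it's worth, the partition-reduction workaround mentioned above
>>> is just a coalesce before the write. A rough sketch, assuming an existing
>>> JavaSparkContext 'sc', with placeholder paths and target count:)
>>>
>>>     import org.apache.spark.api.java.JavaRDD;
>>>
>>>     JavaRDD<String> results = sc.textFile("gs://my-input-bucket/input/*");
>>>     // Collapse ~9k partitions into fewer, larger files before writing.
>>>     // coalesce(500) avoids a full shuffle; each output file gets bigger.
>>>     results.coalesce(500).saveAsTextFile("gs://my-output-bucket/output");
>>>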
>>>
>>
>
