Upon completion of the 2-hour part of the run, the files did not yet exist in the output directory? One thing that is done serially is deleting any remaining files from _temporary, so perhaps there was a lot of data left in _temporary even though the committed data had already been moved.
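If it would help to confirm that, here is a minimal sketch of how you could check how much data is still sitting under _temporary while the job appears to hang. This assumes the Hadoop FileSystem API with the GCS connector on the classpath (which the One-Click Deploy should give you); the gs:// output path below is just a placeholder for yours.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TemporaryInspector {
        public static void main(String[] args) throws Exception {
            // Hypothetical output path; substitute your real gs:// location.
            String outputPath = "gs://my-bucket/my-output";
            Path temporary = new Path(outputPath, "_temporary");

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(new URI(outputPath), conf);

            if (fs.exists(temporary)) {
                // getContentSummary walks the tree and reports total size and
                // file count, i.e. how much data is still waiting to be
                // moved or cleaned up.
                ContentSummary summary = fs.getContentSummary(temporary);
                System.out.printf("_temporary holds %d files, %d bytes%n",
                        summary.getFileCount(), summary.getLength());
            } else {
                System.out.println("_temporary is gone; the commit has finished.");
            }
        }
    }

Running that while the job "sits there" would tell you whether the time is going into moving data out of _temporary or into the final cleanup.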
I am, unfortunately, not aware of other issues that would cause this to be
so slow.

On Tue, Jan 27, 2015 at 6:54 PM, Josh Walton <j...@openbookben.com> wrote:

> I'm not sure how to confirm how the moving is happening; however, one of
> the jobs I was talking about, with 9k files of 4mb each, just completed.
> The Spark UI showed the job as complete after ~2 hours. The last four
> hours of the job were just moving the files from _temporary to their
> final destination. The tasks for the write were definitely shown as
> complete, and no logging is happening on the master or workers. The last
> line of my java code logs, but the job sits there while the files are
> being moved.
>
> On Tue, Jan 27, 2015 at 7:24 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> This renaming from _temporary to the final location is actually done by
>> executors, in parallel, for saveAsTextFile. It should be performed by
>> each task individually before it returns.
>>
>> I have seen an issue similar to what you mention with Hive code, which
>> did the renaming serially on the driver. That is very slow for S3 (and
>> possibly Google Storage as well), as it actually copies the data rather
>> than doing a metadata-only operation during the rename. However, this
>> should not be an issue in this case.
>>
>> Could you confirm how the moving is happening -- i.e., on the executors
>> or the driver?
>>
>> On Tue, Jan 27, 2015 at 4:31 PM, jwalton <j...@openbookben.com> wrote:
>>
>>> We are running Spark in Google Compute Engine using their One-Click
>>> Deploy. By doing so, we get their Google Cloud Storage connector for
>>> Hadoop for free, meaning we can specify gs:// paths for input and
>>> output.
>>>
>>> We have jobs that take a couple of hours and end up with ~9k
>>> partitions, which means 9k output files. After the job is "complete"
>>> it then moves the output files from our $output_path/_temporary to
>>> $output_path. That process can take longer than the job itself,
>>> depending on the circumstances. The job I mentioned previously outputs
>>> ~4mb files, and so far it has copied 1/3 of the files from _temporary
>>> to the final destination in 1.5 hours.
>>>
>>> Is there a solution to this besides reducing the number of partitions?
>>> Has anyone else run into similar issues elsewhere? I don't remember
>>> this being an issue with MapReduce jobs on Hadoop; however, I probably
>>> wasn't tracking the transfer of the output files like I am with Spark.
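For what it's worth, here is a rough sketch of the workaround raised in the quoted question: write fewer, larger files so there are fewer objects to move after the tasks finish. This is only an illustration using the Java API; the gs:// paths and the partition count of 500 are hypothetical and would need tuning for your data.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CoalesceBeforeSave {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("coalesce-before-save");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Hypothetical input; stands in for whatever produces the
            // ~9k-partition RDD in your job.
            JavaRDD<String> results = sc.textFile("gs://my-bucket/my-input");

            // Coalescing to fewer partitions means fewer, larger output files,
            // so fewer per-file copy/rename operations against gs:// once the
            // write tasks complete. 500 is an arbitrary example.
            results.coalesce(500)
                   .saveAsTextFile("gs://my-bucket/my-output");

            sc.stop();
        }
    }

coalesce without the shuffle flag avoids a full shuffle, so the savings come purely from having fewer files to move during the commit.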