We are running Spark on Google Compute Engine using their one-click deploy.
That deployment includes Google's Cloud Storage connector for Hadoop, so we
can specify gs:// paths for input and output.

We have jobs that take a couple of hours and end up with ~9k partitions,
which means ~9k output files. After the job is "complete", Spark then moves
the output files from $output_path/_temporary to $output_path. That move can
take longer than the job itself, depending on the circumstances. The job I
mentioned above outputs ~4 MB files, and in 1.5 hours it has so far copied
only a third of them from _temporary to the final destination.
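For reference, the save step is just the stock call; as I understand it, all
the moving happens afterwards inside Hadoop's output committer (the RDD name
and path below are made up):

```scala
// Illustrative sketch -- `results` and the output path are hypothetical.
// saveAsTextFile() writes each of the ~9k partitions under
// $output_path/_temporary/..., and at commit time every part file is
// "renamed" into place. On GCS a rename is effectively a copy + delete of
// the object, which seems to be why the commit can take longer than the
// job itself.
results.saveAsTextFile("gs://my-bucket/job-output")
```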

Is there a solution to this besides reducing the number of partitions? Has
anyone else run into similar issues elsewhere? I don't remember this being a
problem with MapReduce jobs on Hadoop; then again, I probably wasn't tracking
the transfer of the output files the way I am with Spark.
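To be concrete, the partition-reducing workaround I'd rather avoid looks
roughly like this (the target count and path are arbitrary examples):

```scala
// Collapse the ~9k partitions down to a smaller number before saving, so
// fewer part files need to be moved out of _temporary at commit time.
// coalesce() avoids a full shuffle when only shrinking the partition count.
val fewerParts = results.coalesce(500)  // 500 is just an illustrative number
fewerParts.saveAsTextFile("gs://my-bucket/job-output")
```

The downside is that fewer, larger partitions also means less parallelism in
the final stage of the job, which is why I'd prefer another fix.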



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/performance-of-saveAsTextFile-moving-files-from-temporary-tp21397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
