We are running Spark on Google Compute Engine using their one-click deploy. That gives us the Google Cloud Storage connector for Hadoop for free, meaning we can specify gs:// paths for input and output.
We have jobs that take a couple of hours and end up with ~9k partitions, which means ~9k output files. After the job is "complete", Spark then moves the output files from $output_path/_temporary to $output_path. Depending on the circumstances, that move can take longer than the job itself. The job I mentioned writes ~4 MB files, and in 1.5 hours it has so far copied only a third of them from _temporary to the final destination.

Is there a solution to this besides reducing the number of partitions? Has anyone else run into similar issues elsewhere? I don't remember this being a problem with MapReduce jobs on Hadoop, but I probably wasn't tracking the transfer of the output files as closely then as I am with Spark.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/performance-of-saveAsTextFile-moving-files-from-temporary-tp21397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
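A back-of-envelope check on the numbers above (a sketch, using only the approximate figures from the post) suggests the move phase is dominated by per-file overhead rather than bandwidth: GCS has no atomic rename, so each "move" is effectively a copy plus a delete per object, and the effective throughput here works out to only a few MB/s.

```python
# Rough estimate of the commit (move) phase, using the approximate
# figures from the post: ~9k output files of ~4 MB each, with ~1/3
# copied after 1.5 hours. All results are back-of-envelope only.

total_files = 9000          # ~9k partitions -> ~9k output files
file_mb = 4                 # ~4 MB per output file
copied_fraction = 1 / 3     # progress observed so far
elapsed_hours = 1.5

files_copied = total_files * copied_fraction       # files moved so far
rate_per_hour = files_copied / elapsed_hours       # files moved per hour
total_hours = total_files / rate_per_hour          # projected total move time
throughput_mb_s = rate_per_hour * file_mb / 3600   # effective data rate

print(f"files copied so far: {files_copied:.0f}")
print(f"rate: {rate_per_hour:.0f} files/hour")
print(f"estimated total move time: {total_hours:.1f} hours")
print(f"effective throughput: {throughput_mb_s:.1f} MB/s")
```

At ~2 MB/s over ~36 GB of output, the move would take roughly 4.5 hours end to end, which matches the observation that it can outlast the job itself.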