Hive can launch another job with a strategy to merge the small files; we could probably do the same in a future release.
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, September 05, 2014 8:59 AM
To: DanteSama
Cc: u...@spark.incubator.apache.org
Subject: Re: SchemaRDD - Parquet - "insertInto" makes many files

It depends on the RDD in question exactly where the work will be done. I believe that if you do a repartition(1) instead of a coalesce, it will force a shuffle, so the work is done in a distributed fashion and then a single node reads that shuffled data and writes it out. If you want to write to a single Parquet file, however, you will at some point need to block on a single node.

On Thu, Sep 4, 2014 at 2:02 PM, DanteSama <chris.feder...@sojo.com> wrote:

Yep, that worked out. Does this solution have any performance implications beyond all the work being done on (probably) one node?
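
To make the repartition(1)-versus-coalesce(1) point concrete, here is a minimal sketch assuming the Spark 1.0/1.1-era SchemaRDD API (SQLContext with the createSchemaRDD implicit); the Event case class, the output path, and the "events" table name are hypothetical and only for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical schema; any case class (Product) works with createSchemaRDD.
case class Event(id: Long, name: String)

object SingleParquetFileExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("single-parquet-file"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD

    val events = sc.parallelize(Seq(Event(1L, "a"), Event(2L, "b")))

    // repartition(1) forces a shuffle: upstream work stays distributed, then a
    // single task reads the shuffled data and writes one Parquet file.
    // coalesce(1) would instead pull the whole pipeline onto one task.
    events.repartition(1).saveAsParquetFile("/tmp/events_single")

    // If the target is an existing table known to the SQLContext, insertInto
    // can be used instead (hypothetical table name "events"):
    // events.repartition(1).insertInto("events")

    sc.stop()
  }
}

Either way, the final write is serialized through one task, which is the single-node blocking point Michael describes; only the work upstream of the shuffle remains parallel.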