Hive can launch another job with a strategy to merge the small files; perhaps 
we can do the same in a future release.
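
For illustration, a rough sketch (not from this thread) of what such a 
follow-up compaction job could look like with the Spark SQL API of that era; 
the paths and the target partition count are made up, and it assumes that 
repartition on a SchemaRDD keeps its schema:

// In the spark-shell (sc is provided by the shell):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Read the directory full of small Parquet files left behind by insertInto,
// shuffle the rows into a few large partitions, and write them back out.
val small = sqlContext.parquetFile("hdfs:///warehouse/events")
small.repartition(4).saveAsParquetFile("hdfs:///warehouse/events_compacted")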

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, September 05, 2014 8:59 AM
To: DanteSama
Cc: u...@spark.incubator.apache.org
Subject: Re: SchemaRDD - Parquet - "insertInto" makes many files

It depends on the RDD in question exactly where the work will be done. I 
believe that if you do a repartition(1) instead of a coalesce(1), it will force 
a shuffle, so the upstream work will be done in a distributed fashion and then 
a single node will read the shuffled data and write it out.

If you want to write to a single Parquet file, however, you will at some point 
need to funnel the work through a single node.
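
To make the difference concrete, here is a rough sketch (not from the original 
message) with made-up paths, assuming that repartition and coalesce on a 
SchemaRDD keep its schema:

// In the spark-shell (sc is provided by the shell):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val events = sqlContext.parquetFile("hdfs:///data/events")  // many partitions

// repartition(1) shuffles first, so the upstream stages stay distributed and
// only the final read of the shuffled data and the write run as a single task.
// coalesce(1) would instead narrow the lineage, so the upstream work itself
// could end up running on one node.
events.repartition(1).saveAsParquetFile("hdfs:///data/events_single")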

On Thu, Sep 4, 2014 at 2:02 PM, DanteSama <chris.feder...@sojo.com> wrote:
Yep, that worked out. Does this solution have any performance implications
beyond all the work being done on (probably) one node?



