Exactly where the work gets done depends on the RDD in question. I believe that if you use repartition(1) instead of coalesce(1), it will force a shuffle: the upstream work is done in a distributed fashion, and then a single node reads the shuffled data and writes it out.
If you want to write to a single parquet file, however, you will at some point need to bottleneck on a single node.

On Thu, Sep 4, 2014 at 2:02 PM, DanteSama <chris.feder...@sojo.com> wrote:
> Yep, that worked out. Does this solution have any performance implications
> past all the work being done on (probably) 1 node?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-Parquet-insertInto-makes-many-files-tp13480p13501.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
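For what it's worth, the two options can be sketched roughly like this against the Spark 1.x SchemaRDD API this thread uses. The variable names and output paths are illustrative, not from the thread, and you'd run this inside an existing SparkContext/SQLContext:

```scala
// Illustrative sketch only (Spark 1.x era API, as used in this thread).
// `sqlContext` is an existing org.apache.spark.sql.SQLContext.
val data = sqlContext.parquetFile("hdfs:///in/events") // a SchemaRDD; path is hypothetical

// coalesce(1): merges partitions WITHOUT a shuffle, so the upstream
// computation can end up collapsing onto (probably) one node as well.
data.coalesce(1).saveAsParquetFile("hdfs:///out/coalesced")

// repartition(1): forces a full shuffle, so the upstream work stays
// distributed; only the final read-and-write is serialized onto one node.
// (repartition on a SchemaRDD returns a plain RDD[Row] in this era,
// so the schema is re-applied before writing.)
sqlContext
  .applySchema(data.repartition(1), data.schema)
  .saveAsParquetFile("hdfs:///out/repartitioned")
```

Either way the output directory contains a single part-file rather than one per partition; the difference is only in where the work before the write happens.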