This is more or less how I'm doing it now. The problem is that it causes a shuffle in the cluster, because the input data are not collocated according to the partition scheme.
If I reload the output parquet files as a new DataFrame, then everything is fine, but I'd like to avoid the shuffle during the ETL phase as well.
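
For concreteness, here is a minimal sketch (spark-shell style Scala) of the write-then-reload pattern described above. The partition column name "key" and the input/output paths are hypothetical placeholders, not from the original thread, and the sketch only illustrates the workflow; it does not remove the shuffle during the initial write.

    import org.apache.spark.sql.SparkSession

    // Assumes a SparkSession is available or can be created (as in spark-shell).
    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

    // Read the raw CSV input; its rows are not collocated by the partition key.
    val input = spark.read.option("header", "true").csv("/data/input.csv")

    // Write parquet partitioned by the (hypothetical) "key" column.
    input.write.partitionBy("key").parquet("/data/output_parquet")

    // Reloading the partitioned parquet gives a DataFrame whose physical layout
    // follows the partition scheme, so downstream queries behave as expected.
    val reloaded = spark.read.parquet("/data/output_parquet")
    reloaded.show()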