RE: CSV to parquet preserving partitioning

2016-11-23 Thread benoitdr
Best solution I've found so far (no shuffling, and as many threads as input dirs):
1. Create an RDD of input dirs, with as many partitions as input dirs
2. Transform it to an RDD of input files (preserving the partitioning by dir)
3. Flat-map it with a custom CSV parser
4. Convert the RDD to a DataFrame
5. Write the DataFrame to parquet
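A minimal PySpark sketch of the five steps above, under stated assumptions: the file listing per directory is precomputed and passed in, files are local-filesystem-readable via `open`, each CSV row has two columns, and the names `parse_csv_line`, `csv_dirs_to_parquet`, `dir`, `c1`, `c2` are all illustrative, not from the thread:

```python
import csv
import io

def parse_csv_line(directory, line):
    # Custom parser: one CSV line -> zero or one (dir, field...) tuples.
    # The record shape is an assumption for illustration.
    fields = next(csv.reader(io.StringIO(line)), None)
    return [(directory,) + tuple(fields)] if fields else []

def csv_dirs_to_parquet(spark, dir_to_files, output_path):
    # Sketch only -- not executed here; needs a live SparkSession.
    # dir_to_files: {input_dir: [file_path, ...]}, a precomputed listing
    # assumed so the example stays self-contained.
    sc = spark.sparkContext
    dirs = sorted(dir_to_files)
    # 1. RDD of input dirs, one partition per dir (the key to avoiding a shuffle)
    dirs_rdd = sc.parallelize(dirs, numSlices=len(dirs))
    # 2. RDD of (dir, file) pairs; flatMap keeps the per-dir partitioning
    files_rdd = dirs_rdd.flatMap(lambda d: [(d, f) for f in dir_to_files[d]])
    # 3. Flat-map with the custom CSV parser (assumes locally readable paths)
    rows_rdd = files_rdd.flatMap(
        lambda pair: [rec for line in open(pair[1])
                      for rec in parse_csv_line(pair[0], line)])
    # 4. Convert to a DataFrame (column names assume two CSV columns)
    df = spark.createDataFrame(rows_rdd, ["dir", "c1", "c2"])
    # 5. Write parquet partitioned by the originating directory
    df.write.partitionBy("dir").parquet(output_path)
```

Because step 1 gives one RDD partition per directory and the later flat-maps never repartition, each output parquet partition is produced by the task that read that directory, which is what keeps the shuffle out.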

RE: CSV to parquet preserving partitioning

2016-11-18 Thread benoitdr
This is more or less how I'm doing it now. Problem is that it creates shuffling in the cluster because the input data are not collocated according to the partition scheme. If I reload the output parquet files as a new dataframe, then everything is fine, but I'd like to avoid shuffling during the initial conversion as well.

RE: CSV to parquet preserving partitioning

2016-11-16 Thread neil90
All you need to do is load all the files into one dataframe at once, then save the dataframe using partitionBy: df.write.format("parquet").partitionBy("directoryCol").save("hdfs://path"). If you then look at the new folder, it should be laid out the way you want, i.e. hdfs://path/dir=dir1/part-r-xxx.

RE: CSV to parquet preserving partitioning

2016-11-16 Thread benoitdr
Yes, by parsing the file content, it's possible to recover which directory they belong in.

Re: CSV to parquet preserving partitioning

2016-11-16 Thread neil90
Is there anything in the files to let you know which directory they should be in?

RE: CSV to parquet preserving partitioning

2016-11-16 Thread Drooghaag, Benoit (Nokia - BE)
> Did you try unioning the datasets for each CSV into a single dataset? You may
> need to put the directory name into a column so you can partition by it.

Re: CSV to parquet preserving partitioning

2016-11-15 Thread Daniel Siegmann
Did you try unioning the datasets for each CSV into a single dataset? You may need to put the directory name into a column so you can partition by it.

On Tue, Nov 15, 2016 at 8:44 AM, benoitdr wrote:
> Hello,
>
> I'm trying to convert a bunch of csv files to parquet, with the interesting
> case
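The union suggestion can be sketched as follows; assumptions here are the helper name `dir_name`, the column name `dir`, headered CSVs with identical schemas across directories, and the use of `DataFrame.union` (positional column matching):

```python
def dir_name(path):
    # Last path component, used as the partition value.
    return path.rstrip("/").rsplit("/", 1)[-1]

def union_csv_dirs(spark, dir_paths, output_path):
    # Sketch only -- not executed here; needs a live SparkSession.
    # Load each directory's CSVs as its own DataFrame, tag each with its
    # directory name as a literal column, union them all, then write
    # parquet partitioned by that column.
    from functools import reduce
    from pyspark.sql import functions as F
    parts = [spark.read.csv(p, header=True)
                  .withColumn("dir", F.lit(dir_name(p)))
             for p in dir_paths]
    df = reduce(lambda a, b: a.union(b), parts)
    df.write.partitionBy("dir").parquet(output_path)
```

This reproduces the desired `dir=.../part-...` output layout, though as the later replies note, partitioning the write by a column the input is not collocated on will still shuffle.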