The best solution I've found so far (no shuffling, and as many threads as
input dirs):
Create an rdd of input dirs, with as many partitions as input dirs
Transform it to an rdd of input files (preserving the partitions by dirs)
Flat-map it with a custom csv parser
Convert rdd to dataframe
Write datafr
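
A minimal PySpark sketch of the steps above, not the poster's actual code. The helper names (list_csv_files, parse_file, convert), the column name directoryCol, and the use of os.listdir are assumptions for illustration; os.listdir only works when the input directories are on a filesystem every executor can see, so HDFS input would need a different listing call. Only narrow transformations (flatMap) are used before the write, so no shuffle stage should be required.

```python
import csv
import os


def list_csv_files(directory):
    """Return the CSV files of one input directory (local or shared filesystem)."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".csv")
    )


def parse_file(path):
    """Custom parser: yield (directory name, field1, field2, ...) per CSV row."""
    dir_name = os.path.basename(os.path.dirname(path))
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            yield tuple([dir_name] + row)


def convert(spark, input_dirs, columns, output_path):
    """Convert the CSV files under input_dirs to Parquet, partitioned by directory.

    `spark` is an existing SparkSession; `columns` names the CSV fields.
    """
    sc = spark.sparkContext
    # As many partitions as input dirs, intended as one directory per partition.
    dirs = sc.parallelize(input_dirs, len(input_dirs))
    # flatMap is a narrow transformation, so the partitioning by dir is kept.
    files = dirs.flatMap(list_csv_files)
    rows = files.flatMap(parse_file)
    # All fields are strings here; cast them afterwards if types matter.
    df = rows.toDF(["directoryCol"] + columns)
    df.write.partitionBy("directoryCol").parquet(output_path)
```

Because each Spark partition then holds rows of a single directory, each task should write into a single output folder.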
This is more or less how I'm doing it now.
The problem is that it creates shuffling in the cluster, because the input
data are not collocated according to the partition scheme.
If I reload the output parquet files as a new dataframe, then everything is
fine, but I'd like to avoid shuffling also during
All you need to do is load all the files into one dataframe at once. Then
save the dataframe using partitionBy:
df.write.format("parquet").partitionBy("directoryCol").save("hdfs://path")
Then, if you look at the new folder, it should look the way you want, i.e.
hdfs://path/directoryCol=dir1/part-r-xxx.
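
A hedged sketch of this suggestion in PySpark. It assumes the directory can be taken from each row's source path via input_file_name(); the thread itself only says it can be recovered from the file content, so that part is an assumption. The helper names, the input_glob parameter and the header=True option are also illustrative.

```python
import posixpath
from urllib.parse import urlparse


def parent_dir_name(file_uri):
    """Name of the directory that directly contains the given file URI."""
    return posixpath.basename(posixpath.dirname(urlparse(file_uri).path))


def convert(spark, input_glob, output_path):
    """Load every CSV matched by input_glob at once, then write by directory."""
    # Imported lazily so parent_dir_name can be used without Spark installed.
    from pyspark.sql.functions import input_file_name, udf
    from pyspark.sql.types import StringType

    to_dir = udf(parent_dir_name, StringType())
    # header=True is an assumption about the input files.
    df = spark.read.csv(input_glob, header=True)
    # Tag each row with the directory its source file lives in.
    df = df.withColumn("directoryCol", to_dir(input_file_name()))
    df.write.format("parquet").partitionBy("directoryCol").save(output_path)
```

A call such as convert(spark, "hdfs://in/*/*.csv", "hdfs://path") (both paths hypothetical) would be expected to produce one directoryCol=<name> folder per input directory.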
Yes, by parsing the file content, it's possible to recover which directory
they are in.
From: neil90 [via Apache Spark User List]
[mailto:ml-node+s1001560n28083...@n3.nabble.com]
Sent: Wednesday, 16 November 2016 17:41
To: Drooghaag, Benoit (Nokia - BE)
Subject: Re: CSV to parquet prese
Is there anything in the files to let you know which directory they should be
in?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/CSV-to-parquet-preserving-partitioning-tp28078p28083.html
Sent from the Apache Spark User List mailing list archive at Nabble.co
: CSV to parquet preserving partitioning
Did you try unioning the datasets for each CSV into a single dataset? You
may need to put the directory name into a column so you can partition by it.
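
One way this could look in PySpark, as a sketch only: read each directory separately, add its name as a literal column, and union the results. The function names, the directoryCol column and header=True are assumptions; union matches columns by position, so all directories must share the same schema.

```python
from functools import reduce
import posixpath


def label_for(directory):
    """Directory name used as the partition value, e.g. '.../dir1/' -> 'dir1'."""
    return posixpath.basename(directory.rstrip("/"))


def union_by_directory(spark, input_dirs):
    """Return one DataFrame covering all input_dirs, tagged with directoryCol."""
    # Imported lazily so label_for can be used without Spark installed.
    from pyspark.sql.functions import lit

    frames = [
        spark.read.csv(d, header=True).withColumn("directoryCol", lit(label_for(d)))
        for d in input_dirs
    ]
    # DataFrame.union appends rows positionally and does not deduplicate.
    return reduce(lambda left, right: left.union(right), frames)
```

The combined DataFrame can then be written with write.partitionBy("directoryCol").parquet(...) to get one output folder per input directory.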
On Tue, Nov 15, 2016 at 8:44 AM, benoitdr wrote:
> Hello,
>
> I'm trying to convert a bunch of csv files to parquet, with the interesting
> case