Did you try unioning the per-directory datasets into a single dataset? You may need to put the directory name into a column so you can partition by it when writing; there is a rough sketch of that approach below.
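A minimal sketch of that idea, assuming Spark 2.x with Scala; the directory list, the "dir" column name, and the paths just mirror the layout in your question and are otherwise illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

    // One DataFrame per input directory, each tagged with its directory
    // name, then unioned into a single dataset.
    val dirs = Seq("dir1", "dir2", "dir3")
    val unioned = dirs
      .map(d => spark.read
        .option("header", "true")
        .csv(s"/path/$d/*.csv")
        .withColumn("dir", lit(d)))
      .reduce(_ union _)

    // Repartitioning by the new column sends each directory's rows to a
    // single partition, so partitionBy("dir") writes one file per
    // dir=... output directory.
    unioned
      .repartition(unioned("dir"))
      .write
      .partitionBy("dir")
      .option("compression", "gzip")
      .parquet("hdfs://path")

The repartition by the same column you partition by is what gets you a single file per output directory while still writing all partitions in parallel; it does shuffle the data once, but only that one time.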
On Tue, Nov 15, 2016 at 8:44 AM, benoitdr <benoit.droogh...@nokia.com> wrote:
> Hello,
>
> I'm trying to convert a bunch of CSV files to Parquet, with the interesting
> case that the input CSV files are already "partitioned" by directory.
> All the input files have the same set of columns.
> The input file structure looks like:
>
> /path/dir1/file1.csv
> /path/dir1/file2.csv
> /path/dir2/file3.csv
> /path/dir3/file4.csv
> /path/dir3/file5.csv
> /path/dir3/file6.csv
>
> I'd like to read those files and write their data to a Parquet table in
> HDFS, preserving the partitioning (partitioned by input directory), such
> that there is a single output file per partition.
> The output file structure should look like:
>
> hdfs://path/dir=dir1/part-r-xxx.gz.parquet
> hdfs://path/dir=dir2/part-r-yyy.gz.parquet
> hdfs://path/dir=dir3/part-r-zzz.gz.parquet
>
> The best solution I have found so far is to loop over the input
> directories, loading the CSV files into a dataframe and writing the
> dataframe to the target partition.
> But this is not efficient: since I want a single output file per partition,
> the write to HDFS is a single task that blocks the loop.
> I wonder how to achieve this with a maximum of parallelism (and without
> shuffling the data in the cluster).
>
> Thanks!
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CSV-to-parquet-preserving-partitioning-tp28078.html