Hello,

I am trying to combine several text files (each roughly a few hundred MB
to 2-3 GB) into one big Parquet file.

I am loading each of them and taking a union, but this leads to an
enormous number of partitions, since union just adds together the
partitions of the input RDDs.
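For example, roughly what I am seeing in the spark-shell (the paths and
partition counts here are just placeholders):

  val a = sc.textFile("hdfs:///data/in/part-001.txt")   // say 20 partitions
  val b = sc.textFile("hdfs:///data/in/part-002.txt")   // say 25 partitions
  println(a.union(b).partitions.length)                 // 45 -- union just concatenates the partition lists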

I also tried loading all the files via a wildcard, but that behaves
almost the same as union, i.e. it generates a lot of partitions.
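Again just a rough sketch of the wildcard version (path is a placeholder):

  val all = sc.textFile("hdfs:///data/in/*.txt")
  println(all.partitions.length)   // still one partition per input split, so thousands of them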

One approach I thought of was to repartition the RDD generated after
each union and then continue the process, roughly like the sketch below,
but I don't know how efficient that is.
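Here is an untested sketch of what I mean (spark-shell; the paths and
partition counts are made up):

  val paths = Seq("hdfs:///data/in/a.txt", "hdfs:///data/in/b.txt", "hdfs:///data/in/c.txt")
  val rdds = paths.map(p => sc.textFile(p))
  // repartition after every union, as described above -- one shuffle per file
  val afterEach = rdds.reduce((x, y) => x.union(y).repartition(200))
  // versus a single coalesce at the very end, which avoids the repeated shuffles
  val once = rdds.reduce(_ union _).coalesce(200)

The single-coalesce variant is just there for comparison; I have not
measured either.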

Has anyone come across this kind of thing before?

- Apurva
