Hello, I am trying to combine several small text files (each ranging from a few hundred MB to 2-3 GB) into one big Parquet file.
I am loading each of them and taking a union, but this leads to an enormous number of partitions, since union keeps adding the partitions of the input RDDs together. I also tried loading all the files via a wildcard, but that behaves almost the same as union, i.e. it generates a lot of partitions. One approach I thought of was to repartition the RDD generated after each union and then continue the process, but I don't know how efficient that is. Has anyone come across this kind of thing before?

- Apurva
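
P.S. To make the partition growth concrete, here is a rough back-of-the-envelope model in plain Python (no Spark needed). It only counts partitions; it does not read any data. The 128 MB split size, the file sizes, the target of 8 partitions, and all function names are assumptions made up for illustration, not values from my actual job.

```python
import math

# Assumption: each input file yields roughly one partition per 128 MB split.
SPLIT_MB = 128


def partitions_for(size_mb, split_mb=SPLIT_MB):
    """Rough number of input partitions for one file of size_mb megabytes."""
    return max(1, math.ceil(size_mb / split_mb))


def union_all(partition_counts):
    """A union keeps every input partition, so the counts simply add up."""
    return sum(partition_counts)


def union_with_repartition(partition_counts, target):
    """Model repartitioning to `target` partitions after every union.

    Returns the final partition count and the number of repartition steps.
    """
    current = partition_counts[0]
    repartitions = 0
    for count in partition_counts[1:]:
        current = current + count  # union: partition counts add up
        current = target           # repartition: back down to the target
        repartitions += 1
    return current, repartitions


# Hypothetical file sizes in MB, in the range described above.
file_sizes_mb = [300, 800, 2048, 3072]
counts = [partitions_for(size) for size in file_sizes_mb]

print("partitions per file:", counts)
print("after plain union:", union_all(counts))
print("after union + repartition each time:", union_with_repartition(counts, 8))
```

In this model the plain union ends up with the sum of all the per-file partition counts, while repartitioning after every union keeps the count fixed but needs one repartition step per union. If the model is right, data from the earlier files would get moved around again at every step, which is the part I am unsure about in terms of efficiency.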