Dear All,

I have a tiny CSV file, around 250MB, and the resulting DataFrame has only
30 columns. Now I'm trying to save the pre-processed DataFrame as another
CSV file on disk for later use.

However, I'm frustrated that writing the resulting DataFrame takes far too
long, about 4 to 5 hours. Worse still, the file written to disk is about
58GB!
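
To rule out a row or column explosion during pre-processing, a quick sanity
check on the DataFrame (myDF in the code below) might be worth running
before the write; a minimal sketch in PySpark:

# Sanity-check the pre-processed DataFrame before writing
print(myDF.count())                 # row count: did pre-processing blow up the rows?
print(len(myDF.columns))            # should still be 30 columns
print(myDF.rdd.getNumPartitions())  # partitions feeding the single-file write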

Here's the sample code that I tried:

# Using repartition()
myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")

# Using coalesce()
myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
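
For comparison, the built-in CSV writer (available since Spark 2.0) could
replace the external com.databricks.spark.csv package; a minimal sketch,
assuming the same myDF (the output path "data/file_out" is just a
placeholder):

# Using the built-in CSV writer (Spark 2.0+)
myDF.coalesce(1).write.option("header", "true").csv("data/file_out")

Note that Spark writes a directory of part files at that path rather than a
single file named file.csv; with coalesce(1) there is one part file inside
the directory.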


Any better suggestions?



----
Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany

Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany

eMail: rezaul.ka...@fit.fraunhofer.de
Tel: +49 241 80-21527
