Dear All,

I have a small CSV file of around 250 MB, and the DataFrame has only 30 columns. I am trying to save the pre-processed DataFrame as another CSV file on disk for later use.
However, I'm getting frustrated because writing the resulting DataFrame takes far too long, about 4 to 5 hours. What's worse, the file written to disk ends up around 58 GB! Here is the sample code that I tried:

    # Using repartition()
    myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")

    # Using coalesce()
    myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")

Any better suggestions?

----
Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany
Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
eMail: rezaul.ka...@fit.fraunhofer.de
Tel: +49 241 80-21527
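P.S. In case it helps frame the question, here is a minimal variant I am considering, assuming Spark 2.x (where CSV support is built in) and that myDF is the pre-processed DataFrame; the output path is just a placeholder. It first sanity-checks the row count, since a stray join during pre-processing could explain 250 MB in vs. 58 GB out, and then writes gzip-compressed CSV without collapsing everything to one partition:

    # Sanity check: did pre-processing explode the number of rows?
    print(myDF.count(), len(myDF.columns))

    # Built-in CSV writer with gzip compression, keeping multiple partitions
    myDF.write \
        .option("header", "true") \
        .option("compression", "gzip") \
        .csv("data/file_csv")

I have not verified whether this actually brings the write time down, so any pointers are welcome.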