Hi All,

Thanks for the prompt response, really appreciated! Here is some more information:
1. Spark version: 2.3.0
2. vCore: 8
3. RAM: 32GB
4. Deploy mode: Spark standalone

*Operation performed:* I did transformations using StringIndexer on some columns and null imputation. That's all.

*Why write back into CSV:* I need to write the DataFrame into CSV so it can be used by a non-Spark application. Moreover, I need to perform the same pre-processing on a larger dataset (about 2GB), and this one is just a small sample, so writing into Parquet or ORC is not a viable option for me. I was trying to use Spark only for the pre-processing. By the way, I tried using Spark's built-in CSV library too; a rough sketch of what I am running is at the bottom of this mail.

Best,

----
Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany
Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
eMail: rezaul.ka...@fit.fraunhofer.de
Tel: +49 241 80-21527

On 9 March 2018 at 13:41, Teemu Heikkilä <te...@emblica.fi> wrote:

> Sounds like you're doing something more than just writing the same file
> back to disk. What does your pre-processing consist of?
>
> Sometimes you can save a lot of space by using other formats, but here we're
> talking about a more than 200x increase in file size, so depending on the
> transformations applied to the data you might not get such huge savings from
> another format.
>
> If you can give more details about what you are doing with the data, we
> could probably help with your task.
>
> The slowness probably happens because Spark is using the disk while it moves
> the data into a single partition for writing the single file. One thing to
> reconsider is whether you can merge the output files after the process, or
> even pre-partition the data for its final use case.
>
> - Teemu
>
> On 9.3.2018, at 12.23, Md. Rezaul Karim <rezaul.ka...@insight-centre.org>
> wrote:
>
> Dear All,
>
> I have a tiny CSV file, which is around 250MB. There are only 30 columns
> in the DataFrame. Now I'm trying to save the pre-processed DataFrame as
> another CSV file on disk for later usage.
>
> However, I'm getting frustrated because writing the resultant DataFrame is
> taking too long, about 4 to 5 hours. What's more, the size of the file
> written to disk is about 58GB!
>
> Here's the sample code that I tried:
>
> # Using repartition()
> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> # Using coalesce()
> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> Any better suggestion?
>
> ----
> Md. Rezaul Karim, BSc, MSc
> Research Scientist, Fraunhofer FIT, Germany
> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
> eMail: rezaul.ka...@fit.fraunhofer.de
> Tel: +49 241 80-21527
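
P.S. For reference, here is a rough sketch of the pre-processing and the write I described above, using the built-in CSV reader/writer in Spark 2.3. The column names, the imputation value, and the paths are only placeholders, not my actual schema:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.appName("csv-preprocessing").getOrCreate()

    # Read the small CSV (placeholder path).
    df = spark.read.csv("data/input.csv", header=True, inferSchema=True)

    # Null imputation: fill missing values in the categorical columns.
    df = df.fillna("missing", subset=["cat_col1", "cat_col2"])

    # StringIndexer on some columns (placeholder column names).
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
                for c in ["cat_col1", "cat_col2"]]
    indexed = Pipeline(stages=indexers).fit(df).transform(df)

    # Built-in CSV writer, without coalesce(1): every partition writes its own
    # part file in parallel into the output directory.
    indexed.write.option("header", "true").mode("overwrite").csv("data/output_csv")

Writing without coalesce(1) produces a directory of part files; if the non-Spark application really needs a single file, the parts can be concatenated outside Spark afterwards (e.g. with hdfs dfs -getmerge or plain cat), along the lines Teemu suggested.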