It sounds like you're doing something more than just writing the same file back 
to disk. What does your preprocessing consist of?

Sometimes you can save a lot of space by using other formats, but here we're 
talking about a more than 200x increase in file size, so depending on the 
transformations applied to the data you might not see huge savings just by 
switching formats.

If you can give more details about what you are doing with the data, we can 
probably help with your task.

The slowness probably happens because Spark has to shuffle all the data onto a 
single partition (spilling to disk along the way) before it can write one 
file. One thing to reconsider is whether you can merge the output files after 
the job instead, or even pre-partition the data for its final use case; see 
the sketch below.
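
A minimal sketch of that idea, assuming Spark 2.0+ (built-in CSV writer) and 
that the single file can be assembled after the job rather than inside it:

    # Write with Spark's natural partitioning: every task writes its own
    # part file in parallel, so nothing is funneled through one executor.
    myDF.write.mode("overwrite").option("header", "true").csv("data/file_parts")

Then merge the part files outside Spark, e.g. with 
"hadoop fs -getmerge data/file_parts data/file.csv" on HDFS, or 
"cat data/file_parts/part-*.csv > data/file.csv" on a local filesystem. Note 
that with the header option enabled each part file carries its own header row, 
so either write without headers or strip the duplicates while merging.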

- Teemu

> On 9.3.2018, at 12.23, Md. Rezaul Karim <rezaul.ka...@insight-centre.org> 
> wrote:
> 
> Dear All,
> 
> I have a tiny CSV file, which is around 250MB. There are only 30 columns in 
> the DataFrame. Now I'm trying to save the pre-processed DataFrame as 
> another CSV file on disk for later usage. 
> 
> However, writing the resultant DataFrame is frustratingly slow, taking 
> about 4 to 5 hours. On top of that, the file written to disk is about 58GB!  
> 
> Here's the sample code that I tried:
> 
> # Using repartition()
> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
> 
> # Using coalesce()
> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
> 
> 
> Any better suggestions? 
> 
> 
> 
> ---- 
> Md. Rezaul Karim, BSc, MSc
> Research Scientist, Fraunhofer FIT, Germany
> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
> eMail: rezaul.ka...@fit.fraunhofer.de
> Tel: +49 241 80-21527
