Hi All,

Thanks for the prompt response, really appreciated! Here is some more information:
1. Spark version: 2.3.0
2. vCore: 8
3. RAM: 32GB
4. Deploy mode: Spark standalone

*Operation performed:* I did transformations using StringIndexer on some columns and null imputation. That's all.

*Why write back into CSV:* I need to write the DataFrame into CSV so it can be used by a non-Spark application. Moreover, I need to perform the same pre-processing on a larger dataset (about 2GB), and this one is just a small sample, so writing into Parquet or ORC is not a viable option for me. I was trying to use Spark only for the pre-processing. By the way, I tried using Spark's built-in CSV library too; a rough sketch of what I am running is at the bottom of this mail.

Best,

----
Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany
Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
eMail: rezaul.ka...@fit.fraunhofer.de
Tel: +49 241 80-21527

On 9 March 2018 at 13:41, Teemu Heikkilä <te...@emblica.fi> wrote:

> Sounds like you're doing something more than just writing the same file
> back to disk. What does your pre-processing consist of?
>
> Sometimes you can save a lot of space by using other formats, but here we're
> talking about a more than 200x increase in file size, so depending on the
> transformations applied to the data you might not get such huge savings from
> another format.
>
> If you can give more details about what you are doing with the data, we
> could probably help with your task.
>
> The slowness probably happens because Spark is using the disk while it moves
> the data into a single partition for writing the single file. One thing to
> reconsider is whether you can merge the output files after the process, or
> even pre-partition the data for its final use case.
>
> - Teemu
>
> On 9.3.2018, at 12.23, Md. Rezaul Karim <rezaul.ka...@insight-centre.org>
> wrote:
>
> Dear All,
>
> I have a tiny CSV file, which is around 250MB. There are only 30 columns
> in the DataFrame. Now I'm trying to save the pre-processed DataFrame as
> another CSV file on disk for later usage.
>
> However, I'm getting frustrated because writing the resultant DataFrame is
> taking too long, about 4 to 5 hours. What's more, the size of the file
> written to disk is about 58GB!
>
> Here's the sample code that I tried:
>
> # Using repartition()
> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> # Using coalesce()
> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> Any better suggestion?
>
> ----
> Md. Rezaul Karim, BSc, MSc
> Research Scientist, Fraunhofer FIT, Germany
> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
> eMail: rezaul.ka...@fit.fraunhofer.de
> Tel: +49 241 80-21527
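
P.S. For reference, here is a rough sketch of the pre-processing and the write I described above, using the built-in CSV reader/writer in Spark 2.3. The column names, the imputation value, and the paths are only placeholders, not my actual schema:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.appName("csv-preprocessing").getOrCreate()

    # Read the small CSV (placeholder path).
    df = spark.read.csv("data/input.csv", header=True, inferSchema=True)

    # Null imputation: fill missing values in the categorical columns.
    df = df.fillna("missing", subset=["cat_col1", "cat_col2"])

    # StringIndexer on some columns (placeholder column names).
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
                for c in ["cat_col1", "cat_col2"]]
    indexed = Pipeline(stages=indexers).fit(df).transform(df)

    # Built-in CSV writer, without coalesce(1): every partition writes its own
    # part file in parallel into the output directory.
    indexed.write.option("header", "true").mode("overwrite").csv("data/output_csv")

Writing without coalesce(1) produces a directory of part files; if the non-Spark application really needs a single file, the parts can be concatenated outside Spark afterwards (e.g. with hdfs dfs -getmerge or plain cat), along the lines Teemu suggested.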