But overall, I think the original approach is not correct.
If you end up with a single file in the tens of GB, the approach probably
needs to be reworked.
I don't see why you can't just write multiple CSV files using Spark, and
then concatenate them without Spark.
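For the concatenation step, one option (just a sketch, not from the thread; paths are placeholders and it relies on Hadoop 2.x's FileUtil.copyMerge) is to merge the part files outside of any Spark job:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

  val conf = new Configuration()
  val fs = FileSystem.get(conf)

  // Merge all part-* files written by Spark into one file (placeholder paths).
  // Note: if each part file was written with a header, the headers get concatenated too.
  FileUtil.copyMerge(
    fs, new Path("/data/output-parts"),       // directory of part files
    fs, new Path("/data/output-merged.csv"),  // single destination file
    false,                                    // keep the source files
    conf, null)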
On Fri, Mar 9, 2018 at 10:02 AM, Vadim Semenov wrote:
To: spark users
Subject: Re: Writing a DataFrame is taking too long and huge space
because `coalesce` gets propagated further up in the DAG in the last stage, so
your last stage only has one task.
You need to break your DAG so your expensive operations would be in a previous
stage, before the stage with `.coalesce(1)`.
You can use `.checkpoint` for that.
`df.sort(…).coalesce(1).write...` — `coalesce` will make `sort` run with
only one partition, so sorting will take a lot of time.
`df.sort(…).repartition(1).write...` — `repartition` will add an explicit
stage, but the sort order will be lost, since it's a repartition.
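A minimal sketch of breaking the DAG with `.checkpoint` before the `coalesce(1)` (Scala, Spark 2.x; the column name and paths are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("single-file-write").getOrCreate()
  // checkpoint() needs a checkpoint directory (placeholder path)
  spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

  val df = spark.read.option("header", "true").csv("/data/input")  // placeholder input

  val prepared = df.sort("someColumn")  // the expensive work keeps its normal parallelism
    .checkpoint()                       // materializes the result and cuts the lineage here

  prepared
    .coalesce(1)                        // only the final write runs as a single task
    .write.option("header", "true")
    .csv("/data/output-single")         // placeholder output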
I would suggest repartitioning it to a reasonable number of partitions, maybe
500, and saving it to some intermediate working directory.
Finally, read all the files from this working directory, coalesce to 1, and
save to the final location.
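A rough sketch of that two-step approach (the partition count of 500 and the paths are placeholders; `spark` and `df` are assumed to be the active session and the preprocessed DataFrame):

  // Step 1: repartition and write to an intermediate working directory
  df.repartition(500)
    .write.option("header", "true")
    .csv("/tmp/intermediate-output")

  // Step 2: read the intermediate files back, coalesce to one partition, write the final output
  spark.read.option("header", "true")
    .csv("/tmp/intermediate-output")
    .coalesce(1)
    .write.option("header", "true")
    .csv("/data/final-output")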
Thanks
Deepak
On Fri, Mar 9, 2018, 20:12 Vadim Semenov wrote:
because `coalesce` gets propagated further up in the DAG in the last stage,
so your last stage only has one task.
You need to break your DAG so your expensive operations would be in a
previous stage, before the stage with `.coalesce(1)`.
On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim wrote:
Hi All,
Thanks for the prompt response. Really appreciated! Here is some more info:
1. Spark version: 2.3.0
2. vCore: 8
3. RAM: 32GB
4. Deploy mode: Spark standalone
*Operation performed:* I did transformations using StringIndexer on some
columns and null imputations. That's all.
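For context, a hypothetical sketch of that kind of preprocessing (the column names here are made up, not from the thread):

  import org.apache.spark.ml.feature.StringIndexer

  // Simple null imputation on a string column and a numeric column
  val filled = df.na.fill("unknown", Seq("category"))
                 .na.fill(0.0, Seq("amount"))

  // StringIndexer maps the string column to numeric label indices
  val indexed = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(filled)
    .transform(filled)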
*Why writing back
Sounds like you're doing something more than just writing the same file back to
disk; what does your preprocessing consist of?
Sometimes you can save a lot of space by using other formats, but now we're
talking about an over 200x increase in file size, so depending on the
transformations on the data you might n
Which version of Spark are you using?
The reason for asking is that from Spark 2.x, CSV is a built-in data source,
so there is no need to save it with the com.databricks.spark.csv package.
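For example (a sketch; the path is a placeholder and `df` is the preprocessed DataFrame):

  // Spark 2.x: the CSV writer is built in
  df.write.option("header", "true").csv("/data/preprocessed")

  // Spark 1.x needed the external package instead:
  // df.write.format("com.databricks.spark.csv").option("header", "true").save("/data/preprocessed")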
Moreover, the time taken for this simple task depends very much on your
cluster health. Could you please provide the
Hello, try to use the Parquet format with compression (like snappy or lz4), so
the produced files will be smaller and it will generate less I/O. Moreover,
Parquet is normally much faster than CSV to read for further operations.
Another possible format is ORC.
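A quick sketch of writing compressed Parquet or ORC (paths are placeholders; `df` is the preprocessed DataFrame):

  // Parquet with snappy compression (lz4, mentioned above, depends on the Spark/Parquet version)
  df.write
    .option("compression", "snappy")
    .parquet("/data/preprocessed-parquet")

  // ORC as an alternative columnar format
  df.write
    .option("compression", "snappy")
    .orc("/data/preprocessed-orc")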
Kind Regards
Matteo
Dear All,
I have a tiny CSV file, which is around 250 MB. There are only 30 columns in
the DataFrame. Now I'm trying to save the pre-processed DataFrame as
another CSV file on disk for later usage.
However, I'm getting pissed off as writing the resultant DataFrame is
taking too long, which is a