Do not use collect(); it brings the results back to the driver. Instead,
compute the distinct count and write it out.
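
Something along these lines (a rough Scala sketch; the DataFrame name df,
the column names and the output path are placeholders, not from your job):

  import org.apache.spark.sql.functions.countDistinct

  // aggregate to a single-row DataFrame holding the distinct count of (a, b, c)
  val distinctCount = df.agg(countDistinct("a", "b", "c").as("distinct_abc"))

  // write the tiny result out instead of pulling it back with collect()
  distinctCount.write.mode("overwrite").parquet("/tmp/distinct_abc_count")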

On Tue, 20 Oct 2020 at 6:43 am, Nicolas Paris <nicolas.pa...@riseup.net>
wrote:

> > I was caching it because I didn't want to re-execute the DAG when I
> > ran the count query. If you have a spark application with multiple
> > actions, Spark reexecutes the entire DAG for each action unless there
> > is a cache in between. I was trying to avoid reloading 1/2 a terabyte
> > of data.  Also, cache should use up executor memory, not driver
> > memory.
> why not count the parquet file instead? writing/reading parquet files
> is more efficient than caching, in my experience.
> if you really need caching you could choose a better strategy such as
> DISK.
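
A sketch of both options above (the paths, the df name, and the storage
level are illustrative assumptions, not tested against this job):

  import org.apache.spark.sql.functions.countDistinct
  import org.apache.spark.storage.StorageLevel

  // option 1: write once, then read the parquet back for the count, so the
  // original half-terabyte DAG is not re-executed and nothing stays cached
  df.write.mode("overwrite").parquet("/data/out")
  val counts = spark.read.parquet("/data/out")
    .agg(countDistinct("a", "b", "c").as("distinct_abc"))

  // option 2: if caching is really needed, keep the cached blocks on disk only
  df.persist(StorageLevel.DISK_ONLY)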
>
> Lalwani, Jayesh <jlalw...@amazon.com.INVALID> writes:
>
> > I was caching it because I didn't want to re-execute the DAG when I ran
> > the count query. If you have a spark application with multiple actions,
> > Spark reexecutes the entire DAG for each action unless there is a cache
> > in between. I was trying to avoid reloading 1/2 a terabyte of data.
> > Also, cache should use up executor memory, not driver memory.
> >
> > As it turns out, cache was the problem. I didn't expect cache to take
> > executor memory and spill over to disk. I don't know why it's taking
> > driver memory. The input data has millions of partitions, which results
> > in millions of tasks. Perhaps the high memory usage is a side effect of
> > caching the results of lots of tasks.
> >
> > On 10/19/20, 1:27 PM, "Nicolas Paris" <nicolas.pa...@riseup.net> wrote:
> >
> >
> >     > Before I write the data frame to parquet, I do df.cache. After
> >     > writing the file out, I do df.countDistinct("a", "b", "c").collect()
> >     if you write the df to parquet, why would you also cache it? caching
> >     by default loads the data into memory. this might affect later use,
> >     such as collect. the resulting GC can be explained by both caching
> >     and collect.
> >
> >
> >     Lalwani, Jayesh <jlalw...@amazon.com.INVALID> writes:
> >
> >     > I have a DataFrame with around 6 billion rows and about 20 columns.
> >     > First of all, I want to write this DataFrame out to parquet. Then,
> >     > out of the 20 columns, I have 3 columns of interest, and I want to
> >     > find how many distinct values of those columns are in the file. I
> >     > don't need the actual distinct values, I just need the count. I know
> >     > that there are around 10-16 million distinct values.
> >     >
> >     > Before I write the data frame to parquet, I do df.cache. After
> >     > writing the file out, I do df.countDistinct("a", "b", "c").collect()
> >     >
> >     > When I run this, I see that the memory usage on my driver steadily
> >     > increases until it starts getting future timeouts. I guess it's
> >     > spending time in GC. Does countDistinct cause this behavior? Does
> >     > Spark try to get all 10 million distinct values into the driver? Is
> >     > countDistinct not recommended for data frames with a large number of
> >     > distinct values?
> >     >
> >     > What's the solution? Should I use approx_count_distinct?
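
(If an approximate answer is acceptable, a sketch of the approx_count_distinct
route, hashing the three columns together through a struct; the 0.05 relative
error and the column names are illustrative assumptions:)

  import org.apache.spark.sql.functions.{approx_count_distinct, struct}

  // HyperLogLog++ estimate of the number of distinct (a, b, c) tuples
  val approxCount = df.agg(
    approx_count_distinct(struct("a", "b", "c"), 0.05).as("approx_distinct_abc"))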
> >
> >
> >     --
> >     nicolas paris
> >
>
>
> --
> nicolas paris
>
>
--
Best Regards,
Ayan Guha
