Do not do collect(). That brings the results back to the driver. Instead, compute the distinct count and write it out.
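Roughly, in PySpark that looks like this (a minimal sketch; the paths and the a/b/c column names are placeholders, not the original job):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import countDistinct

    spark = SparkSession.builder.getOrCreate()

    # placeholder input path
    df = spark.read.parquet("/path/to/input")

    # the aggregation runs on the executors and yields a single-row result,
    # so nothing large ever comes back to the driver
    (df.select(countDistinct("a", "b", "c").alias("distinct_abc"))
       .write.mode("overwrite")
       .parquet("/path/to/distinct_abc_count"))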
On Tue, 20 Oct 2020 at 6:43 am, Nicolas Paris <nicolas.pa...@riseup.net> wrote:

> > I was caching it because I didn't want to re-execute the DAG when I
> > ran the count query. If you have a spark application with multiple
> > actions, Spark reexecutes the entire DAG for each action unless there
> > is a cache in between. I was trying to avoid reloading 1/2 a terabyte
> > of data. Also, cache should use up executor memory, not driver
> > memory.
>
> why not count the parquet file instead? writing/reading a parquet
> file is more efficient than caching, in my experience.
> if you really need caching you could choose a better strategy such as
> DISK.
>
> Lalwani, Jayesh <jlalw...@amazon.com.INVALID> writes:
>
> > I was caching it because I didn't want to re-execute the DAG when I
> > ran the count query. If you have a spark application with multiple
> > actions, Spark reexecutes the entire DAG for each action unless there
> > is a cache in between. I was trying to avoid reloading 1/2 a terabyte
> > of data. Also, cache should use up executor memory, not driver memory.
> >
> > As it turns out, cache was the problem. I didn't expect cache to take
> > executor memory and spill over to disk. I don't know why it's taking
> > driver memory. The input data has millions of partitions, which results
> > in millions of tasks. Perhaps the high memory usage is a side effect of
> > caching the results of lots of tasks.
> >
> > On 10/19/20, 1:27 PM, "Nicolas Paris" <nicolas.pa...@riseup.net> wrote:
> >
> >     > Before I write the data frame to parquet, I do df.cache. After
> >     > writing the file out, I do df.countDistinct("a", "b", "c").collect()
> >
> >     if you write the df to parquet, why would you also cache it? caching
> >     by default loads into memory. this might affect later use, such as
> >     collect. the resulting GC can be explained by both caching and
> >     collect.
> >
> >     Lalwani, Jayesh <jlalw...@amazon.com.INVALID> writes:
> >
> >     > I have a Dataframe with around 6 billion rows and about 20 columns.
> >     > First of all, I want to write this dataframe out to parquet. Then,
> >     > out of the 20 columns, I have 3 columns of interest, and I want to
> >     > find how many distinct values of those columns there are in the
> >     > file. I don't need the actual distinct values; I just need the
> >     > count. I know that there are around 10-16 million distinct values.
> >     >
> >     > Before I write the data frame to parquet, I do df.cache. After
> >     > writing the file out, I do df.countDistinct("a", "b", "c").collect()
> >     >
> >     > When I run this, I see that the memory usage on my driver steadily
> >     > increases until it starts getting future timeouts. I guess it's
> >     > spending time in GC. Does countDistinct cause this behavior? Does
> >     > Spark try to get all 10 million distinct values into the driver?
> >     > Is countDistinct not recommended for data frames with a large
> >     > number of distinct values?
> >     >
> >     > What's the solution? Should I use approx_count_distinct?
>
> --
> nicolas paris
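Putting the suggestions above together, a rough PySpark sketch: re-read the parquet you just wrote (so no cache is needed) and use approx_count_distinct for the count. The path, the rsd value, and the concat_ws trick for combining the three columns are my assumptions, not part of the original job:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import approx_count_distinct, concat_ws, countDistinct

    spark = SparkSession.builder.getOrCreate()

    # re-read the freshly written parquet instead of caching the source DataFrame
    df = spark.read.parquet("/path/to/output")

    # approx_count_distinct takes a single column, so combine a, b, c into one
    # string first (assumes "|" never appears in the values; note concat_ws
    # also skips nulls, which differs slightly from COUNT(DISTINCT a, b, c)).
    # rsd is the target relative standard deviation of the HyperLogLog++ estimate.
    approx = df.select(
        approx_count_distinct(concat_ws("|", "a", "b", "c"), rsd=0.01)
        .alias("approx_distinct_abc")
    )
    approx.show()

    # if the count has to be exact, this still only brings one number back to
    # the driver; the heavy lifting stays on the executors
    exact = df.select(countDistinct("a", "b", "c").alias("distinct_abc"))
    exact.show()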
--
Best Regards,
Ayan Guha