Re: FPGrowth does not handle large result sets

Sabarish Sasidharan Tue, 12 Jan 2016 18:51:07 -0800

How much RAM are you giving to the driver? Collecting 17,000 items
shouldn't fail unless your driver memory is too low.
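
In case it helps, here is a minimal sketch of one way to raise driver memory
when PySpark is started from a plain Python process. It assumes the driver JVM
is launched by PySpark itself and reads the PYSPARK_SUBMIT_ARGS environment
variable at startup. The helper name driver_memory_args and the 12g value are
only illustrative. The variable has to be set before the SparkContext is
created, since driver memory cannot be changed once the JVM is running.

```python
import os


def driver_memory_args(memory="12g"):
    """Build a PYSPARK_SUBMIT_ARGS value that requests `memory` for the driver.

    Hypothetical helper for illustration. The trailing "pyspark-shell" token
    is what PySpark expects at the end of this variable.
    """
    return "--driver-memory {} pyspark-shell".format(memory)


# Must run before any SparkContext is created in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = driver_memory_args("12g")
```

When launching through spark-submit instead, passing --driver-memory 12g on
the command line should have the same effect.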


Regards
Sab
On 13-Jan-2016 6:14 am, "Ritu Raj Tiwari" <[email protected]>
wrote:

> Folks:
> We are running into a problem where FPGrowth seems to choke on data sets
> that we think are not too large. We have about 200,000 transactions. Each
> transaction is composed of 50 items on average. There are about 17,000
> unique items (SKUs) that might show up in any transaction.
>
> When running locally with 12 GB of RAM given to the PySpark process, the
> FPGrowth code fails with an out-of-memory error for a minSupport of 0.001.
> The failure occurs when we try to enumerate and save the frequent itemsets.
> Looking at the FPGrowth code (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
> it seems this is because the genFreqItems() method tries to collect() all
> items. Is there a way the code could be rewritten so it does not try to
> collect and therefore store all frequent itemsets in memory?
>
> Thanks for any insights.
>
> -Raj
>
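
For a sense of scale, the numbers quoted above can be worked through with a
short back-of-envelope sketch. It assumes the minimum count is the ceiling of
minSupport times the number of transactions, which is my reading of the linked
FPGrowth.scala and may differ between versions. The function names are
hypothetical. The second function relies only on the fact that every non-empty
subset of a frequent itemset is itself frequent, so the result set can be far
larger than the 17,000 unique items.

```python
import math


def min_support_count(num_transactions, min_support):
    """Smallest number of transactions an itemset must appear in.

    Assumption: threshold = ceil(min_support * num_transactions).
    """
    return math.ceil(num_transactions * min_support)


def implied_frequent_subsets(itemset_size):
    """Non-empty subsets of one frequent itemset, all of which are frequent."""
    return 2 ** itemset_size - 1


# Figures taken from the quoted message.
threshold = min_support_count(200000, 0.001)
item_occurrences = 200000 * 50  # transactions times average items each

print(threshold)
print(item_occurrences)
print(implied_frequent_subsets(20))
```

So a low threshold over long transactions can yield millions of itemsets even
when the list of frequent single items is small.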
