How much RAM are you giving the driver? Collecting 17,000 items shouldn't fail unless the driver memory is set too low.
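A rough back-of-envelope check of that claim, using the figures from the quoted message. The per-entry size is my assumption, not a measured value, so treat the result as an order-of-magnitude sketch rather than a profile of what Spark actually allocates:

```python
import math

# Figures taken from the quoted message below.
num_transactions = 200_000
avg_items_per_transaction = 50
num_unique_items = 17_000
min_support = 0.001
driver_memory_bytes = 12 * 1024**3  # the 12G given to the PySpark process

# ASSUMPTION (not from the thread): each collected (item, count) entry
# costs roughly 200 bytes on the driver, including object overhead.
ASSUMED_BYTES_PER_ENTRY = 200

# An item must appear in at least this many transactions to be frequent.
min_count = math.ceil(min_support * num_transactions)

# Total item occurrences scanned across all transactions.
total_item_occurrences = num_transactions * avg_items_per_transaction

# Worst case: every unique item is frequent and gets collected.
max_collected_bytes = num_unique_items * ASSUMED_BYTES_PER_ENTRY

print("minimum transaction count per frequent item:", min_count)
print("total item occurrences scanned:", total_item_occurrences)
print("worst-case collected size (MB):", max_collected_bytes / 1e6)
print("share of driver memory:", max_collected_bytes / driver_memory_bytes)
```

Even with a generous per-entry estimate, the collected single items come to a few megabytes, which is why the driver memory setting is the first thing to check.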

Regards
Sab

On 13-Jan-2016 6:14 am, "Ritu Raj Tiwari" <[email protected]> wrote:

> Folks:
> We are running into a problem where FPGrowth seems to choke on data sets
> that we think are not too large. We have about 200,000 transactions. Each
> transaction is composed of on an average 50 items. There are about 17,000
> unique item (SKUs) that might show up in any transaction.
>
> When running locally with 12G ram given to the PySpark process, the
> FPGrowth code fails with out of memory error for minSupport of 0.001. The
> failure occurs when we try to enumerate and save the frequent itemsets.
> Looking at the FPGrowth code (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
> it seems this is because the genFreqItems() method tries to collect() all
> items. Is there a way the code could be rewritten so it does not try to
> collect and therefore store all frequent item sets in memory?
>
> Thanks for any insights.
>
> -Raj
