The confidence calculation is pretty trivial; the work is finding the supports needed. Not sure how to optimize that.
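
To make the ratio concrete: confidence is just supp(X ∪ Y) / supp(X), so once the itemset frequencies exist the division is cheap. A minimal Python sketch, not Spark's implementation; the freq table and confidence helper here are hypothetical:

    # conf(X => Y) = supp(X ∪ Y) / supp(X): a cheap ratio once the
    # itemset frequencies are known. "freq" stands in for the frequency
    # table that an FP-Growth pass produces.
    def confidence(freq, antecedent, consequent):
        return freq[antecedent | consequent] / freq[antecedent]

    freq = {
        frozenset({"a"}): 60,
        frozenset({"a", "b"}): 45,
    }
    print(confidence(freq, frozenset({"a"}), frozenset({"b"})))  # 0.75
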
On Thu, May 7, 2020, 1:12 PM Aditya Addepalli <dyex...@gmail.com> wrote:

> Hi Sean,
>
> 1.
> I was thinking that by specifying the consequent we can (somehow?) skip
> the confidence calculation for all the other consequents.
>
> This would greatly reduce the time taken, as we avoid computation for
> consequents we don't care about.
>
> 2.
> Is limiting rule size even possible? I thought that because of
> FP-Growth's depth-first nature it might not be.
>
> My experience with FP-Growth has largely been in Python, where the API
> is limited. I will take a look at the Scala source code and get back
> to you with more concrete answers.
>
> Thanks & Regards,
> Aditya
>
> On Thu, 7 May, 2020, 11:21 pm Sean Owen, <sro...@gmail.com> wrote:
>
>> Yes, you can get the correct support this way by accounting for how
>> many rows were filtered out, but not the right confidence, as it
>> depends on counting support in rows without the items of interest.
>>
>> But computing confidence depends on computing all that support; how
>> would you optimize it even if you knew the consequent you cared
>> about? Maybe there's a way, sure; I don't know the code well, but it
>> wasn't obvious at a glance how to take advantage of it.
>>
>> I can see how limiting the rule size could help.
>>
>> On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli <dyex...@gmail.com>
>> wrote:
>> >
>> > Hi Sean,
>> >
>> > I understand your approach, but there's a slight problem.
>> >
>> > If we generate rules after filtering for our desired consequent, we
>> > are introducing some bias into our rules. The confidence of the
>> > rules on the filtered input can be very high, but this may not be
>> > the case on the entire dataset. Thus we can get biased rules which
>> > wrongly depict the patterns in the data. This is why I think having
>> > a parameter to specify the consequent would help greatly.
>> >
>> > Reducing the support doesn't really work in my case, simply because
>> > the consequents I am mining for occur very rarely in the data.
>> > Sometimes their frequency can be 1e-4 or 1e-5, so my minSupport has
>> > to be less than that to capture the rules for that consequent.
>> >
>> > Thanks for your reply. Let me know what you think.
>> >
>> > Regards,
>> > Aditya Addepalli
>> >
>> > On Sat, 2 May, 2020, 9:13 pm Sean Owen, <sro...@gmail.com> wrote:
>> >>
>> >> You could just filter the input for sets containing the desired
>> >> item, and discard the rest. That doesn't mean all of the resulting
>> >> item sets will contain that item, and you'd still have to filter,
>> >> but it may be much faster to compute.
>> >> Increasing min support might generally have the effect of smaller
>> >> rules, though it doesn't impose a cap. That could help perf, if
>> >> that's what you're trying to improve.
>> >> I don't know if it's worth new params in the implementation,
>> >> maybe. I think there would have to be an argument that this
>> >> generalizes.
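
A sketch of this pre-filtering workaround with the pyspark.ml.fpm
FPGrowth API; the DataFrame df, its items column, and the target item
"itemX" are assumptions for illustration, and, per the objection above,
confidence computed on the filtered subset is biased:

    from pyspark.ml.fpm import FPGrowth
    from pyspark.sql import functions as F

    # Keep only transactions that contain the item of interest.
    # Support on this subset can be corrected for the dropped rows;
    # confidence cannot, which is the bias discussed above.
    filtered = df.filter(F.array_contains(F.col("items"), "itemX"))

    fp = FPGrowth(itemsCol="items", minSupport=1e-4, minConfidence=0.1)
    model = fp.fit(filtered)
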
>> >> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli
>> >> <dyex...@gmail.com> wrote:
>> >> >
>> >> > Hi Everyone,
>> >> >
>> >> > I was wondering if we could make any enhancements to the
>> >> > FP-Growth algorithm in Spark/PySpark.
>> >> >
>> >> > Many times I am looking for a rule for a particular consequent,
>> >> > so I don't need the rules for all the other consequents. I know
>> >> > I can filter the rules to get the desired output, but if I could
>> >> > specify this in the algorithm itself, the execution time would
>> >> > be reduced drastically.
>> >> >
>> >> > Also, sometimes I want the rules to be small, maybe of length
>> >> > 5-6. Again, I can filter on length, but I was wondering if we
>> >> > could take this as an input to the algorithm. Given the
>> >> > depth-first nature of FP-Growth, I am not sure that is feasible.
>> >> >
>> >> > I am willing to work on these suggestions if someone thinks they
>> >> > are feasible. Thanks to the dev team for all the hard work!
>> >> >
>> >> > Regards,
>> >> > Aditya Addepalli
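
For reference, the post-hoc filtering described in that first message
looks roughly like this today; model is assumed to be an FPGrowthModel
fit on the full, unfiltered dataset, and "itemX" and the length cap of
6 are illustrative:

    from pyspark.sql import functions as F

    rules = model.associationRules  # antecedent, consequent, confidence, ...

    wanted = (
        rules
        # Only rules predicting the consequent of interest.
        .filter(F.array_contains(F.col("consequent"), "itemX"))
        # Spark rules carry a single-item consequent, so capping the
        # antecedent at 5 keeps total rule length at 6.
        .filter(F.size(F.col("antecedent")) <= 5)
    )
    wanted.show(truncate=False)

Every rule is still generated and scored before these filters run,
which is exactly the work that a consequent parameter or a rule-length
cap inside the algorithm would avoid.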