Hi Sean,

1. I was thinking that by specifying the consequent we could (somehow?) skip the confidence calculation for all the other consequents. This would greatly reduce the running time, since we avoid computation for consequents we don't care about (a sketch of what I mean is below).

2. Is limiting rule size even possible? I thought that, because of FP-Growth's depth-first nature, it might not be.

My experience with FP-Growth has largely been in Python, where the API is limited. I will take a look at the Scala source code and get back to you with more concrete answers.
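In the meantime, here is a rough PySpark sketch of the workaround I mean; the toy data, thresholds, and target item are all made up for illustration:

from pyspark.ml.fpm import FPGrowth
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

spark = SparkSession.builder.getOrCreate()

# Toy transactions; in practice this is the full dataset.
transactions = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"]), (3, ["b", "c"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.1, minConfidence=0.5)
model = fp.fit(transactions)

# associationRules has already computed confidence for every consequent;
# this filter only throws rows away after all that work is done.
target = "c"  # the one consequent I actually care about
rules = model.associationRules.filter(array_contains("consequent", target))
rules.show()

Of course, confidence(X => Y) = support(X ∪ Y) / support(X), so even a consequent-aware implementation still needs the antecedent supports from the mined itemsets; maybe the gain is only in skipping rule generation for the other consequents. I'll confirm once I've read the source.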
Thanks & Regards,
Aditya

On Thu, 7 May, 2020, 11:21 pm Sean Owen, <sro...@gmail.com> wrote:

> Yes, you can get the correct support this way by accounting for how
> many rows were filtered out, but not the right confidence, as it
> depends on counting support in rows without the items of interest.
>
> But computing confidence depends on computing all that support; how
> would you optimize it even if you knew the consequent you cared about?
> Maybe there's a way, sure; I don't know the code well, but it wasn't
> obvious at a glance how to take advantage of it.
>
> I can see how limiting the rule size could help.
>
> On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli <dyex...@gmail.com> wrote:
> >
> > Hi Sean,
> >
> > I understand your approach, but there's a slight problem.
> >
> > If we generate rules after filtering for our desired consequent, we are
> > introducing some bias into our rules. The confidence of the rules on the
> > filtered input can be very high, but this may not be the case on the
> > entire dataset. Thus we can get biased rules which wrongly depict the
> > patterns in the data. This is why I think having a parameter to specify
> > the consequent would help greatly.
> >
> > Reducing the support doesn't really work in my case, simply because the
> > consequents I am mining for occur very rarely in the data. Sometimes
> > their support is as low as 1e-4 or 1e-5, so my minSupport has to be less
> > than that to capture the rules for those consequents.
> >
> > Thanks for your reply. Let me know what you think.
> >
> > Regards,
> > Aditya Addepalli
> >
> > On Sat, 2 May, 2020, 9:13 pm Sean Owen, <sro...@gmail.com> wrote:
> >>
> >> You could just filter the input for sets containing the desired item,
> >> and discard the rest. That doesn't mean all of the item sets have that
> >> item, and you'd still have to filter, but it may be much faster to
> >> compute.
> >> Increasing min support might generally have the effect of smaller
> >> rules, though it doesn't impose a cap. That could help perf, if that's
> >> what you're trying to improve.
> >> I don't know if it's worth new params in the implementation, maybe. I
> >> think there would have to be an argument this generalizes.
> >>
> >> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli <dyex...@gmail.com> wrote:
> >> >
> >> > Hi Everyone,
> >> >
> >> > I was wondering if we could make any enhancements to the FP-Growth
> >> > algorithm in Spark/PySpark.
> >> >
> >> > Many times I am looking for a rule with a particular consequent, so I
> >> > don't need the rules for all the other consequents. I know I can filter
> >> > the rules to get the desired output, but if I could pass this to the
> >> > algorithm itself, the execution time would reduce drastically.
> >> >
> >> > Also, sometimes I want the rules to be small, maybe of length 5-6.
> >> > Again, I can filter on length, but I was wondering if we could take
> >> > this as input to the algorithm. Given the depth-first nature of
> >> > FP-Growth, I am not sure that is feasible.
> >> >
> >> > I am willing to work on these suggestions, if someone thinks they are
> >> > feasible. Thanks to the dev team for all the hard work!
> >> >
> >> > Regards,
> >> > Aditya Addepalli