The confidence calculation is pretty trivial; the work is finding the supports needed. Not sure how to optimize that.
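
To make the ratio concrete: confidence is just supp(X ∪ Y) / supp(X), so once the itemset frequencies exist the division is cheap. A minimal Python sketch, not Spark's implementation; the freq table and confidence helper here are hypothetical:

    # conf(X => Y) = supp(X ∪ Y) / supp(X): a cheap ratio once the
    # itemset frequencies are known. "freq" stands in for the frequency
    # table that an FP-Growth pass produces.
    def confidence(freq, antecedent, consequent):
        return freq[antecedent | consequent] / freq[antecedent]

    freq = {
        frozenset({"a"}): 60,
        frozenset({"a", "b"}): 45,
    }
    print(confidence(freq, frozenset({"a"}), frozenset({"b"})))  # 0.75
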
On Thu, May 7, 2020, 1:12 PM Aditya Addepalli <dyex...@gmail.com> wrote:

> Hi Sean,
>
> 1.
> I was thinking that by specifying the consequent we can (somehow?) skip
> the confidence calculation for all the other consequents.
>
> This would greatly reduce the time taken, as we avoid computation for
> consequents we don't care about.
>
> 2.
> Is limiting rule size even possible? I thought that because of
> FP-Growth's depth-first nature it might not be.
>
> My experience with FP-Growth has largely been in Python, where the API
> is limited. I will take a look at the Scala source code and get back
> to you with more concrete answers.
>
> Thanks & Regards,
> Aditya
>
> On Thu, 7 May, 2020, 11:21 pm Sean Owen, <sro...@gmail.com> wrote:
>
>> Yes, you can get the correct support this way by accounting for how
>> many rows were filtered out, but not the right confidence, as it
>> depends on counting support in rows without the items of interest.
>>
>> But computing confidence depends on computing all that support; how
>> would you optimize it even if you knew the consequent you cared
>> about? Maybe there's a way, sure; I don't know the code well, but it
>> wasn't obvious at a glance how to take advantage of it.
>>
>> I can see how limiting the rule size could help.
>>
>> On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli <dyex...@gmail.com>
>> wrote:
>> >
>> > Hi Sean,
>> >
>> > I understand your approach, but there's a slight problem.
>> >
>> > If we generate rules after filtering for our desired consequent, we
>> > are introducing some bias into our rules. The confidence of the
>> > rules on the filtered input can be very high, but this may not be
>> > the case on the entire dataset. Thus we can get biased rules which
>> > wrongly depict the patterns in the data. This is why I think having
>> > a parameter to specify the consequent would help greatly.
>> >
>> > Reducing the support doesn't really work in my case, simply because
>> > the consequents I am mining for occur very rarely in the data.
>> > Sometimes their frequency can be 1e-4 or 1e-5, so my minSupport has
>> > to be less than that to capture the rules for that consequent.
>> >
>> > Thanks for your reply. Let me know what you think.
>> >
>> > Regards,
>> > Aditya Addepalli
>> >
>> > On Sat, 2 May, 2020, 9:13 pm Sean Owen, <sro...@gmail.com> wrote:
>> >>
>> >> You could just filter the input for sets containing the desired
>> >> item, and discard the rest. That doesn't mean all of the resulting
>> >> item sets will contain that item, and you'd still have to filter,
>> >> but it may be much faster to compute.
>> >> Increasing min support might generally have the effect of smaller
>> >> rules, though it doesn't impose a cap. That could help perf, if
>> >> that's what you're trying to improve.
>> >> I don't know if it's worth new params in the implementation,
>> >> maybe. I think there would have to be an argument that this
>> >> generalizes.
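
A sketch of this pre-filtering workaround with the pyspark.ml.fpm
FPGrowth API; the DataFrame df, its items column, and the target item
"itemX" are assumptions for illustration, and, per the objection above,
confidence computed on the filtered subset is biased:

    from pyspark.ml.fpm import FPGrowth
    from pyspark.sql import functions as F

    # Keep only transactions that contain the item of interest.
    # Support on this subset can be corrected for the dropped rows;
    # confidence cannot, which is the bias discussed above.
    filtered = df.filter(F.array_contains(F.col("items"), "itemX"))

    fp = FPGrowth(itemsCol="items", minSupport=1e-4, minConfidence=0.1)
    model = fp.fit(filtered)
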
>> >> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli
>> >> <dyex...@gmail.com> wrote:
>> >> >
>> >> > Hi Everyone,
>> >> >
>> >> > I was wondering if we could make any enhancements to the
>> >> > FP-Growth algorithm in Spark/PySpark.
>> >> >
>> >> > Many times I am looking for a rule for a particular consequent,
>> >> > so I don't need the rules for all the other consequents. I know
>> >> > I can filter the rules to get the desired output, but if I could
>> >> > specify this in the algorithm itself, the execution time would
>> >> > be reduced drastically.
>> >> >
>> >> > Also, sometimes I want the rules to be small, maybe of length
>> >> > 5-6. Again, I can filter on length, but I was wondering if we
>> >> > could take this as an input to the algorithm. Given the
>> >> > depth-first nature of FP-Growth, I am not sure that is feasible.
>> >> >
>> >> > I am willing to work on these suggestions if someone thinks they
>> >> > are feasible. Thanks to the dev team for all the hard work!
>> >> >
>> >> > Regards,
>> >> > Aditya Addepalli
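
For reference, the post-hoc filtering described in that first message
looks roughly like this today; model is assumed to be an FPGrowthModel
fit on the full, unfiltered dataset, and "itemX" and the length cap of
6 are illustrative:

    from pyspark.sql import functions as F

    rules = model.associationRules  # antecedent, consequent, confidence, ...

    wanted = (
        rules
        # Only rules predicting the consequent of interest.
        .filter(F.array_contains(F.col("consequent"), "itemX"))
        # Spark rules carry a single-item consequent, so capping the
        # antecedent at 5 keeps total rule length at 6.
        .filter(F.size(F.col("antecedent")) <= 5)
    )
    wanted.show(truncate=False)

Every rule is still generated and scored before these filters run,
which is exactly the work that a consequent parameter or a rule-length
cap inside the algorithm would avoid.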