Re: Spark FP-growth

Aditya Addepalli Thu, 07 May 2020 11:29:25 -0700

Absolutely. What I meant to say is that the confidence calculation depends
on the support calculations, so computing fewer supports is what would
reduce the time. Thanks for pointing that out.
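
To make the dependency concrete, here is a minimal sketch in plain Python. It is not Spark's implementation: the toy transactions and the helper names `support` and `confidence` are made up for illustration. It shows that confidence is just a ratio of two supports, that filtering the input to rows containing the consequent inflates confidence, and that support can still be recovered by rescaling for the rows filtered out.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent | consequent) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy data: "x" is the consequent of interest.
transactions = [frozenset(t) for t in (
    {"a", "x"},
    {"a", "x"},
    {"a"},
    {"a"},
    {"b"},
    {"b", "x"},
)]

# Confidence of {a} -> {x} on the whole dataset: 2 of the 4 rows with "a".
full_conf = confidence(transactions, {"a"}, {"x"})

# Keep only rows containing "x", then recompute. Every remaining row with
# "a" also has "x", so the confidence is inflated.
filtered = [t for t in transactions if "x" in t]
filtered_conf = confidence(filtered, {"a"}, {"x"})

# Support of {a, x} is still recoverable from the filtered data by
# rescaling with the fraction of rows that survived the filter.
full_support = support(transactions, {"a", "x"})
rescaled_support = support(filtered, {"a", "x"}) * len(filtered) / len(transactions)

print(full_conf, filtered_conf, full_support, rescaled_support)
```

The denominator, support({a}), needs the rows that do not contain "x", which is why the filtered input alone cannot give the right confidence.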


On Thu, 7 May, 2020, 11:56 pm Sean Owen, <sro...@gmail.com> wrote:

> The confidence calculation is pretty trivial; the work is in finding the
> supports needed. I'm not sure how to optimize that.
>
> On Thu, May 7, 2020, 1:12 PM Aditya Addepalli <dyex...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> 1.
>> I was thinking that by specifying the consequent we can (somehow?) skip
>> the confidence calculation for all the other consequents.
>>
>> This would greatly reduce the time taken as we avoid computation for
>> consequents we don't care about.
>>
>>
>> 2.
>> Is limiting rule size even possible? I thought that because of FP-growth's
>> depth-first nature it might not be possible.
>>
>> My experience with FP-growth has largely been in Python, where the API is
>> limited. I will take a look at the Scala source code and get back to you
>> with more concrete answers.
>>
>> Thanks & Regards,
>> Aditya
>>
>> On Thu, 7 May, 2020, 11:21 pm Sean Owen, <sro...@gmail.com> wrote:
>>
>>> Yes, you can get the correct support this way by accounting for how
>>> many rows were filtered out, but not the right confidence, as it
>>> depends on counting support in rows without the items of interest.
>>>
>>> But computing confidence depends on computing all that support; how
>>> would you optimize it even if you knew the consequent you cared about?
>>> Maybe there's a way; I don't know the code well, but it wasn't
>>> obvious at a glance how to take advantage of it.
>>>
>>> I can see how limiting the rule size could help.
>>>
>>> On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli <dyex...@gmail.com>
>>> wrote:
>>> >
>>> > Hi Sean,
>>> >
>>> > I understand your approach, but there's a slight problem.
>>> >
>>> > If we generate rules after filtering for our desired consequent, we
>>> > are introducing some bias into our rules.
>>> > The confidence of the rules on the filtered input can be very high, but
>>> > this may not be the case on the entire dataset.
>>> > Thus we can get biased rules that misrepresent the patterns in the
>>> > data.
>>> > This is why I think having a parameter to specify the consequent would
>>> > help greatly.
>>> >
>>> > Reducing the support doesn't really work in my case simply because
>>> > rules for the consequents I am mining for occur very rarely in the data.
>>> > Sometimes this can be 1e-4 or 1e-5, so my minSupport has to be less
>>> > than that to capture the rules for that consequent.
>>> >
>>> > Thanks for your reply. Let me know what you think.
>>> >
>>> > Regards.
>>> > Aditya Addepalli
>>> >
>>> >
>>> >
>>> >
>>> > On Sat, 2 May, 2020, 9:13 pm Sean Owen, <sro...@gmail.com> wrote:
>>> >>
>>> >> You could just filter the input for sets containing the desired item,
>>> >> and discard the rest. That doesn't mean all of the item sets have that
>>> >> item, and you'd still have to filter, but may be much faster to
>>> >> compute.
>>> >> Increasing min support might generally have the effect of smaller
>>> >> rules, though it doesn't impose a cap. That could help perf, if that's
>>> >> what you're trying to improve.
>>> >> I don't know if it's worth new params in the implementation, maybe. I
>>> >> think there would have to be an argument this generalizes.
>>> >>
>>> >> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli <dyex...@gmail.com>
>>> >> wrote:
>>> >> >
>>> >> > Hi Everyone,
>>> >> >
>>> >> > I was wondering if we could make any enhancements to the FP-Growth
>>> >> > algorithm in Spark/PySpark.
>>> >> >
>>> >> > Many times I am looking for a rule for a particular consequent, so
>>> >> > I don't need the rules for all the other consequents. I know I can filter
>>> >> > the rules to get the desired output, but if I could input this in the
>>> >> > algorithm itself, the execution time would drop drastically.
>>> >> >
>>> >> > Also, sometimes I want the rules to be small, maybe of length 5-6.
>>> >> > Again, I can filter on length, but I was wondering if we could take this as
>>> >> > input to the algorithm. Given the depth-first nature of FP-Growth, I am not
>>> >> > sure that is feasible.
>>> >> >
>>> >> > I am willing to work on these suggestions if someone thinks they
>>> >> > are feasible. Thanks to the dev team for all the hard work!
>>> >> >
>>> >> > Regards,
>>> >> > Aditya Addepalli
>>>
>>
