GitHub user jackylk opened a pull request:
https://github.com/apache/spark/pull/2847
[SPARK-4001][MLlib] adding apriori algorithm for frequent item set mining
in Spark
Apriori is the classic algorithm for frequent item set mining in a
transactional data set. It would be useful to have the Apriori algorithm in
MLlib in Spark. This PR adds an implementation of it.
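For readers unfamiliar with the algorithm, here is a minimal single-machine sketch of the level-wise Apriori idea in plain Python. It is not the code in this PR and does not use Spark; the function name `apriori` and the `min_support` parameter (a fraction of the number of transactions) are assumptions made only for this illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(items): count} for every frequent item set.

    `min_support` is a fraction of the number of transactions
    (hypothetical parameter, chosen for this sketch).
    """
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    def frequent(candidates):
        # Count how many transactions contain each candidate item set,
        # then keep only the candidates that reach the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    # Level 1: frequent single items.
    current = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(current)
    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-sets into k-sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = frequent(candidates)
        result.update(current)
        k += 1
    return result


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
result = apriori(transactions, 0.6)
```

With the five transactions above and a 0.6 threshold, each single item and each pair is frequent, while the triple `{a, b, c}` appears in only two transactions and is dropped.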
There is one point where I am not sure the implementation is the most
efficient. To filter out the eligible frequent item sets, I currently use a
cartesian operation on two RDDs to calculate the degree of support of each
item set. I am not sure whether it would be better to use a broadcast
variable to achieve the same.
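To make the two options concrete, here is a rough model of both support-counting strategies in plain Python. It does not use Spark and is not the code in this PR: `support_by_cartesian` only imitates pairing every candidate with every transaction, and `support_by_broadcast` only imitates giving each partition a read-only copy of the candidates; both function names and the list-of-lists `partitions` layout are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def support_by_cartesian(candidates, transactions):
    # Imitates a cartesian product of the two collections: every
    # (candidate, transaction) pair is generated and then tested.
    counts = Counter({c: 0 for c in candidates})
    for c, t in product(candidates, transactions):
        if c <= t:
            counts[c] += 1
    return dict(counts)


def support_by_broadcast(candidates, partitions):
    # Imitates a broadcast: each partition sees the full candidate list,
    # counts matches over its own transactions, and the partial counts
    # are merged at the end.
    total = Counter({c: 0 for c in candidates})
    for part in partitions:
        local = Counter()
        for t in part:
            for c in candidates:
                if c <= t:
                    local[c] += 1
        total.update(local)  # Counter.update adds counts together
    return dict(total)


candidates = [frozenset("ab"), frozenset("ac"), frozenset("d")]
transactions = [frozenset("abc"), frozenset("ab"), frozenset("bc")]
partitions = [transactions[:2], transactions[2:]]

by_cartesian = support_by_cartesian(candidates, transactions)
by_broadcast = support_by_broadcast(candidates, partitions)
```

Both functions produce the same counts. The difference that matters on a cluster is where the pairs are formed: the broadcast style only needs the candidate set to fit in memory on each worker, which is likely the cheaper option when the candidate set is small relative to the transactions.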
I will add an example of using this algorithm if required.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jackylk/spark apriori
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2847.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2847
----
commit da2cba7e063745aacef74ff555e7bd7c55a24f56
Author: Jacky Li <[email protected]>
Date: 2014-10-19T09:19:27Z
adding apriori algorithm for frequent item set mining in Spark
commit 889b33fdfabcc222c82e3bce619aeb6c7031fc58
Author: Jacky Li <[email protected]>
Date: 2014-10-19T09:31:04Z
modify per scalastyle check
----