[ https://issues.apache.org/jira/browse/FLINK-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597355#comment-14597355 ]
ASF GitHub Bot commented on FLINK-1727:
---------------------------------------

GitHub user sachingoel0101 opened a pull request:

    https://github.com/apache/flink/pull/861

    [Flink-2030][ml]Online Histogram: Discrete and Categorical

    This implements the Online Histograms for both categorical and
    continuous data.
    For continuous data, we emulate a continuous probability distribution
    which supports finding the cumulative sum up to a particular value, and
    finding the value up to a specific cumulative probability [Quantiles].
    For categorical fields, we emulate a probability mass function which
    supports finding the probability associated with every class.
    The continuous histogram follows this paper:
    http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf

    Note: This is a sub-task of
    https://issues.apache.org/jira/browse/FLINK-1727 which already has a PR
    pending review at https://github.com/apache/flink/pull/710.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachingoel0101/flink online_histogram

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/861.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #861

----
commit ec50b4bb4faf91570724b4aa79783936d0a9487f
Author: Sachin Goel <sachingoel0...@gmail.com>
Date:   2015-06-23T08:40:57Z

    Online Histogram: Discrete and Categorical, Test Suites included

----

> Add decision tree to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1727
>                 URL: https://issues.apache.org/jira/browse/FLINK-1727
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Sachin Goel
>              Labels: ML
>
> Decision trees are widely used for classification and regression tasks. Thus,
> it would be worthwhile to add support for them to Flink's machine learning
> library.
> A streaming parallel decision tree learning algorithm has been proposed by
> Ben-Haim and Tom-Tov [1]. This could perhaps be adapted to a batch use case
> as well. [2] contains an overview of different techniques for scaling up
> inductive learning algorithms. A presentation of Spark's MLlib decision tree
> implementation can be found in [3].
> Resources:
> [1] [http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf]
> [2] [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.8226&rep=rep1&type=pdf]
> [3] [http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
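
For readers unfamiliar with the online histogram described in the pull request above, the following is a minimal Python sketch of the idea, written from the Ben-Haim and Tom-Tov paper and not taken from the PR itself. All names (`OnlineHistogram`, `add`, `sum_up_to`, `quantile`, `class_probabilities`) are hypothetical. Two simplifications are assumptions of this sketch: values outside the observed centroid range are clamped to 0 or the total count, and the quantile is found by bisection over `sum_up_to` instead of the paper's closed-form procedure.

```python
import bisect
from collections import Counter


class OnlineHistogram:
    """Fixed-size streaming histogram (sketch after Ben-Haim and Tom-Tov).

    Assumes max_bins >= 1. Names and structure are illustrative only.
    """

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def total(self):
        return sum(m for _, m in self.bins)

    def add(self, value):
        """Insert one value, merging the two closest bins if over capacity."""
        centroids = [p for p, _ in self.bins]
        i = bisect.bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1
            return
        self.bins.insert(i, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Replace the adjacent pair with the smallest gap by its weighted mean.
        gaps = [self.bins[j + 1][0] - self.bins[j][0]
                for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        (p1, m1), (p2, m2) = self.bins[j], self.bins[j + 1]
        self.bins[j:j + 2] = [[(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]]

    def sum_up_to(self, b):
        """Estimate how many inserted values are <= b."""
        if not self.bins or b < self.bins[0][0]:
            return 0.0  # simplification: nothing below the first centroid
        if b >= self.bins[-1][0]:
            return float(self.total())  # simplification: clamp at the top
        centroids = [p for p, _ in self.bins]
        i = bisect.bisect_right(centroids, b) - 1
        (pi, mi), (pj, mj) = self.bins[i], self.bins[i + 1]
        frac = (b - pi) / (pj - pi)
        mb = mi + (mj - mi) * frac      # interpolated height at b
        s = (mi + mb) / 2 * frac        # trapezoid area between pi and b
        return s + sum(m for _, m in self.bins[:i]) + mi / 2

    def quantile(self, q):
        """Value whose estimated cumulative share is q (needs >= 1 value)."""
        target = q * self.total()
        lo, hi = self.bins[0][0], self.bins[-1][0]
        for _ in range(60):  # bisection; sum_up_to is non-decreasing
            mid = (lo + hi) / 2
            if self.sum_up_to(mid) < target:
                lo = mid
            else:
                hi = mid
        return hi


def class_probabilities(labels):
    """Categorical case: empirical probability mass function per class."""
    counts = Counter(labels)
    n = sum(counts.values())
    return {c: k / n for c, k in counts.items()}
```

With `max_bins=3`, adding 1, 2, 10 and 11 merges the closest pair (1 and 2) into a single bin at 1.5 with count 2, which is how the histogram keeps a bounded memory footprint on a stream. The list-based bins make every operation linear in the number of bins; that keeps the sketch short and is not a claim about how the PR stores its data.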