[ https://issues.apache.org/jira/browse/FLINK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann updated FLINK-1537:
---------------------------------
    Comment: was deleted

(was: Implementing a decision tree algorithm first is definitely the right way to go. If you implemented it, it would be an awesome contribution to Flink. And I think it's the best way to get used to Flink's API. Thus, it's a win-win situation :-)

Look at the recently opened [machine learning PR|https://github.com/apache/flink/pull/479], which loosely defines interfaces for {{Learner}} and {{Transformer}}. A {{Learner}} is an algorithm which takes a {{DataSet[A]}} and fits a model to this data. In the case of a decision tree, the input data would be labeled vectors and the output would be the learned tree. A {{Transformer}} simply takes a {{DataSet[A]}} and transforms it into a {{DataSet[B]}}. A feature extractor or data whitening would be examples of that. {{Transformer}}s can be chained arbitrarily as long as their types match. A {{Learner}} terminates a transformer pipeline. If you stuck to this model with your implementation, then one could prepend any {{Transformer}} to the decision tree learner. This makes creating a data analysis pipeline really easy. If I can help you with the implementation, let me know.

A deep learning framework is also really intriguing, but at the same time highly ambitious. So far, we haven't made an effort to implement deep learning algorithms with Flink. I know that there is the [H2O project|https://github.com/h2oai/h2o-dev], which does distributed deep learning. However, their underlying data model is different from ours. If I'm not mistaken, they store the data column-wise whereas we store it row-wise. I don't know what difference this makes. The first step would probably be to evaluate Flink's potential for deep learning and then to come up with a prototype.)
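
The {{Learner}}/{{Transformer}} chaining described in the comment above could look roughly like the following. This is only a minimal sketch and not the API from the linked PR: a plain {{java.util.List}} stands in for Flink's {{DataSet}}, and the names {{PipelineSketch}}, {{chainTransformer}} and {{chainLearner}} are hypothetical, chosen purely for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {

    /** Assumed shape of a Transformer: maps a data set of A to a data set of B. */
    interface Transformer<A, B> {
        List<B> transform(List<A> input);

        /** Chains another transformer; only compiles if the element types match. */
        default <C> Transformer<A, C> chainTransformer(Transformer<B, C> next) {
            return input -> next.transform(transform(input));
        }

        /** Terminates the pipeline with a learner, yielding a learner over the raw input. */
        default <M> Learner<A, M> chainLearner(Learner<B, M> learner) {
            return input -> learner.fit(transform(input));
        }
    }

    /** Assumed shape of a Learner: fits a model of type M to a data set of A. */
    interface Learner<A, M> {
        M fit(List<A> input);
    }

    public static void main(String[] args) {
        // Transformer: parse raw strings into numbers (a toy feature extractor).
        Transformer<String, Double> parse =
            in -> in.stream().map(Double::parseDouble).collect(Collectors.toList());

        // Transformer: centre the values around their mean (a crude stand-in for whitening).
        Transformer<Double, Double> center = in -> {
            double mean = in.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            return in.stream().map(x -> x - mean).collect(Collectors.toList());
        };

        // Learner: the "model" is simply the largest absolute value in the data.
        Learner<Double, Double> maxAbs =
            in -> in.stream().mapToDouble(x -> Math.abs(x)).max().orElse(0.0);

        // Two transformers are chained, and the learner terminates the pipeline.
        Learner<String, Double> pipeline = parse.chainTransformer(center).chainLearner(maxAbs);
        System.out.println(pipeline.fit(Arrays.asList("1.0", "2.0", "6.0")));
    }
}
```

In this sketch a decision tree learner would take the place of {{maxAbs}}, and any type-compatible transformer could be prepended to it without changing the learner itself.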

> GSoC project: Machine learning with Apache Flink
> ------------------------------------------------
>
>                 Key: FLINK-1537
>                 URL: https://issues.apache.org/jira/browse/FLINK-1537
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Till Rohrmann
>            Priority: Minor
>              Labels: gsoc2015, java, machine_learning, scala
>
> Currently, the Flink community is setting up the infrastructure for a machine
> learning library for Flink. The goal is to provide a set of highly optimized
> ML algorithms and to offer a high-level linear algebra abstraction to easily
> do data pre- and post-processing. By defining a set of commonly used data
> structures on which the algorithms work, it will be possible to define complex
> processing pipelines.
> The Mahout DSL is a good fit as the linear algebra language in Flink. It has
> to be evaluated what means must be provided to allow an easy transition
> between the high-level abstraction and the optimized algorithms.
> The machine learning library offers multiple starting points for a GSoC
> project. Amongst others, the following projects are conceivable.
> * Extension of Flink's machine learning library by additional ML algorithms
> ** Stochastic gradient descent
> ** Distributed dual coordinate ascent
> ** SVM
> ** Gaussian mixture EM
> ** DecisionTrees
> ** ...
> * Integration of Flink with the Mahout DSL to support a high-level linear
> algebra abstraction
> * Integration of H2O with Flink to benefit from H2O's sophisticated machine
> learning algorithms
> * Implementation of a parameter-server-like distributed global state storage
> facility for Flink. This also includes the extension of Flink to support
> asynchronous iterations and update messages.
> Own ideas for a possible contribution in the field of the machine learning
> library are highly welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)