[ https://issues.apache.org/jira/browse/FLINK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553977#comment-14553977 ]
ASF GitHub Bot commented on FLINK-2034:
---------------------------------------

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/688#discussion_r30786639

    --- Diff: docs/libs/ml/index.md ---
    @@ -20,8 +20,100 @@ specific language governing permissions and limitations
     under the License.
     -->
    +The Machine Learning (ML) library for Flink is a new effort to bring scalable ML tools to the Flink
    +community. Our goal is to design and implement a system that is scalable and can deal with
    +problems of various sizes, whether your data size is measured in megabytes or terabytes and beyond.
    +We call this library FlinkML.
    +
    +An important concern for developers of ML systems is the amount of glue code that developers are
    +forced to write [1] in the process of implementing an end-to-end ML system. Our goal with FlinkML
    +is to help developers keep glue code to a minimum. The Flink ecosystem provides a great setting to
    +tackle this problem, with its scalable ETL capabilities that can be easily combined inside the same
    +program with FlinkML, allowing the development of robust pipelines without the need to use yet
    +another technology for data ingestion and data munging.
    +
    +Another goal for FlinkML is to make the library easy to use. To that end we will be providing
    +detailed documentation along with examples for every part of the system. Our aim is that developers
    +will be able to get started with writing their ML pipelines quickly, using familiar programming
    +concepts and terminology.
    +
    +Contrary to other data-processing systems, Flink exploits in-memory data streaming, and natively
    +executes iterative processing algorithms which are common in ML. We plan to exploit the streaming
    +nature of Flink, and provide functionality designed specifically for data streams.
    +
    +FlinkML will allow data scientists to test their models locally and using subsets of data, and then
    +use the same code to run their algorithms at a much larger scale in a cluster setting.
    +
    +We are inspired by other open source efforts to provide ML systems, in particular
    +[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML pipelines, and Spark’s
    +[MLlib](https://spark.apache.org/mllib/) for providing ML algorithms that scale with problem and
    +cluster sizes.
    +
    +We already have some of the building blocks for FlinkML in place, and will continue to extend the
    +library with more algorithms. An example of how simple it is to create a learning model in
    +FlinkML is given below:
    +
    +{% highlight scala %}
    +// LabeledVector is a feature vector with a label (class or real value)
    +val data: DataSet[LabeledVector] = ...
    +
    +val learner = MultipleLinearRegression()
    +
    +val parameters = ParameterMap()
    +  .add(MultipleLinearRegression.Stepsize, 1.0)
    +  .add(MultipleLinearRegression.Iterations, 10)
    +  .add(MultipleLinearRegression.ConvergenceThreshold, 0.001)
    +
    +val model = learner.fit(data, parameters)
    +{% endhighlight %}
    +
    +The roadmap below can provide an indication of the algorithms we aim to implement in the coming
    +months. Items in **bold** have already been implemented:
    +
    +
    +* Pipelines of transformers and learners
    +* Data pre-processing
    +  * **Feature scaling**
    +  * **Polynomial feature base mapper**
    +  * Feature hashing
    +  * Feature extraction for text
    +  * Dimensionality reduction
    +* Model selection and performance evaluation
    +  * Cross-validation for model selection and evaluation
    +* Supervised learning
    +  * Optimization framework
    +    * **Stochastic Gradient Descent**
    +    * L-BFGS
    +  * Generalized Linear Models
    +    * **Multiple linear regression**
    +    * LASSO, Ridge regression
    +    * Multi-class Logistic regression
    +  * Random forests
    +  * **Support Vector Machines**
    +* Unsupervised learning
    +  * Clustering
    +    * K-means clustering
    +  * PCA
    +* Recommendation
    +  * **ALS**
    +* Text analytics
    +  * LDA
    +* Statistical estimation tools
    +* Distributed linear algebra
    +* Streaming ML
    --- End diff --

    That's a good idea, I'll add a link to the roadmap instead.

> Add vision and roadmap for ML library to docs
> ---------------------------------------------
>
> Key: FLINK-2034
> URL: https://issues.apache.org/jira/browse/FLINK-2034
> Project: Flink
> Issue Type: Improvement
> Components: Machine Learning Library
> Reporter: Theodore Vasiloudis
> Assignee: Theodore Vasiloudis
> Labels: ML
> Fix For: 0.9
>
>
> We should have a document describing the vision of the Machine Learning
> library in Flink and an up-to-date roadmap.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)