[jira] [Commented] (FLINK-2034) Add vision and roadmap for ML library to docs

ASF GitHub Bot (JIRA) Thu, 21 May 2015 02:39:14 -0700

    [ https://issues.apache.org/jira/browse/FLINK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553977#comment-14553977 ]


ASF GitHub Bot commented on FLINK-2034:
---------------------------------------

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/688#discussion_r30786639
  
    --- Diff: docs/libs/ml/index.md ---
    @@ -20,8 +20,100 @@ specific language governing permissions and limitations
     under the License.
     -->
     
    +The Machine Learning (ML) library for Flink is a new effort to bring scalable ML tools to the Flink
    +community. Our goal is to design and implement a system that is scalable and can deal with
    +problems of various sizes, whether your data size is measured in megabytes or terabytes and beyond.
    +We call this library FlinkML.
    +
    +An important concern for developers of ML systems is the amount of glue code that they are
    +forced to write [1] in the process of implementing an end-to-end ML system. Our goal with FlinkML
    +is to help developers keep glue code to a minimum. The Flink ecosystem provides a great setting to
    +tackle this problem, with its scalable ETL capabilities that can be easily combined inside the same
    +program with FlinkML, allowing the development of robust pipelines without the need to use yet
    +another technology for data ingestion and data munging.
    +
    +Another goal for FlinkML is to make the library easy to use. To that end, we will provide
    +detailed documentation along with examples for every part of the system. Our aim is that developers
    +will be able to get started with writing their ML pipelines quickly, using familiar programming
    +concepts and terminology.
    +
    +Contrary to other data-processing systems, Flink exploits in-memory data streaming, and natively
    +executes iterative processing algorithms which are common in ML. We plan to exploit the streaming
    +nature of Flink, and provide functionality designed specifically for data streams.
    +
    +FlinkML will allow data scientists to test their models locally, using subsets of data, and then
    +use the same code to run their algorithms at a much larger scale in a cluster setting.
    +
    +We are inspired by other open source efforts to provide ML systems, in particular
    +[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML pipelines, and Spark’s
    +[MLlib](https://spark.apache.org/mllib/) for providing ML algorithms that scale with problem and
    +cluster sizes.
    +
    +We already have some of the building blocks for FlinkML in place, and will continue to extend the
    +library with more algorithms. An example of how simple it is to create a learning model in
    +FlinkML is given below:
    +
    +{% highlight scala %}
    +// LabeledVector is a feature vector with a label (class or real value)
    +val data: DataSet[LabeledVector] = ...
    +
    +val learner = MultipleLinearRegression()
    +
    +val parameters = ParameterMap()
    +  .add(MultipleLinearRegression.Stepsize, 1.0)
    +  .add(MultipleLinearRegression.Iterations, 10)
    +  .add(MultipleLinearRegression.ConvergenceThreshold, 0.001)
    +
    +val model = learner.fit(data, parameters)
    +{% endhighlight %}
    +
    +The roadmap below indicates the algorithms we aim to implement in the coming
    +months. Items in **bold** have already been implemented:
    +
    +
    +* Pipelines of transformers and learners
    +* Data pre-processing
    +  * **Feature scaling**
    +  * **Polynomial feature base mapper**
    +  * Feature hashing
    +  * Feature extraction for text
    +  * Dimensionality reduction
    +* Model selection and performance evaluation
    +  * Cross-validation for model selection and evaluation
    +* Supervised learning
    +  * Optimization framework
    +    * **Stochastic Gradient Descent**
    +    * L-BFGS
    +  * Generalized Linear Models
    +    * **Multiple linear regression**
    +    * LASSO, Ridge regression
    +    * Multi-class Logistic regression
    +  * Random forests
    +  * **Support Vector Machines**
    +* Unsupervised learning
    +  * Clustering
    +    * K-means clustering
    +  * PCA
    +* Recommendation
    +  * **ALS**
    +* Text analytics
    +  * LDA
    +* Statistical estimation tools
    +* Distributed linear algebra
    +* Streaming ML
    --- End diff --
    
    That's a good idea, I'll add a link to the roadmap instead.


> Add vision and roadmap for ML library to docs
> ---------------------------------------------
>
>                 Key: FLINK-2034
>                 URL: https://issues.apache.org/jira/browse/FLINK-2034
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>              Labels: ML
>             Fix For: 0.9
>
>
> We should have a document describing the vision of the Machine Learning 
> library in Flink and an up-to-date roadmap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
