[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581615#comment-14581615 ]
ASF GitHub Bot commented on FLINK-2072:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197169

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems one the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in you project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set , which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
    --- End diff --

    Missing closing parenthesis of the link.

> Add a quickstart guide for FlinkML
> ----------------------------------
>
>                 Key: FLINK-2072
>                 URL: https://issues.apache.org/jira/browse/FLINK-2072
>             Project: Flink
>          Issue Type: New Feature
>          Components: Documentation, Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>             Fix For: 0.9
>
>
> We need a quickstart guide that introduces users to the core concepts of
> FlinkML to get them up and running quickly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)