[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576767#comment-14576767 ]
ASF GitHub Bot commented on FLINK-2072:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896308

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    --- End diff --

    Isn't the TODO fixed?


> Add a quickstart guide for FlinkML
> ----------------------------------
>
>                 Key: FLINK-2072
>                 URL: https://issues.apache.org/jira/browse/FLINK-2072
>             Project: Flink
>          Issue Type: New Feature
>          Components: Documentation, Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>             Fix For: 0.9
>
>
> We need a quickstart guide that introduces users to the core concepts of
> FlinkML to get them up and running quickly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)