[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

thvasilo Thu, 11 Jun 2015 01:04:42 -0700

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197679
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straightforward 
process, abstracting away
    +the complexities that usually come with having to deal with big data 
learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
    +using FlinkML. But first, some basics: feel free to skip the next few lines 
if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1], ML deals with detecting patterns in data, and 
using those
    +learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
    +(features) to a set of outputs. The learning is done using a *training 
set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
    +further divided into classification and regression problems. In 
classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user 
is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting 
(real) numerical
    +values, often called the dependent variable, for example what the 
temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the 
data from the
    +descriptive features. Unsupervised learning can also be used for 
dimensionality reduction, for example
    +through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
    +member which represents the label, which could be the class in a 
classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set, which you can
    +[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
    +This dataset *"contains cases from study conducted on the survival of 
patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, 
where the first 3 columns
    +are the features and the last (4th) column is the class, which 
indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.api.scala.ExecutionEnvironment
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    --- End diff --
    
    Good idea, I will use that instead.


