Shaoxuan Wang created FLINK-12470:
-------------------------------------

             Summary: FLIP39: Flink ML pipeline and ML libs
                 Key: FLINK-12470
                 URL: https://issues.apache.org/jira/browse/FLINK-12470
             Project: Flink
          Issue Type: New Feature
          Components: Library / Machine Learning
    Affects Versions: 1.9.0
            Reporter: Shaoxuan Wang
            Assignee: Shaoxuan Wang
             Fix For: 1.9.0


This is the umbrella Jira for FLIP39, which intends to enhance the
scalability and ease of use of Flink ML.

ML Discussion thread: 
[http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html]

Google Doc (to be converted to an official Confluence page soon):
[https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit]

In machine learning, there are mainly two types of people. The first type is
MLlib developers, who need a set of standard, well-abstracted core ML APIs to
implement algorithms. Every ML algorithm is a concrete implementation on top
of these APIs. The second type is MLlib users, who use an existing, packaged
MLlib to train or serve a model. It is common for an entire training or
inference job to be constructed as a sequence of transformations or
algorithms. It is therefore essential to provide a workflow/pipeline API so
that MLlib users can easily combine multiple algorithms to describe an ML
workflow/pipeline.

Flink currently has a set of core ML interfaces, but they are built on top of
the DataSet API. This does not align with the latest Flink
[roadmap|https://flink.apache.org/roadmap.html] (the Table API will become the
first-class citizen and primary API for analytics use cases, while the DataSet
API will be gradually deprecated). Moreover, Flink at present has no interface
that allows MLlib users to describe an ML workflow/pipeline, nor does it
provide any way to persist a pipeline or model and reuse it in the future. To
address these issues, this FLIP proposes to:
 * Provide a new set of core ML interfaces (on top of the Flink Table API)
 * Provide an ML pipeline interface (on top of the Flink Table API)
 * Provide interfaces for parameter management and pipeline persistence
 * All of the above interfaces should accommodate any new ML algorithm. We will
gradually add various standard ML algorithms on top of the newly proposed
interfaces to verify their feasibility and scalability.
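
To make the pipeline idea concrete, here is a minimal, self-contained sketch of how a sequence of stages could be chained and fitted. It is not the API proposed in the FLIP. Every name in it (Stage, Transformer, Estimator, Pipeline, appendStage, toTransformer, runDemo) is hypothetical, and the type parameter T merely stands in for a Flink Table so that the example runs without Flink on the classpath. Parameter management and persistence are not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineSketch {

    /** Hypothetical common type for pipeline stages; T stands in for a Flink Table. */
    interface Stage<T> {
        /** Returns a ready-to-use transformer, training on the given data if needed. */
        Transformer<T> toTransformer(T trainingData);
    }

    /** Hypothetical stage that maps one data set to another without training. */
    interface Transformer<T> extends Stage<T> {
        T transform(T input);

        @Override
        default Transformer<T> toTransformer(T trainingData) {
            return this; // nothing to train
        }
    }

    /** Hypothetical stage that must be trained; the fitted model is a Transformer. */
    interface Estimator<T> extends Stage<T> {
        Transformer<T> fit(T trainingData);

        @Override
        default Transformer<T> toTransformer(T trainingData) {
            return fit(trainingData);
        }
    }

    /** Hypothetical pipeline: an ordered list of stages that is itself an Estimator. */
    static final class Pipeline<T> implements Estimator<T> {
        private final List<Stage<T>> stages = new ArrayList<>();

        Pipeline<T> appendStage(Stage<T> stage) {
            stages.add(stage);
            return this;
        }

        @Override
        public Transformer<T> fit(T trainingData) {
            List<Transformer<T>> fitted = new ArrayList<>();
            T current = trainingData;
            for (Stage<T> stage : stages) {
                // Each stage is trained on the output of the stages before it.
                Transformer<T> ready = stage.toTransformer(current);
                fitted.add(ready);
                current = ready.transform(current);
            }
            // The fitted pipeline applies every fitted stage in order.
            return input -> {
                T out = input;
                for (Transformer<T> step : fitted) {
                    out = step.transform(out);
                }
                return out;
            };
        }
    }

    /** Trains a two-stage pipeline on [2, 4, 6] and applies it to scoringData. */
    public static List<Double> runDemo(List<Double> scoringData) {
        // Stage 1: a stateless transformer that doubles every value.
        Transformer<List<Double>> doubler = in -> {
            List<Double> out = new ArrayList<>();
            for (double v : in) {
                out.add(v * 2);
            }
            return out;
        };
        // Stage 2: an estimator that learns min and max, then rescales to [0, 1].
        Estimator<List<Double>> minMaxScaler = data -> {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double v : data) {
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
            final double lo = min;
            final double range = max - min;
            return in -> {
                List<Double> out = new ArrayList<>();
                for (double v : in) {
                    out.add((v - lo) / range);
                }
                return out;
            };
        };
        Pipeline<List<Double>> pipeline = new Pipeline<List<Double>>()
                .appendStage(doubler)
                .appendStage(minMaxScaler);
        Transformer<List<Double>> model = pipeline.fit(Arrays.asList(2.0, 4.0, 6.0));
        return model.transform(scoringData);
    }

    public static void main(String[] args) {
        System.out.println(runDemo(Arrays.asList(2.0, 4.0, 6.0))); // prints [0.0, 0.5, 1.0]
    }
}
```

In this sketch the pipeline is itself an estimator, so a fitted pipeline can be used like any single fitted model. Whether the actual interfaces adopt that design is a matter for the FLIP discussion.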


