[ 
https://issues.apache.org/jira/browse/FLINK-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhipeng Zhang updated FLINK-27826:
----------------------------------
    Description: 
FlinkML has limited support for training high-dimensional machine learning 
models, although this is often useful, especially in industrial cases. When the 
model parameters cannot be held in the memory of a single machine, FlinkML 
currently crashes.

So it would be nice to support high-dimensional model training in FlinkML. 
To achieve this, we probably need to do the following:
 # Survey how existing machine learning systems train large models 
(e.g. data parallel, model parallel).
 # Define and implement the infrastructure for large model training in FlinkML.
 # Implement a logistic regression model that can be trained with more than 
ten billion parameters.
 # Benchmark the implementation and further improve it.

  was:
FlinkML has limited support for training high-dimensional machine learning 
models, although this is often useful, especially in industrial cases. When the 
model parameters cannot be held in the memory of a single machine, FlinkML 
currently crashes.

So it would be nice to support high-dimensional model training in FlinkML. 
To achieve this, we probably need to do the following:
 # Survey how existing machine learning systems train large models 
(e.g. data parallel, model parallel)
 # Define and implement the infrastructure for large model training in FlinkML
 # Implement a logistic regression model that can be trained with more than 
ten billion parameters
 # Benchmark the implementation and further improve it


> Support machine learning training for very high dimensional models
> ------------------------------------------------------------------
>
>                 Key: FLINK-27826
>                 URL: https://issues.apache.org/jira/browse/FLINK-27826
>             Project: Flink
>          Issue Type: New Feature
>          Components: Library / Machine Learning
>            Reporter: Zhipeng Zhang
>            Assignee: Zhipeng Zhang
>            Priority: Major
>
> FlinkML has limited support for training high-dimensional machine learning 
> models, although this is often useful, especially in industrial cases. When 
> the model parameters cannot be held in the memory of a single machine, 
> FlinkML currently crashes.
> So it would be nice to support high-dimensional model training in FlinkML. 
> To achieve this, we probably need to do the following:
>  # Survey how existing machine learning systems train large models 
> (e.g. data parallel, model parallel).
>  # Define and implement the infrastructure for large model training in FlinkML.
>  # Implement a logistic regression model that can be trained with more than 
> ten billion parameters.
>  # Benchmark the implementation and further improve it.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
