Hi Max,

Thanks for the question and for sharing your findings. To be honest, I was not aware of some of these projects until I saw your list.
First, to answer your questions:

> (i) Has anyone used them?

While I am not sure about the number of users of every listed project, Alink is definitely used by Alibaba. In fact, the Alink team is trying to contribute the code to the Flink repo so that it becomes the new FlinkML library. Besides, I would like to add flink-ai-extended (https://github.com/alibaba/flink-ai-extended) to the list. This project allows you to run TensorFlow / PyTorch on top of Flink. It is actively used and maintained by Alibaba as well.

> (ii) More specifically, has someone implemented *Stochastic Gradient
> Descent, Skip-gram models, Autoencoders* with any of the above tools (or
> other)?

I think Alink has SGD there, but I did not find skip-gram / autoencoder implementations.

Some more comments / replies below:

> I assume it is more efficient to do all the training in Flink (somehow)
> rather than (re)training a model in Tensorflow (or Pytorch) and porting it
> to a flink Job. For instance,
> https://stackoverflow.com/questions/59563265/embedd-existing-ml-model-in-apache-flink
> Especially, in streaming ML systems the training and the serving should
> both happen in an online fashion.

I guess it depends on what exactly you want to do. If you are running a training job for hours, with many rounds of iterations until the model converges, training it separately and then porting it to Flink for inference might not lose too much efficiency (a minimal sketch of that serving pattern is at the very end of this mail). However, if you are doing online learning to incrementally update your model as the samples flow by, having such incremental training embedded into Flink makes a lot of sense. Flink-ai-extended was created to support both cases, but it is definitely more attractive in the incremental training case.
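To make the incremental case a bit more concrete: below is a minimal, hand-rolled sketch of "one SGD step per incoming sample" with the plain DataStream API and keyed state. This is not Alink or flink-ai-extended code; the LabeledPoint POJO, the learning rate and the single constant key are just assumptions for illustration.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

/**
 * Sketch of incremental training embedded in Flink: one SGD step per incoming
 * sample for a plain linear model, with the weights kept in keyed state.
 * Must be applied on a keyed stream, e.g.
 *   samples.keyBy(s -> 0).flatMap(new OnlineSgdRegressor(0.01));
 */
public class OnlineSgdRegressor
        extends RichFlatMapFunction<OnlineSgdRegressor.LabeledPoint, double[]> {

    /** Hypothetical sample type: a feature vector plus its label. */
    public static class LabeledPoint {
        public double[] features;
        public double label;
    }

    private final double learningRate;
    private transient ValueState<double[]> weights;

    public OnlineSgdRegressor(double learningRate) {
        this.learningRate = learningRate;
    }

    @Override
    public void open(Configuration parameters) {
        weights = getRuntimeContext().getState(
                new ValueStateDescriptor<>("weights", double[].class));
    }

    @Override
    public void flatMap(LabeledPoint sample, Collector<double[]> out) throws Exception {
        double[] w = weights.value();
        if (w == null) {
            w = new double[sample.features.length]; // lazily initialize on the first sample
        }
        // Score the sample with the current model ...
        double prediction = 0.0;
        for (int i = 0; i < w.length; i++) {
            prediction += w[i] * sample.features[i];
        }
        // ... then take one gradient step on the squared error of this sample.
        double error = prediction - sample.label;
        for (int i = 0; i < w.length; i++) {
            w[i] -= learningRate * error * sample.features[i];
        }
        weights.update(w);
        out.collect(w); // emit the updated weights, e.g. towards a serving job
    }
}

Keying by a constant as in the comment above means a single parallel instance holds and updates the whole model; distributing the model itself during training is exactly what a parameter server (such as the flink-parameter-server project in your list) is for.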
> 1) *FlinkML (DataSet API)*
> https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/index.html
> This is not for streaming ML as it sits on top of the DataSet API. In addition,
> recently the library was dropped
> https://stackoverflow.com/questions/58752787/what-is-the-status-of-flinkml
> but there is ongoing development (??) of a new library on top of the Table API.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
> https://issues.apache.org/jira/browse/FLINK-12470
> which is not in the 1.10 distribution.

We removed the DataSet-based FlinkML library because at that point it looked like it had no users, and removing it allows us to use cleaner package paths. That said, personally I agree that it would have been better to mark the library as deprecated first and remove it from the code base in a later release.

It looks like you are looking for an ML algorithm library; I am not sure whether you are also interested in the ML engineering part. We have an ongoing project called Flink AI Flow which allows you to define an end-to-end online learning workflow, with datasets, models and metrics managed. I gave a talk about it at the recent Flink Forward virtual event; the videos should be available soon. Feel free to reach out to me for more details.

Thanks,

Jiangjie (Becket) Qin

On Wed, Apr 29, 2020 at 1:12 AM Timo Walther <twal...@apache.org> wrote:

> Hi Max,
>
> as far as I know a better ML story for Flink is in the making. I will
> loop in Becket in CC who may give you more information.
>
> Regards,
> Timo
>
> On 28.04.20 07:20, m@xi wrote:
> > Hello Flinkers,
> >
> > I am building a *streaming* prototype system on top of Flink and I want
> > ideally to enable ML training (if possible DL) in Flink. It would be
> > nice to lay down all the existing libraries that provide primitives that
> > enable the training of ML models.
> >
> > I assume it is more efficient to do all the training in Flink (somehow)
> > rather than (re)training a model in Tensorflow (or Pytorch) and porting it
> > to a flink Job. For instance,
> > https://stackoverflow.com/questions/59563265/embedd-existing-ml-model-in-apache-flink
> > Especially, in streaming ML systems the training and the serving should
> > both happen in an online fashion.
> >
> > To initialize the pool, I have found the following options that run on top
> > of Flink, i.e., leveraging the engine for distributed and scalable ML
> > training.
> >
> > 1) *FlinkML (DataSet API)*
> > https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/index.html
> > This is not for streaming ML as it sits on top of the DataSet API. In
> > addition, recently the library was dropped
> > https://stackoverflow.com/questions/58752787/what-is-the-status-of-flinkml
> > but there is ongoing development (??) of a new library on top of the Table API.
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
> > https://issues.apache.org/jira/browse/FLINK-12470
> > which is not in the 1.10 distribution.
> >
> > 2) *Apache Mahout* https://mahout.apache.org/
> > I thought it was long dead, but recently they started developing it again.
> >
> > 3) *Apache SAMOA* https://samoa.incubator.apache.org/
> > They are developing it, but slowly. It has been an incubator project since 2013.
> >
> > 4) *FlinkML Organization* https://github.com/FlinkML
> > This one has repos that are interesting, e.g. flink-jpmml
> > https://github.com/FlinkML/flink-jpmml
> > and an implementation of a parameter server
> > https://github.com/FlinkML/flink-parameter-server
> > which is really useful for enabling distributed training, in the sense
> > that the model itself is also distributed during training.
> > Though, the repo(s) are not really active.
> >
> > 5) *DeepLearning4j* https://deeplearning4j.org/
> > This is a distributed deep learning library that was said to also work
> > on top of Flink (here
> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-support-for-DeepLearning4j-or-other-deep-learning-library-td12157.html)
> > I am not interested at all in GPU support, but I am wondering if anyone
> > has successfully used this one on top of Flink.
> >
> > 6) *Proteus - SOLMA* https://github.com/proteus-h2020/proteus-solma
> > It is a scalable online learning library on top of Flink, and is the
> > output of an H2020 research project called PROTEUS.
> > http://www.bdva.eu/sites/default/files/hbouchachia_sacbd-ecsa18.pdf
> >
> > 7) *Alibaba - ALink*
> > https://github.com/alibaba/Alink/blob/master/README.en-US.md
> > A machine learning algorithm platform from Alibaba which is actively
> > maintained.
> >
> > These are all the systems I have found that do ML using Flink's engine.
> >
> > *Questions*
> > (i) Has anyone used them?
> > (ii) More specifically, has someone implemented *Stochastic Gradient
> > Descent, Skip-gram models, Autoencoders* with any of the above tools (or
> > other)?
> >
> > *Remarks*
> > If you have any experiences/comments/additions to share, please do! Gotta
> > Catch 'Em All! <https://www.youtube.com/watch?v=MpaHR-V_R-o>
> >
> > Best,
> > Max
> >
> >
> > --
> > Sent from:
> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
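PS: since the StackOverflow question about embedding an existing model comes up regularly, here is the rough shape of the "train elsewhere, serve in Flink" pattern I referred to above. It is only a sketch: Scorer and loadFrom are placeholders for whatever Java bindings your training framework provides (a TensorFlow SavedModel loader, PMML via flink-jpmml, etc.), not a real API.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

/**
 * Sketch of serving a model trained outside of Flink: the exported model is
 * loaded once per parallel task in open() and reused for every record.
 */
public class ModelServingMap extends RichMapFunction<double[], Double> {

    private final String modelPath;
    private transient Scorer scorer;

    public ModelServingMap(String modelPath) {
        this.modelPath = modelPath;
    }

    @Override
    public void open(Configuration parameters) {
        // Expensive one-time setup per task, not per record.
        scorer = Scorer.loadFrom(modelPath);
    }

    @Override
    public Double map(double[] features) {
        return scorer.score(features);
    }

    /** Placeholder abstraction over the real model runtime. */
    public interface Scorer {
        static Scorer loadFrom(String path) {
            // Wire in the real loader here (TensorFlow, PMML, ONNX, ...).
            throw new UnsupportedOperationException("no model runtime wired in");
        }
        double score(double[] features);
    }
}

The point is simply that the expensive model loading happens once per parallel task in open(), while map() only does per-record scoring; the exported model file just needs to be reachable from every task manager.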