[ https://issues.apache.org/jira/browse/FLINK-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633492#comment-15633492 ]
ASF GitHub Bot commented on FLINK-2094:
---------------------------------------

Github user thvasilo commented on the issue:

    https://github.com/apache/flink/pull/2735

Thank you for your contribution, Kalman!

I just took a brief look; this is a big PR, so it will probably take some time to review.

For now, a few things that come to mind:

* We'll need to add docs for the algorithm, which should be example-heavy. [Here's](https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/standard_scaler.html) a simple example for another pre-processing algorithm. I see you already have extensive ScalaDocs; we could probably replicate those in the docs.
* Have you tested it on a relatively large-scale dataset? Ideally in a distributed setting where the input files are on HDFS. This way we test the scalability of the implementation, which is where problems usually arise.
* Have you compared the output with a reference implementation? My knowledge of word2vec is not very deep, but as far as I understand the output is non-deterministic, so we would need some sort of proxy to evaluate the overall correctness of the implementation.
* Finally, I see this introduces a new nlp package. I'm not sure how to treat this (and related algorithms, say TF-IDF), as they are not necessarily NLP-specific: even though they stem from that field, you could treat any sequence of objects as a "sentence" and embed them. I would favor including them as pre-processing steps, and hence inheriting from the `Transformer` interface, perhaps by having a feature pre-processing package.

Regards,
Theodore

> Implement Word2Vec
> ------------------
>
>                 Key: FLINK-2094
>                 URL: https://issues.apache.org/jira/browse/FLINK-2094
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Nikolaas Steenbergen
>            Assignee: Nikolaas Steenbergen
>            Priority: Minor
>              Labels: ML
>
> implement Word2Vec
> http://arxiv.org/pdf/1402.3722v1.pdf

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)