Github user thvasilo commented on the issue: https://github.com/apache/flink/pull/2735

Thank you for your contribution, Kalman! I just took a brief look; this is a big PR, so it will probably take some time to review. For now, a few things that come to mind:

* We'll need to add docs for the algorithm, which should be example-heavy. [Here's](https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/standard_scaler.html) a simple example for another pre-processing algorithm. I see you already have extensive ScalaDocs; we could probably replicate those in the docs.
* Have you tested it on a relatively large-scale dataset, ideally in a distributed setting where the input files are on HDFS? That is how we test the scalability of the implementation, and it is where problems usually arise.
* Have you compared the output with a reference implementation? My knowledge of word2vec is not very deep, but as far as I understand the output is non-deterministic, so we would need some sort of proxy to evaluate the correctness of the implementation (see the second sketch below for one idea).
* Finally, I see this introduces a new nlp package. I'm not sure how to treat this (and related algorithms, say TF-IDF), as they are not necessarily NLP-specific: even though they stem from that field, you could treat any sequence of objects as a "sentence" and embed them. I would favor including them as pre-processing steps, and hence inheriting from the `Transformer` interface, perhaps in a feature pre-processing package (a rough sketch follows this comment).

Regards,
Theodore
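P.S. To make the `Transformer` point concrete, here is a minimal sketch of what such a pre-processing step could look like. To be clear, everything named here is hypothetical: the `SequenceEmbedder` class, its parameter, and the zero-vector placeholder where the actual skip-gram/CBOW training would go. Only the `Transformer`/`FitOperation`/`TransformDataSetOperation` wiring follows the pattern of existing FlinkML pre-processors such as `StandardScaler`.

```scala
package org.apache.flink.ml.preprocessing

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.{Parameter, ParameterMap}
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.pipeline.{FitOperation, TransformDataSetOperation, Transformer}

/** Hypothetical embedding transformer. It learns a vector for every item in a
  * sequence, so nothing here is specific to natural-language "words". */
class SequenceEmbedder extends Transformer[SequenceEmbedder] {

  // Learned item -> vector table, populated by the FitOperation below.
  private[preprocessing] var vectorsOption: Option[DataSet[(String, DenseVector)]] = None

  def setVectorSize(size: Int): SequenceEmbedder = {
    parameters.add(SequenceEmbedder.VectorSize, size)
    this
  }
}

object SequenceEmbedder {

  case object VectorSize extends Parameter[Int] {
    override val defaultValue: Option[Int] = Some(100)
  }

  /** Learns the embedding table from sequences of tokens. */
  implicit val fitSequenceEmbedder: FitOperation[SequenceEmbedder, Seq[String]] =
    new FitOperation[SequenceEmbedder, Seq[String]] {
      override def fit(
          instance: SequenceEmbedder,
          fitParameters: ParameterMap,
          input: DataSet[Seq[String]]): Unit = {
        val resultingParameters = instance.parameters ++ fitParameters
        val size = resultingParameters(VectorSize)

        // Placeholder: a real implementation would run skip-gram/CBOW training
        // here. For the sketch, every distinct token gets a zero vector.
        instance.vectorsOption = Some(
          input
            .flatMap(tokens => tokens)
            .distinct()
            .map(token => (token, DenseVector.zeros(size))))
      }
    }

  /** Replaces each (id, token) pair with (id, vector), dropping unseen tokens. */
  implicit val transformTokens: TransformDataSetOperation[SequenceEmbedder, (Int, String), (Int, DenseVector)] =
    new TransformDataSetOperation[SequenceEmbedder, (Int, String), (Int, DenseVector)] {
      override def transformDataSet(
          instance: SequenceEmbedder,
          transformParameters: ParameterMap,
          input: DataSet[(Int, String)]): DataSet[(Int, DenseVector)] = {
        instance.vectorsOption match {
          case Some(vectors) =>
            input.join(vectors).where(1).equalTo(0) {
              (idToken, tokenVector) => (idToken._1, tokenVector._2)
            }
          case None =>
            throw new RuntimeException("The SequenceEmbedder has not been fitted to data.")
        }
      }
    }
}
```

With those implicits in scope, the standard pipeline calls should just work: `embedder.fit(sentences)` followed by `embedder.transform(indexedTokens)`, and the stage could be chained in front of a predictor the same way `StandardScaler` is today.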
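On the correctness proxy, one option (only a suggestion, nothing the PR needs to provide as-is) would be to compare neighborhoods rather than raw vectors: two healthy word2vec runs disagree on coordinates but should broadly agree on which items are close to each other. The helper names below are made up; the inputs are just token-to-vector maps extracted from the Flink output and from a reference implementation such as the original C tool or Gensim.

```scala
object EmbeddingSanityCheck {

  /** Cosine similarity between two dense vectors. */
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = (a, b).zipped.map(_ * _).sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  /** The k nearest neighbors of `word` under cosine similarity. */
  def topK(vectors: Map[String, Array[Double]], word: String, k: Int): Set[String] =
    vectors
      .collect { case (w, v) if w != word => (w, cosine(vectors(word), v)) }
      .toSeq
      .sortBy(-_._2)
      .take(k)
      .map(_._1)
      .toSet

  /** Average fraction of shared top-k neighbors over a set of probe words.
    * Values close to 1.0 mean the two embeddings largely agree. */
  def neighborOverlap(
      flinkVectors: Map[String, Array[Double]],
      referenceVectors: Map[String, Array[Double]],
      probes: Seq[String],
      k: Int = 10): Double = {
    val overlaps = probes.map { w =>
      (topK(flinkVectors, w, k) intersect topK(referenceVectors, w, k)).size.toDouble / k
    }
    overlaps.sum / overlaps.size
  }
}
```

That sidesteps the non-determinism entirely; an analogy test set (the classic "king - man + woman ≈ queen" questions) would be another proxy if we want something closer to the original paper's evaluation.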