[ 
https://issues.apache.org/jira/browse/FLINK-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633492#comment-15633492
 ] 

ASF GitHub Bot commented on FLINK-2094:
---------------------------------------

Github user thvasilo commented on the issue:

    https://github.com/apache/flink/pull/2735
  
    Thank you for your contribution Kalman!
    
    I just took a brief look; this is a big PR, so it will probably take some 
time to review.
    
    For now, a few things that come to mind: 
    
    * We'll need to add docs for the algorithm, which should be example-heavy. 
[Here's](https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/standard_scaler.html)
 a simple example for another pre-processing algorithm. I see you already have 
extensive ScalaDocs; we could probably replicate those in the docs.
    * Have you tested it on a relatively large-scale dataset? Ideally in a 
distributed setting where the input files are on HDFS. This way we test the 
scalability of the implementation, which is where problems usually arise.
    * Have you compared the output with a reference implementation? My 
knowledge of word2vec is not very deep, but as far as I understand the output 
is non-deterministic, so we would need some sort of proxy to evaluate the 
overall correctness of the implementation.
    * Finally, I see this introduces a new nlp package. I'm not sure how to 
treat this (and related algorithms, say TF-IDF), as they are not necessarily 
NLP-specific: even though they stem from that field, you could treat any 
sequence of objects as a "sentence" and embed it. I would favor including them 
as pre-processing steps, and hence inheriting from the `Transformer` interface, 
perhaps by having a feature pre-processing package.
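
    To make the "proxy" point above a bit more concrete: one possible check 
(only a sketch, not something the PR or FlinkML provides) is to compare 
nearest-neighbor sets between two embedding tables instead of comparing raw 
vectors, since two runs can differ by a rotation and still be equally good. 
The function names and the dict-of-lists input format below are assumptions 
made for illustration, and Python is used only to keep the sketch short.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbors(embeddings, word, k):
    """Return the set of the k words most similar to `word` (ties broken by name)."""
    scored = [
        (cosine(embeddings[word], vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {other for _, other in scored[:k]}


def neighbor_overlap(emb_a, emb_b, k=10):
    """Mean Jaccard overlap of top-k neighbor sets over the shared vocabulary.

    Both arguments are hypothetical {word: [float, ...]} tables, e.g. one
    exported from the implementation under test and one from a reference run.
    """
    shared = sorted(set(emb_a) & set(emb_b))
    if not shared:
        raise ValueError("the two embedding tables share no words")
    a = {w: emb_a[w] for w in shared}
    b = {w: emb_b[w] for w in shared}
    total = 0.0
    for word in shared:
        near_a = top_k_neighbors(a, word, k)
        near_b = top_k_neighbors(b, word, k)
        union = near_a | near_b
        total += len(near_a & near_b) / len(union) if union else 1.0
    return total / len(shared)
```

    A score near 1 would suggest the two models agree on which words are 
close to each other; what counts as "close enough" is a judgment call and 
would need to be calibrated, for example against two runs of the reference 
implementation itself.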
    
    Regards,
    Theodore


> Implement Word2Vec
> ------------------
>
>                 Key: FLINK-2094
>                 URL: https://issues.apache.org/jira/browse/FLINK-2094
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Nikolaas Steenbergen
>            Assignee: Nikolaas Steenbergen
>            Priority: Minor
>              Labels: ML
>
> implement Word2Vec
> http://arxiv.org/pdf/1402.3722v1.pdf



