[ 
https://issues.apache.org/jira/browse/FLINK-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634747#comment-15634747
 ] 

ASF GitHub Bot commented on FLINK-2094:
---------------------------------------

Github user kalmanchapman commented on the issue:

    https://github.com/apache/flink/pull/2735
  
    Hey Theodore,
    Thanks for taking a look at my PR!
    
    - I'll add docs shortly, per the examples you posted.
    - I've tested against datasets hundreds of megabytes in size (using the 
preprocessed Wikipedia articles available 
[here](http://mattmahoney.net/dc/textdata)) in a distributed, HDFS-backed 
environment. The implementation worked well as the scale of the data increased, 
although I ran into some frustrating memory issues as I increased the 
number of iterations performed.
    - I can show that the generated vectors give good results along the lines 
of the original paper: semantic similarity tracks cosine similarity, and 
difference vectors can be used to create 'analogy' relationships that make 
sense. But you're right that it's non-deterministic, and surveying how it's 
tested in other libraries was inconclusive. I've included some toy datasets in 
the integration tests that show good results and exercise these qualities.
    - I know what you mean about the new package. I included it because the 
feature request was specifically for Word2Vec. But, similar to your 
suggestion, the class in the nlp package is really just a wrapper around a 
generic embedding algorithm that can operate on any data that is 
word-in-sentence-like. The ContextEmbedder class, in the optimization package, 
is where the actual embedding occurs.
    That said, optimization might not be the right home either (although we are 
optimizing toward some minimum).
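
    For reference, the two qualities mentioned above (cosine similarity between 
related words, and analogies built from difference vectors) can be sketched 
roughly as follows. This is a minimal illustration in plain Python, not code 
or output from the PR: the function names and the hand-made toy vectors are 
hypothetical and only chosen so the relationships are easy to see.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; treat it as 0 here.
        return 0.0
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' using difference vectors.

    Builds vec(b) - vec(a) + vec(c) and returns the remaining word
    whose vector has the highest cosine similarity to it.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: cosine_similarity(vectors[w], target))


# Hypothetical toy embeddings, written by hand for illustration only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [-0.8, 0.1, -0.1],
}

related = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
answer = analogy(toy_vectors, "man", "king", "woman")
```

    With real embeddings the results vary from run to run, so a test along 
these lines would probably check relative orderings (a related pair scoring 
higher than an unrelated one) rather than exact values.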
    
    Best,
    Kalman


> Implement Word2Vec
> ------------------
>
>                 Key: FLINK-2094
>                 URL: https://issues.apache.org/jira/browse/FLINK-2094
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Nikolaas Steenbergen
>            Assignee: Nikolaas Steenbergen
>            Priority: Minor
>              Labels: ML
>
> implement Word2Vec
> http://arxiv.org/pdf/1402.3722v1.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
