GitHub user kalmanchapman commented on the issue:

    https://github.com/apache/flink/pull/2735
  
    Hey Theodore,
    Thanks for taking a look at my PR!
    
    - I'll add docs shortly, per the examples you posted.
    - I've tested against datasets in the hundreds-of-megabytes range (using the preprocessed Wikipedia articles available [here](http://mattmahoney.net/dc/textdata)) in a distributed, HDFS-backed environment. The implementation scaled well as the data grew, although I ran into some frustrating memory issues as I increased the number of iterations.
    - I can show that the generated vectors reproduce the qualitative results of the original paper: cosine similarity tracks semantic similarity, and difference vectors can be used to form 'analogy' relationships that make sense. But you're right that training is non-deterministic, and surveying how other libraries test this is inconclusive. I've included some toy datasets in the integration tests that exercise these qualities and show good results (the first sketch after this list shows the kind of check I mean).
    - I know what you mean about the new package. I included it because the feature request was specifically for Word2Vec. But, similar to your suggestion, the class in the nlp package is really just a wrapper around a generic embedding algorithm that can operate on any word-in-sentence-like data. The ContextEmbedder class, in the optimization package, is where the actual embedding happens (the second sketch below outlines that split).
    That said, optimization might not be the right home either (although we are optimizing toward some minimum).
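    
    To make the evaluation point concrete, here is a minimal sketch of the cosine-similarity and analogy checks I have in mind. It assumes the learned vectors are available as a plain `Map[String, Array[Double]]`; none of these helper names come from the PR itself.
    
    ```scala
    import scala.math.sqrt
    
    object EmbeddingChecks {
      type Vec = Array[Double]
    
      // Cosine similarity between two dense vectors.
      def cosine(a: Vec, b: Vec): Double = {
        val dot = a.zip(b).map { case (x, y) => x * y }.sum
        dot / (sqrt(a.map(x => x * x).sum) * sqrt(b.map(x => x * x).sum))
      }
    
      private def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
      private def sub(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }
    
      // "a is to b as c is to ?": rank the rest of the vocabulary by
      // cosine similarity to (b - a + c) and return the best match.
      def analogy(vocab: Map[String, Vec])(a: String, b: String, c: String): String =
        (vocab - a - b - c)
          .maxBy { case (_, v) => cosine(v, add(sub(vocab(b), vocab(a)), vocab(c))) }
          ._1
    }
    
    // e.g. analogy(vectors)("man", "king", "woman") should ideally return "queen"
    ```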
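    
    And here is a rough sketch of the wrapper/embedder split, just to illustrate the shape of the design. Apart from the names Word2Vec and ContextEmbedder, the signatures here are hypothetical, not the PR's actual API:
    
    ```scala
    // Generic embedder (optimization package): learns one vector per distinct
    // item from (target, context) co-occurrence pairs. Signature is illustrative.
    trait ContextEmbedder[T] {
      def optimize(pairs: Seq[(T, Seq[T])]): Map[T, Array[Double]]
    }
    
    // nlp-package wrapper: its only job is to turn sentences into sliding
    // target/context windows and hand them to the generic embedder.
    class Word2Vec(windowSize: Int, embedder: ContextEmbedder[String]) {
      def fit(sentences: Seq[Seq[String]]): Map[String, Array[Double]] = {
        val pairs = for {
          sentence <- sentences
          (word, i) <- sentence.zipWithIndex
        } yield {
          val context = sentence.slice(math.max(0, i - windowSize), i) ++
                        sentence.slice(i + 1, i + 1 + windowSize)
          (word, context)
        }
        embedder.optimize(pairs)
      }
    }
    ```
    
    Anything that can be phrased as items-with-surrounding-context could reuse the same embedder, which is why the nlp class stays so thin.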
    
    Best,
    Kalman


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to