[ https://issues.apache.org/jira/browse/FLINK-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634747#comment-15634747 ]
ASF GitHub Bot commented on FLINK-2094:
---------------------------------------

Github user kalmanchapman commented on the issue:

https://github.com/apache/flink/pull/2735

Hey Theodore,

Thanks for taking a look at my PR!

- I'll add docs shortly, per the examples you posted.
- I've tested against datasets in the hundreds-of-megabytes range (using the preprocessed Wikipedia articles available [here](http://mattmahoney.net/dc/textdata)) in a distributed, HDFS-supported environment. The implementation worked well as the scale of the data increased, although I ran into some frustrating memory issues as I increased the number of iterations performed.
- I can show that the generated vectors give good results along the lines of the original paper: semantic similarity between words is reflected in the cosine similarity of their vectors, and difference vectors can be used to create 'analogy' relationships that make sense. But you're right that it's non-deterministic, and surveying how it's tested in other libraries is inconclusive. I've included some toy datasets in the integration tests that show good results and exercise these qualities.
- I know what you mean about the new package. I included it because the feature request was specifically for Word2Vec. But, similar to your suggestion, the class in the nlp package is really just a wrapper around a generic embedding algorithm that can work on any data that is word-in-sentence-like. The ContextEmbedder class, in the optimization package, is where the actual embedding occurs.
That said, optimization might not be the right home either (although we are optimizing toward some minima).

Best,
Kalman

> Implement Word2Vec
> ------------------
>
> Key: FLINK-2094
> URL: https://issues.apache.org/jira/browse/FLINK-2094
> Project: Flink
> Issue Type: Improvement
> Components: Machine Learning Library
> Reporter: Nikolaas Steenbergen
> Assignee: Nikolaas Steenbergen
> Priority: Minor
> Labels: ML
>
> implement Word2Vec
> http://arxiv.org/pdf/1402.3722v1.pdf

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)