On Sun, Jun 12, 2016 at 4:50 AM, Abhinav Upadhyay <er.abhinav.upadh...@gmail.com> wrote:
> Hi All,
>
> I came across an interesting paper from Google on machine learning[1],
> where they came up with an efficient representation for words from a
> corpus. Such representations are generally called word embeddings,
> and they have named their method word2vec.
>
> It is a two-layer neural network which, given a corpus as input,
> produces a set of word vectors as its output. These vectors represent
> each word in the corpus in a vector space, where words with similar
> semantics lie nearer to each other in that space.
>
> There are two model architectures for training:
> 1. Continuous bag of words (CBOW): predicts a word from the words
> surrounding it. The ordering of those context words is not
> considered.
> 2. Skip-gram: the reverse. Given a word w1, it predicts the words
> around it, i.e. the probability of a word w2 appearing within a
> window of w1.
>
> They have shown interesting implications of this. For example,
> "France" and "Italy" are close to each other in the model that they
> trained. Another interesting observation is the application of vector
> algebra here. For example, they show that the vector nearest to
>
> vector(king) - vector(man) + vector(woman)
>
> is vector(queen).
>
> This technique is becoming widely popular and has applications in
> areas like search, question answering and summarization. I've trained
> it on our man page corpus (plus some man pages from pkgsrc) and
> put a demo here: https://man-k.org/words/
>
> Some of the interesting queries that I found:
> bug: gives <defect, problem, undetected, lurk etc> in the top results
> man: gives <mdoc, html, overview, readme>
> netbsd: shows <freebsd, openbsd, ultrix, linux>
> christos: <zoulas, cornell> in the top two
>
> Give it a try and let me know how you like it. :)
>
> Coming Soon: I still need to implement the interface for doing vector
> addition and subtraction.
>
> BUGS: Use single word queries in non-plural form for best experience ;)
>
> [1] http://arxiv.org/pdf/1301.3781.pdf

Just fixed the internal server error. I guess I broke it right after
sending the email :(

- Abhinav
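
A minimal sketch of how the vector addition and subtraction queries mentioned in the quoted mail could work. This is not the code behind man-k.org: the three-dimensional vectors are made up for illustration (real word2vec vectors are learned from a corpus and typically have hundreds of dimensions), and the names `VECTORS`, `cosine`, `nearest` and `analogy` are hypothetical. It assumes that "nearer" means higher cosine similarity and that the query words themselves are left out of the results.

```python
import math

# Toy embeddings, invented for illustration only (assumption, not real data).
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "bug":   [0.5, 0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, exclude=(), topn=3):
    """Return the topn words whose vectors are most similar to target."""
    scored = [
        (cosine(target, vec), word)
        for word, vec in VECTORS.items()
        if word not in exclude
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:topn]]


def analogy(positive, negative, topn=1):
    """Add the positive word vectors, subtract the negative ones,
    and look up the nearest remaining words."""
    dims = len(next(iter(VECTORS.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, VECTORS[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, VECTORS[word])]
    # Leave the query words out, otherwise they tend to dominate the results.
    return nearest(target, exclude=set(positive) | set(negative), topn=topn)


# vector(king) - vector(man) + vector(woman)
print(analogy(["king", "woman"], ["man"]))  # -> ['queen']
```

With these toy vectors the result is an exact ranking, but on a real model the answer is only the nearest neighbour of the computed vector, not an equality.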