Hi All,

I came across an interesting machine learning paper from Google[1], in which they propose an efficient vector representation for the words of a corpus. Such representations are generally called word embeddings, and they call their method word2vec.
word2vec is a two-layer neural network which, given a corpus as input, produces a set of word vectors as output. These vectors place each word of the corpus in a vector space where words with similar semantics lie near each other. There are two training methods:

1. Continuous bag of words (CBOW): the ordering of the words is not considered; the model predicts a word from the surrounding words, treated as an unordered bag.

2. Skip-gram: the ordering of the words is taken into account; given a word w1, the model predicts the probability of a word w2 appearing in its context.

The paper shows some interesting consequences of this. For example, "France" and "Italy" end up close to each other in the model they trained. Another interesting observation is that vector algebra works on the embeddings; for example:

    vector(king) - vector(man) + vector(woman) ~= vector(queen)

This technique is becoming widely popular and has applications in areas like search, question answering, and summarization.

I've trained a model on our man page corpus (plus some man pages from pkgsrc) and put a demo here:

    https://man-k.org/words/

Some of the interesting queries that I found:

    bug:      gives <defect, problem, undetected, lurk, ...> in the top results
    man:      gives <mdoc, html, overview, readme>
    netbsd:   shows <freebsd, openbsd, ultrix, linux>
    christos: <zoulas, cornell> in the top two

Give it a try and let me know how you like it. :) A rough sketch of how to train and query such a model is in the P.S. below.

Coming Soon: I still need to implement the interface for doing vector addition and subtraction.

BUGS: Use single-word queries in non-plural form for the best experience ;)

[1] http://arxiv.org/pdf/1301.3781.pdf

Regards,
Abhinav
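P.S. For anyone who wants to play with word embeddings locally: below is a minimal sketch of training and querying a word2vec model with the gensim Python library. This is not what powers the demo above, and the corpus directory, tokenisation, and parameters are placeholder assumptions; treat it as a starting point only.

    import os
    import re

    from gensim.models import Word2Vec   # pip install gensim (4.x API assumed)

    # Hypothetical directory of plain-text man pages, e.g. produced with
    # "man <page> | col -b > <page>.txt"; point this at your own corpus.
    CORPUS_DIR = "/tmp/manpage-text"

    def sentences(corpus_dir):
        """Yield each man page as a list of lower-cased word tokens."""
        for name in os.listdir(corpus_dir):
            path = os.path.join(corpus_dir, name)
            with open(path, errors="ignore") as f:
                yield re.findall(r"[a-z][a-z0-9-]+", f.read().lower())

    # sg=1 selects the skip-gram model; sg=0 would select CBOW instead.
    model = Word2Vec(list(sentences(CORPUS_DIR)),
                     vector_size=100, window=5, min_count=5, sg=1, workers=4)

    # Nearest neighbours of a word, like the queries on the demo page.
    print(model.wv.most_similar("bug", topn=5))

    # Vector arithmetic: king - man + woman ~= queen (needs a large corpus).
    print(model.wv.most_similar(positive=["king", "woman"],
                                negative=["man"], topn=1))

The second query is the kind of vector addition/subtraction mentioned under "Coming Soon": gensim's most_similar() takes positive and negative word lists and returns the words closest to the resulting vector.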