It should be the cosine similarity, yes. I think this is what was fixed in https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was really just outputting the 'unnormalized' similarity (dot / norm(a) only), but the docs said cosine similarity. As of Spark 2 it is the cosine similarity. The normalization most certainly matters here, and it's the opposite of what you say: dividing the dot product by both vector norms is exactly what gives you the cosine.
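To make the distinction concrete, here's a small NumPy sketch (illustrative values, not Spark's code): dividing by only one norm, as the old behavior did, can produce values above 1, while dividing by both norms keeps the result in [-1, 1].

```python
import numpy as np

# Two parallel vectors of different lengths (illustrative example).
a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])

dot = a.dot(b)                       # raw dot product: unbounded
half = dot / np.linalg.norm(a)       # divided by one norm only: can exceed 1
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # true cosine: in [-1, 1]

print(dot, half, cosine)  # -> 50.0 10.0 1.0
```

Here the vectors point in the same direction, so the cosine is exactly 1, while the half-normalized value is 10 — which is the kind of ">1 similarity" the old behavior could report.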
Although the docs can always be better (and this was a case where they were wrong), all of this comes with javadoc and examples. Right now at least, .transform() describes the operation as you do, so it is documented. I'd propose you invest in improving the docs rather than saying 'this isn't what I expected'. (No, our book isn't a reference for MLlib; it's more like worked examples.)

On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi <tr.man...@gmail.com> wrote:

> I used the word2vec algorithm of Spark to compute document vectors of a text.
>
> I then used the findSynonyms function of the model object to get synonyms of a few words.
>
> I see something like this:
>
> I do not understand why the cosine similarity is being calculated as more than 1. Cosine similarity should be between 0 and 1, or at most between -1 and +1 (taking negative angles).
>
> Why is it more than 1 here? What's going wrong here?
>
> Please note, normalization of the vectors should not change the cosine similarity values since the formula remains the same. If you normalize, it's just a dot product then; if you don't, it's dot product / (normA * normB).
>
> I am facing a lot of issues with respect to understanding or interpreting the output of Spark's ML algos. The documentation is not very clear, and there is hardly anything mentioned with respect to how and what is being returned.
>
> For example, the word2vec algorithm is meant to convert a word to vector form. So I would expect the .transform method to give me a vector for each word in the text.
>
> However, .transform basically returns doc2vec (it averages all the word vectors of a text). This is confusing since none of this is mentioned in the docs, and I keep wondering why I have only one vector instead of word vectors for all the words.
>
> I do understand that returning doc2vec is helpful, since one doesn't have to average out each word vector for the whole text. But the docs don't help or explicitly say that.
> This ends up wasting a lot of time in just figuring out what is being returned from an algorithm in Spark.
>
> Does someone have a better solution for this?
>
> I have read the Spark book. That is not about MLlib. I am not sure if Sean's book would cover all the documentation aspects better than what we currently have on the docs page.
>
> Thanks
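For what it's worth, the averaging behavior described above can be sketched in a few lines. This is an illustration with a made-up toy vocabulary, not Spark's implementation: .transform() on a Word2VecModel returns one vector per document by averaging the vectors of the document's words.

```python
import numpy as np

# Hypothetical tiny vocabulary of learned word vectors (illustrative values only).
word_vectors = {
    "spark": np.array([0.2, 0.4]),
    "is":    np.array([0.0, 0.1]),
    "fast":  np.array([0.4, 0.3]),
}

def doc_vector(tokens):
    """Average the vectors of the known tokens -- the doc2vec-style
    behavior the thread describes for .transform()."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

# One vector comes back for the whole document, not one per word.
print(doc_vector(["spark", "is", "fast"]))  # -> [0.2, 0.26666667]
```

This is why a single vector comes back for a multi-word document: it is the element-wise mean of the word vectors, which you would otherwise compute yourself.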