Cosine Similarity Implementation in Spark

Manish Tripathi Mon, 30 Jan 2017 15:21:41 -0800

I have a data frame with two columns, (id, vector). The first column
holds the ID of the document, while the second column holds that
document's tf-idf values as a Vector.


I want to use DIMSUM for cosine similarity, but unfortunately I am on Spark
1.x. It looks like these methods are implemented only from Spark 2.x onwards,
so the corresponding cosine similarity method for RowMatrix
(columnSimilarities) is not available to me.

So I thought maybe I could use the cosine similarity method of the
IndexedRowMatrix object instead, since the IndexedRowMatrix docs list a
corresponding cosine similarity method.

So here are a few questions about this.

1) How do I first convert my Spark data frame to IndexedRowMatrix
format?

2) Does the cosine similarity method in IndexedRowMatrix also use DIMSUM, as
the cosine similarity method of RowMatrix does?

3) With RowMatrix in Scala, I do have access to the cosine similarity
method. However, it returns a matrix of similarities with no row
indices (since RowMatrix is an index-less matrix). So how do I infer the
cosine similarity of each doc id with every other doc id from the RowMatrix
output?
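
To make question 3 concrete, here is a minimal plain-Python sketch of the
output being asked for: a cosine similarity for each pair of documents, keyed
by the original doc ids. It is not Spark code and it does not use DIMSUM. It
is a brute-force illustration on hypothetical toy data, and the names docs,
cosine and pairwise_similarities are assumptions made up for this example.

```python
import math
from itertools import combinations

# Hypothetical toy data: doc id -> sparse tf-idf vector as {term_index: weight}.
docs = {
    101: {0: 1.0, 2: 2.0},
    205: {0: 2.0, 2: 4.0},
    317: {1: 3.0},
}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Treat an all-zero vector as similar to nothing.
    return dot / (norm_u * norm_v)


def pairwise_similarities(docs):
    """Return {(id_a, id_b): similarity} for every pair with id_a < id_b."""
    return {
        (a, b): cosine(docs[a], docs[b])
        for a, b in combinations(sorted(docs), 2)
    }


sims = pairwise_similarities(docs)
for (a, b), s in sorted(sims.items()):
    print(a, b, round(s, 4))
```

The point of the sketch is only the shape of the result: every similarity
stays attached to a pair of doc ids, which is the piece that seems to get lost
with an index-less RowMatrix.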

Please advise.

Link to docs on IndexedRowMatrix.

http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix.columnSimilarities