I have a Spark data frame with two columns, (id, vector). The first column holds the ID of the document and the second holds that document's tf-idf values as a Vector.
I want to use DIMSUM for cosine similarity, but unfortunately I am on Spark 1.x. It looks like these methods are only implemented from Spark 2.x onwards, so the corresponding cosine similarity method for RowMatrix (columnSimilarities) is not available to me. I thought I might use the same method on an IndexedRowMatrix object instead, since the docs list a corresponding cosine similarity method for IndexedRowMatrix. So here are a couple of questions:

1). How do I first convert my Spark data frame to IndexedRowMatrix format?
2). Does the cosine similarity method of IndexedRowMatrix also use DIMSUM, as the columnSimilarities method of RowMatrix does?
3). If I use Scala, I do have access to the cosine similarity method of RowMatrix. However, it returns a matrix of similarities with no row indices (since RowMatrix is an index-less matrix). So how do I infer the cosine similarity of each doc id with the others from the RowMatrix output?

Please advise.

Link to the docs: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix.columnSimilarities
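To make the output I'm after concrete, here is a minimal plain-Python sketch (no Spark, exact rather than DIMSUM-sampled) of the result I would like: one cosine similarity per pair of document ids. The ids, vectors, and function names below are made up for illustration and are not part of any Spark API.

```python
import math
from itertools import combinations

# Hypothetical toy data standing in for the (id, tf-idf vector) rows.
docs = [
    (101, [1.0, 0.0, 2.0]),
    (205, [2.0, 0.0, 4.0]),
    (309, [0.0, 3.0, 0.0]),
]


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pairwise_similarities(rows):
    """Map each unordered pair of doc ids to its cosine similarity."""
    return {
        (id_a, id_b): cosine(vec_a, vec_b)
        for (id_a, vec_a), (id_b, vec_b) in combinations(rows, 2)
    }


sims = pairwise_similarities(docs)
for (id_a, id_b), score in sorted(sims.items()):
    print(id_a, id_b, round(score, 4))
```

The point is that every similarity stays keyed by the original doc ids, which is what I don't know how to recover from the index-less RowMatrix output.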