Hi, I have a collection of text documents, and I have extracted the list of significant terms from that collection. I want to calculate a co-occurrence matrix for the extracted terms using Spark.
I have stored the collection of text documents in a DataFrame:

```java
StructType schema = new StructType(new StructField[] {
    new StructField("ID", DataTypes.StringType, false, Metadata.empty()),
    new StructField("text", DataTypes.StringType, false, Metadata.empty())
});

// Create a DataFrame with the new schema
DataFrame preProcessedDF = sqlContext.createDataFrame(jrdd, schema);
```

I can extract the list of terms from `preProcessedDF` into a List, an RDD, or a DataFrame. For each pair (term_i, term_j) I want to calculate the co-occurrence frequency over the original dataset `preProcessedDF`. Does anyone have a scalable solution?

Thank you,
Donni
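To make the question more concrete, below is a rough sketch of the naive pair-counting approach I have in mind: broadcast the term list, keep the significant terms found in each document, emit every unordered pair once per document, and sum with reduceByKey. Names like `jsc` (my JavaSparkContext) and `termList` (the extracted terms) are just placeholders from my own setup, and I'm not sure this scales well when the term list or the documents get large.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Row;

import scala.Tuple2;

// termList: the extracted significant terms; jsc: my JavaSparkContext (placeholder names)
final Broadcast<Set<String>> bTerms = jsc.broadcast(new HashSet<String>(termList));

JavaPairRDD<Tuple2<String, String>, Integer> coOccurrence = preProcessedDF.javaRDD()
    .flatMapToPair(new PairFlatMapFunction<Row, Tuple2<String, String>, Integer>() {
        @Override
        public Iterable<Tuple2<Tuple2<String, String>, Integer>> call(Row row) {
            // keep only the significant terms present in this document (deduplicated, sorted)
            Set<String> present = new TreeSet<String>();
            for (String token : row.getString(1).split("\\s+")) {   // column 1 = "text"
                if (bTerms.value().contains(token)) {
                    present.add(token);
                }
            }
            // emit each unordered pair (term_i, term_j) once for this document
            List<String> sorted = new ArrayList<String>(present);
            List<Tuple2<Tuple2<String, String>, Integer>> pairs =
                new ArrayList<Tuple2<Tuple2<String, String>, Integer>>();
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    pairs.add(new Tuple2<Tuple2<String, String>, Integer>(
                        new Tuple2<String, String>(sorted.get(i), sorted.get(j)), 1));
                }
            }
            return pairs;
        }
    })
    // sum the per-document counts to get the global co-occurrence frequency per pair
    .reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer a, Integer b) {
            return a + b;
        }
    });
```

My worry is the nested loop over the terms present in each document: a document containing k significant terms emits O(k^2) pairs, so pointers to a more scalable formulation (for example via DataFrame joins or something in MLlib) would be very welcome.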