Hi Xiangrui, I am trying to implement TF-IDF following the instructions you sent in your reply to Jatin, but I am getting an error at the IDF step. Here are my steps; everything compiles up to the last line, where the compile fails.

val labeledDocs = sc.textFile("title_subcategory")
val stopwords = scala.io.Source.fromFile("stopwords.txt").getLines().toList
val labeledTerms = labeledDocs.map(_.split('\t')).map(x=>(x(2).toDouble,x(1).split(' ').map(_.toLowerCase).filter(!stopwords.contains(_)).toSeq))
val tf = new HashingTF()
val freqs = labeledTerms.map(x=>(x._1,tf.transform(x._2)))
val idf = new IDF()
val idfModel = idf.fit(freqs.values)
val vectors = freqs.map(x => LabeledPoint(x._1, idfModel.transform(x._2)))

This is where it fails with the following error:

NBContentSubcategory.scala:39: error: overloaded method value transform with alternatives:
  (dataset: org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.linalg.Vector])org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.linalg.Vector] <and>
  (dataset: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector])org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
 cannot be applied to (org.apache.spark.mllib.linalg.Vector)
    val transformedValues = idfModel.transform(values)

The compiler seems to be getting confused by the multiple (Java and Scala) transform methods. Any insights?

Thanks,
Nilesh
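
One possible workaround, sketched under the assumption that in Spark 1.1.0 IDFModel.transform accepts only an RDD of vectors (which is what the two alternatives in the error message suggest): transform all term-frequency vectors in a single call, then zip the result back onto the labels. The helper name toTfIdfPoints is hypothetical and only for illustration; the zip relies on the keys and the transformed values both being derived from the same RDD, so that partitions and element counts line up.

```scala
import org.apache.spark.SparkContext._   // pair-RDD helpers (.keys, .values) on Spark 1.1
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: turns (label, term-frequency vector) pairs into
// LabeledPoints whose features are TF-IDF weighted.
def toTfIdfPoints(freqs: RDD[(Double, Vector)]): RDD[LabeledPoint] = {
  // freqs is traversed several times (fit, transform, zip), so keep it in memory.
  freqs.cache()

  // Fit the IDF model on the vectors only.
  val idfModel = new IDF().fit(freqs.values)

  // Call the RDD[Vector] overload instead of passing a single Vector.
  val tfidf: RDD[Vector] = idfModel.transform(freqs.values)

  // Both sides come from freqs through element-wise operations,
  // so they have the same partitioning and can be zipped.
  freqs.keys.zip(tfidf).map { case (label, features) =>
    LabeledPoint(label, features)
  }
}
```

With the values from the steps above, this would be called as val vectors = toTfIdfPoints(freqs), in spark-shell or from inside an enclosing object.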