See https://gist.github.com/geoHeil/e07922229860262ceebf830859716bbf in particular:
You will probably want to use Spark's imperative (non-SQL) API:

  .rdd
  .reduceByKey { (count1, count2) => count1 + count2 }
  .map { case ((word, path), n) => (word, (path, n)) }
  .toDF

i.e. build an inverted index, which then lets you calculate TF / IDF easily. But Spark also comes with https://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf which might help you achieve the desired result with little effort. Rough sketches of both routes follow below the quoted message.

Donni Khan <prince.don...@googlemail.com> wrote on Wed, 4 Apr 2018 at 10:56:

> Hi all,
>
> I want to run a huge number of queries on a DataFrame in Spark. I have a big
> collection of text documents; I loaded all the documents into a Spark DataFrame
> and created a temp table:
>
> dataFrame.registerTempTable("table1");
>
> I have more than 50,000 terms, and I want to get the document frequency for
> each one from "table1".
>
> Currently I use the following:
>
> DataFrame df = sqlContext.sql("select count(ID) from table1 where text like '%" + term + "%'");
>
> but this approach takes a long time to finish because I have to run it from the
> Spark driver once for each term.
>
> Does anyone have an idea how I can run all the queries in a distributed way?
>
> Thank you && Best Regards,
>
> Donni
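
To make the first suggestion concrete, here is a minimal sketch of the inverted-index / document-frequency computation in Scala. It assumes Spark 2.x with a SparkSession named spark, that table1 has a text column named "text", and that a "term" is a whole lower-cased token, so it counts word-level document frequency rather than the substring semantics of LIKE '%term%'. The column name, tokenization, and the "my_terms" view are assumptions on my part, not something from your mail.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("doc-freq").getOrCreate()
  import spark.implicits._   // needed for .toDF on the resulting RDD

  // One pass over the corpus: de-duplicate tokens per document, then count
  // in how many documents each token occurs -- that is the document frequency.
  val docFreq = spark.table("table1")
    .select("text")
    .rdd
    .flatMap(row => row.getString(0).toLowerCase.split("\\W+").distinct)
    .filter(_.nonEmpty)
    .map(term => (term, 1))
    .reduceByKey((count1, count2) => count1 + count2)
    .toDF("term", "doc_freq")

  docFreq.createOrReplaceTempView("doc_freq")

  // All 50,000 lookups then become a single distributed join, e.g. against a
  // hypothetical "my_terms" view holding your query terms:
  // spark.sql("select t.term, coalesce(d.doc_freq, 0) as doc_freq " +
  //           "from my_terms t left join doc_freq d on t.term = d.term")

The point is that the driver never loops over the terms: the shuffle inside reduceByKey produces all the counts in one job, and the join answers every query at once.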
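And if the document frequencies are only a stepping stone towards TF-IDF, the MLlib page linked above already does most of the work. A sketch along the lines of that page, again assuming a "text" column and whitespace tokenization; keep in mind that HashingTF hashes each term into a fixed-size feature vector, so you get weights per hashed feature rather than a readable per-term table. If you need counts keyed by the original term strings, the inverted-index variant above is the closer fit.

  import org.apache.spark.mllib.feature.{HashingTF, IDF}
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // spark is the SparkSession from the previous sketch.
  // Tokenized documents; the column name and whitespace split are assumptions.
  val documents: RDD[Seq[String]] = spark.table("table1")
    .rdd
    .map(row => row.getAs[String]("text").toLowerCase.split("\\s+").toSeq)

  val hashingTF = new HashingTF()
  val tf: RDD[Vector] = hashingTF.transform(documents)
  tf.cache()   // IDF makes a second pass over the term-frequency vectors

  val idfModel = new IDF().fit(tf)             // document frequencies are computed here
  val tfidf: RDD[Vector] = idfModel.transform(tf)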