Thanks Kevin for your reply.

I was suspecting the same thing as well, although it still does not make
much sense to me why you would need to do both of the following when you
are using both the sqlContext and DataFrames to execute queries:

myData.cache()
sqlContext.cacheTable("myData")

As far as I understand, dataframe.select(...) and sqlContext.sql("select
...") are equivalent.
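To make my confusion concrete, here is a minimal sketch of what I would
expect a single cache call to cover (PySpark 1.5.x; the CSV path and
column names are placeholders, not my real data):

# Sketch of my expectation, not verified behaviour; the path and
# column names below are placeholders.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .options(header="false")
      .load("hdfs://somehost/somepath/file")
      .toDF("col1", "col2"))
df.registerTempTable("myData")

# One cache call, made through the DataFrame API:
df.cache()

# I would expect BOTH of these to be served from that single cache,
# since they should resolve to the same logical plan:
df.select("col1", "col2").show()
sqlContext.sql("SELECT col1, col2 FROM myData").show()

If that expectation is wrong, then needing both calls would make sense
after all.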
Kind regards,
George

On Fri, Jan 15, 2016 at 6:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com>
wrote:

> Hi George,
>
> I believe that sqlContext.cacheTable("tableName") is to be used when you
> want to cache the data that is being used within a Spark SQL query. For
> example, take a look at the code below.
>
>> val myData = sqlContext.load("com.databricks.spark.csv",
>>   Map("path" -> "hdfs://somepath/file", "header" -> "false"))
>>   .toDF("col1", "col2")
>> myData.registerTempTable("myData")
>
> Here, the usage of *cache()* will affect ONLY the *myData.select* query.
>
>> myData.cache()
>> myData.select("col1", "col2").show()
>
> Here, the usage of *cacheTable* will affect ONLY the *sqlContext.sql*
> query.
>
>> sqlContext.cacheTable("myData")
>> sqlContext.sql("SELECT col1, col2 FROM myData").show()
>
> Thanks,
> Kevin
>
> On Fri, Jan 15, 2016 at 7:00 AM, George Sigletos <sigle...@textkernel.nl>
> wrote:
>
>> According to the documentation they are exactly the same, but in my
>> queries
>>
>> dataFrame.cache()
>>
>> results in much faster execution times vs doing
>>
>> sqlContext.cacheTable("tableName")
>>
>> Is there any explanation about this? I am not caching the RDD prior to
>> creating the dataframe. Using PySpark on Spark 1.5.2.
>>
>> Kind regards,
>> George
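P.S. For completeness, this is roughly the shape of the comparison behind
my first mail (a crude single-run, wall-clock sketch, assuming the df and
the "myData" temp table registered above; not a rigorous benchmark):

import time

def timed(label, action):
    # Crude wall-clock timing of a single Spark action.
    start = time.time()
    action()
    print("%s took %.1f s" % (label, time.time() - start))

# Path 1: cache via the DataFrame API, materialize, then query.
df.cache()
timed("materialize after df.cache()", lambda: df.count())
timed("query after df.cache()",
      lambda: df.select("col1").distinct().count())
df.unpersist()

# Path 2: cache via the SQLContext, materialize, then query.
sqlContext.cacheTable("myData")
timed("materialize after cacheTable()",
      lambda: sqlContext.sql("SELECT COUNT(*) FROM myData").collect())
timed("query after cacheTable()",
      lambda: sqlContext.sql(
          "SELECT COUNT(DISTINCT col1) FROM myData").collect())
sqlContext.uncacheTable("myData")

In my actual job, the second path is the one that comes out consistently
slower.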