Hi George,

I believe that sqlContext.cacheTable("tableName") is for caching a table
that you query through Spark SQL. For example, take a look at the code
below.
val myData = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "hdfs://somepath/file", "header" -> "false")
).toDF("col1", "col2")

myData.registerTempTable("myData")

Here, the usage of *cache()* will affect ONLY the *myData.select* query:

myData.cache()
myData.select("col1", "col2").show()

Here, the usage of *cacheTable* will affect ONLY the *sqlContext.sql*
query:

sqlContext.cacheTable("myData")
sqlContext.sql("SELECT col1, col2 FROM myData").show()

Thanks,
Kevin

On Fri, Jan 15, 2016 at 7:00 AM, George Sigletos <sigle...@textkernel.nl>
wrote:

> According to the documentation they are exactly the same, but in my
> queries
>
> dataFrame.cache()
>
> results in much faster execution times vs doing
>
> sqlContext.cacheTable("tableName")
>
> Is there any explanation about this? I am not caching the RDD prior to
> creating the dataframe. Using Pyspark on Spark 1.5.2
>
> Kind regards,
> George
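P.S. If you want to verify what is actually cached, SQLContext has a
couple of helpers for that. Here is a minimal sketch in Scala, assuming
the same sqlContext and "myData" temp table as above (paste into a 1.5.x
spark-shell):

// Mark the registered table as cached (in-memory columnar format)
sqlContext.cacheTable("myData")

// Returns true once the table is marked for caching
println(sqlContext.isCached("myData"))

// Caching is lazy: the data is materialized by the first action
sqlContext.sql("SELECT col1, col2 FROM myData").show()

// Drop the table from the cache when you are done
sqlContext.uncacheTable("myData")

// The DataFrame-side equivalents
myData.cache()
myData.unpersist()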