Hi George,

I believe sqlContext.cacheTable("tableName") is meant to be used when you
want to cache the data that is referenced within a Spark SQL query. For
example, take a look at the code below.


val myData = sqlContext.load("com.databricks.spark.csv",
  Map("path" -> "hdfs://somepath/file", "header" -> "false"))
  .toDF("col1", "col2")
myData.registerTempTable("myData")


Here, calling *cache()* will affect ONLY the *myData.select* query.

myData.cache()
myData.select("col1", "col2").show()
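
Note that *cache()* is lazy: nothing is actually materialized until an action
runs over the DataFrame. A minimal sketch of that behavior (the count() call
is just one way to force materialization, not something the example above
requires):

myData.cache()                         // marks the DataFrame for caching; lazy, nothing happens yet
myData.count()                         // first action populates the in-memory cache
myData.select("col1", "col2").show()   // this scan is now served from the cached data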


Here, calling *cacheTable* will affect ONLY the *sqlContext.sql* query.

sqlContext.cacheTable("myData")
sqlContext.sql("SELECT col1, col2 FROM myData").show()
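
If you want to check or undo either kind of caching, SQLContext has helpers
for that. A small sketch reusing the names from the example above (these calls
should all be available on Spark 1.5):

println(sqlContext.isCached("myData"))  // true once the table has been cached
sqlContext.uncacheTable("myData")       // drops the table from the in-memory cache
myData.unpersist()                      // drops the DataFrame-level cache, if any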


Thanks,
Kevin

On Fri, Jan 15, 2016 at 7:00 AM, George Sigletos <sigle...@textkernel.nl>
wrote:

> According to the documentation, they are exactly the same, but in my
> queries
>
> dataFrame.cache()
>
> results in much faster execution times vs doing
>
> sqlContext.cacheTable("tableName")
>
> Is there any explanation for this? I am not caching the RDD prior to
> creating the DataFrame. I am using PySpark on Spark 1.5.2.
>
> Kind regards,
> George
>
