Thanks Kevin for your reply.

I suspected the same thing, although it still does not make much sense to me
why you would need to do both:
myData.cache()
sqlContext.cacheTable("myData")

if you are using both sqlContext and DataFrames to execute queries.

As far as I understand, dataFrame.select(...) and sqlContext.sql("select ...")
are equivalent.
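
For reference, here is a minimal sketch of what I have in mind (Spark 1.5-era
Scala API; the path and column names are just placeholders):

// Load the CSV, cache the DataFrame once, and register it as a temp table
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .load("hdfs://somepath/file")
  .toDF("col1", "col2")

df.cache()
df.registerTempTable("myData")

// As far as I understand, both of these should scan the same cached data:
df.select("col1", "col2").show()
sqlContext.sql("SELECT col1, col2 FROM myData").show()

// sqlContext.isCached("myData") reports whether the table is marked as cached

So I would expect a single cache() (or cacheTable) call to cover both query
styles, which is why needing both calls surprises me.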

Kind regards,
George

On Fri, Jan 15, 2016 at 6:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com>
wrote:

> Hi George,
>
> I believe that sqlContext.cacheTable("tableName") is to be used when you
> want to cache the data that is being used within a Spark SQL query. For
> example, take a look at the code below.
>
>
>> val myData = sqlContext.load("com.databricks.spark.csv", Map("path" ->
>> "hdfs://somepath/file", "header" -> "false")).toDF("col1", "col2")
>>
> myData.registerTempTable("myData")
>
>
> Here, the usage of *cache()* will affect ONLY the *myData.select* query.
>
>> myData.cache()
>
> myData.select("col1", "col2").show()
>
>
> Here, the usage of *cacheTable* will affect ONLY the *sqlContext.sql* query.
>
>> sqlContext.cacheTable("myData")
>
> sqlContext.sql("SELECT col1, col2 FROM myData").show()
>
>
> Thanks,
> Kevin
>
> On Fri, Jan 15, 2016 at 7:00 AM, George Sigletos <sigle...@textkernel.nl>
> wrote:
>
>> According to the documentation, they are exactly the same, but in my
>> queries
>>
>> dataFrame.cache()
>>
>> results in much faster execution times vs doing
>>
>> sqlContext.cacheTable("tableName")
>>
>> Is there any explanation for this? I am not caching the RDD prior to
>> creating the DataFrame. I am using PySpark on Spark 1.5.2.
>>
>> Kind regards,
>> George
>>
>
>
