You could try something like this:
val df = sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key=T2.key group by c1")
df.cache() // Marks the result for caching; evaluation is lazy, so nothing is computed yet.
df.registerTempTable("my_result")
// The first query against my_result triggers the computation and populates the cache.
sqlContext.sql("select * from my_result where c1=1").collect
// Later queries are served from the cache and should be much faster.
sqlContext.sql("select * from my_result where c1=1").collect
You can also cache the raw tables themselves:
sqlContext.cacheTable("T1")
sqlContext.cacheTable("T2")
These are also cached lazily, when the first query touches them, and later
queries benefit because the data is held in an in-memory columnar format.
One thing you should know: the cache cannot be shared across processes (more
precisely, it cannot be shared across SparkContext instances).
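To make the base-table approach concrete, here is a minimal, self-contained sketch against the Spark 1.3-era SQLContext API. The row types T1Row and T2Row, the sample data, and the local[*] master are assumptions made up for illustration; your real T1 and T2 will come from wherever they are stored. As far as I know, the SQL statement CACHE TABLE is eager by default (since Spark 1.2), while sqlContext.cacheTable is lazy, so please check this against the version you run.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical row types standing in for the real T1 and T2 schemas.
case class T1Row(key: Int, c1: Int)
case class T2Row(key: Int, c2: Int)

object CacheBaseTables {
  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("cache-base-tables").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up sample data, registered under the table names used in the thread.
    sc.parallelize(Seq(T1Row(1, 1), T1Row(2, 1), T1Row(3, 2))).toDF().registerTempTable("T1")
    sc.parallelize(Seq(T2Row(1, 10), T2Row(2, 20), T2Row(3, 30))).toDF().registerTempTable("T2")

    // Lazy: T1 is only marked here and is materialized by the first query that scans it.
    sqlContext.cacheTable("T1")
    // SQL form: assumed to be eager by default; "CACHE LAZY TABLE T2" would defer it.
    sqlContext.sql("CACHE TABLE T2")

    // Reports whether each table is registered with the cache manager.
    println("T1 cached: " + sqlContext.isCached("T1"))
    println("T2 cached: " + sqlContext.isCached("T2"))

    // The join from the original question; its scan of T1 fills the T1 cache.
    sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key = T2.key group by c1")
      .collect().foreach(println)

    // A brand new query on the same table can now read the in-memory columnar data.
    sqlContext.sql("select count(*) from T1").collect().foreach(println)

    // Release the memory once the tables are no longer needed.
    sqlContext.uncacheTable("T1")
    sqlContext.uncacheTable("T2")

    // Stopping the SparkContext ends the application, and the cache goes with it.
    sc.stop()
  }
}
```

Because the cache lives and dies with the SparkContext, a setup like this only pays off when the queries are issued through the same long-running application.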
-----Original Message-----
From: sequoiadb [mailto:[email protected]]
Sent: Friday, May 15, 2015 11:02 AM
To: user
Subject: question about sparksql caching
Hi all,
We are planning to use SparkSQL in a DW system. We have a question about the
caching mechanism of SparkSQL.
For example, if I have a SQL like sqlContext.sql("select c1, sum(c2) from T1,
T2 where T1.key=T2.key group by c1").cache()
Is it going to cache the final result or the raw data of each table that used
in the SQL?
Since users may run a variety of SQL queries against those tables, if only the
final result is cached, a brand new query may still take a very long time to
scan the entire table.
If this is the case, is there a better way to cache the base tables instead of
the final result?
Thanks
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected] For additional
commands, e-mail: [email protected]