That's an unfortunate documentation bug in the programming guide... We
failed to update it after making the change.
Cheng
On 2/28/15 8:13 AM, Deborah Siegel wrote:
Hi Michael,
Would you help me understand the apparent difference here?
The Spark 1.2.1 programming guide indicates:
"Note that if you call |schemaRDD.cache()| rather than
|sqlContext.cacheTable(...)|, tables will /not/ be cached using the
in-memory columnar format, and therefore
|sqlContext.cacheTable(...)| is strongly recommended for this use case."
Yet the API doc says:
def cache(): SchemaRDD.this.type
    Overridden cache function will always use the in-memory columnar caching.
(from https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html)
Links:
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD
Thanks
Sincerely
Deb
On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust
<mich...@databricks.com> wrote:
From Zhan Zhang's reply, yes, I still get Parquet's advantages.
You will need to at least use SQL or the DataFrame API (coming in
Spark 1.3) to specify the columns that you want in order to get
the Parquet benefits. The rest of your operations can be
standard Spark.
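For example, something along these lines (a minimal sketch against the
Spark 1.2 API; the file path, table name, and column names are
placeholders I made up):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "parquet-pruning-sketch")
val sqlContext = new SQLContext(sc)

// Hypothetical Parquet data set; path and columns are placeholders.
val events = sqlContext.parquetFile("/tmp/events.parquet")
events.registerTempTable("events")

// Only the referenced columns are read from the Parquet files (column
// pruning); the result is an ordinary RDD of Rows, so everything
// downstream can be plain Spark.
val narrowed = sqlContext.sql("SELECT userId, ts FROM events")
narrowed.map(row => (row.getString(0), row.getLong(1))).take(10).foreach(println)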
My next question is: if I operate on a SchemaRDD, will I get the
advantage of Spark SQL's in-memory columnar store when the table is
cached using cacheTable()?
Yes, as of Spark 1.2, SchemaRDDs always use the in-memory columnar
cache for both cacheTable() and .cache().
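Roughly, both of these call sites end up in the columnar cache (again a
sketch against the 1.2 API; the "people" table and its JSON source are
hypothetical):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "caching-sketch")
val sqlContext = new SQLContext(sc)

// Hypothetical table; the JSON source and name are placeholders.
val people = sqlContext.jsonFile("/tmp/people.json")
people.registerTempTable("people")

// In Spark 1.2+ both of these build the in-memory columnar representation:
sqlContext.cacheTable("people")   // cache by registered table name
people.cache()                    // overridden SchemaRDD.cache() does the same

// Drop the columnar buffers when finished.
sqlContext.uncacheTable("people")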