Hi Michael,

Would you help me understand the apparent discrepancy here?
The Spark 1.2.1 programming guide says:

"Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will *not* be cached using the in-memory columnar format, and therefore sqlContext.cacheTable(...) is strongly recommended for this use case."

Yet the API doc for SchemaRDD says the opposite:

def cache(): SchemaRDD.this.type
Overridden cache function will always use the in-memory columnar caching.
https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html

Links:
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD

Thanks,
Sincerely,
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust <mich...@databricks.com> wrote:

>> From Zhan Zhang's reply, yes I still get the parquet advantages.
>
> You will need to at least use SQL or the DataFrame API (coming in Spark
> 1.3) to specify the columns that you want in order to get the parquet
> benefits. The rest of your operations can be standard Spark.
>
>> My next question is, if I operate on a SchemaRDD, will I get the
>> advantage of Spark SQL's in-memory columnar store when I cache the
>> table using cacheTable()?
>
> Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable and
> .cache() since Spark 1.2+.
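
For reference, a minimal sketch (not from the thread) of the two caching paths and the column-pruning point being discussed, assuming Spark 1.2 with an existing SparkContext named sc; the Parquet path and table name are made up for illustration:

import org.apache.spark.sql.SQLContext

// sc is an existing SparkContext (assumed)
val sqlContext = new SQLContext(sc)

// Load a Parquet file into a SchemaRDD; the path is hypothetical.
val people = sqlContext.parquetFile("hdfs:///data/people.parquet")
people.registerTempTable("people")

// Per Michael's reply, in Spark 1.2+ either call caches the data in the
// in-memory columnar format:
sqlContext.cacheTable("people")   // cache by registered table name
// people.cache()                 // or cache the SchemaRDD directly

// Parquet column pruning only applies when the columns are named via SQL
// (or the DataFrame API in 1.3); a plain RDD map over all fields would
// read every column.
val names = sqlContext.sql("SELECT name FROM people WHERE age > 21")
names.collect().foreach(println)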