That's an unfortunate documentation bug in the programming guide... We failed to update it after making the change.

Cheng

On 2/28/15 8:13 AM, Deborah Siegel wrote:
Hi Michael,

Would you help me understand the apparent difference here?

The Spark 1.2.1 programming guide indicates:

"Note that if you call |schemaRDD.cache()| rather than |sqlContext.cacheTable(...)|, tables will /not/ be cached using the in-memory columnar format, and therefore |sqlContext.cacheTable(...)| is strongly recommended for this use case."

Yet the API doc shows:


        def cache(): SchemaRDD.this.type

        Overridden cache function will always use the in-memory
        columnar caching.
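
To make sure I'm reading this right, the two calls in question would be
something like the following (just a rough sketch on my end; the "people"
table is hypothetical and already registered):

        sqlContext.cacheTable("people")                  // guide: in-memory columnar format

        val rdd = sqlContext.sql("SELECT * FROM people")
        rdd.cache()                                      // guide: not columnar; scaladoc: columnar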



links
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD

Thanks
Sincerely
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust <mich...@databricks.com> wrote:

        From Zhan Zhang's reply, yes, I still get Parquet's advantages.

    You will need to at least use SQL or the DataFrame API (coming in
    Spark 1.3) to specify the columns that you want in order to get the
    Parquet benefits. The rest of your operations can be standard Spark.
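
    For example, something along these lines (just a sketch; the path,
    table name, and column names are hypothetical) lets Parquet read only
    the columns you ask for:

        // Hypothetical path, table name, and columns -- adjust for your data.
        val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
        events.registerTempTable("events")

        // Selecting specific columns lets Parquet skip the rest on disk.
        val narrowed = sqlContext.sql("SELECT user_id, ts FROM events")

        // From here on it is a normal RDD of Rows.
        narrowed.take(10).foreach(println)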

        My next question is: if I operate on a SchemaRDD, will I get the
        advantage of Spark SQL's in-memory columnar store when caching the
        table using cacheTable()?


    Yes, since Spark 1.2, SchemaRDDs always use the in-memory columnar
    cache for both cacheTable() and .cache().
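
    A minimal example ("events" here is a hypothetical table that has
    already been registered, as in the snippet above):

        sqlContext.cacheTable("events")                   // columnar in-memory cache
        sqlContext.sql("SELECT * FROM events").cache()    // also columnar on 1.2+

        // cacheTable's counterpart for freeing the memory:
        sqlContext.uncacheTable("events")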


