That's an unfortunate documentation bug in the programming guide... We
failed to update it after making the change.
Cheng
On 2/28/15 8:13 AM, Deborah Siegel wrote:
Hi Michael,
Would you help me understand the apparent difference here?
The Spark 1.2.1 programming guide indicates:
"Note that if you call |schemaRDD.cache()| rather than
|sqlContext.cacheTable(...)|, tables will /not/ be cached using the
in-memory columnar format, and therefore
|sqlContext.cacheTable(...)| is strongly recommended for this use case."
Yet the API doc says:
def cache(): SchemaRDD.this.type
    Overridden cache function will always use the in-memory columnar caching.
(from https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html)
Links:
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD
Thanks
Sincerely
Deb
On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust
<mich...@databricks.com> wrote:
From Zhan Zhang's reply, yes, I still get Parquet's advantages.
You will need to at least use SQL or the DataFrame API (coming in
Spark 1.3) to specify the columns that you want in order to get
the Parquet benefits. The rest of your operations can be
standard Spark.
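For example, something along these lines (a minimal sketch against the
Spark 1.2 API; the file path, table name, and column names are
placeholders I made up):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "parquet-pruning-sketch")
val sqlContext = new SQLContext(sc)

// Hypothetical Parquet data set; path and columns are placeholders.
val events = sqlContext.parquetFile("/tmp/events.parquet")
events.registerTempTable("events")

// Only the referenced columns are read from the Parquet files (column
// pruning); the result is an ordinary RDD of Rows, so everything
// downstream can be plain Spark.
val narrowed = sqlContext.sql("SELECT userId, ts FROM events")
narrowed.map(row => (row.getString(0), row.getLong(1))).take(10).foreach(println)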
My next question is: if I operate on a SchemaRDD, will I get the
advantage of Spark SQL's in-memory columnar store when the table is
cached using cacheTable()?
Yes, as of Spark 1.2, SchemaRDDs always use the in-memory columnar
cache for both cacheTable() and .cache().
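Roughly, both of these call sites end up in the columnar cache (again a
sketch against the 1.2 API; the "people" table and its JSON source are
hypothetical):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "caching-sketch")
val sqlContext = new SQLContext(sc)

// Hypothetical table; the JSON source and name are placeholders.
val people = sqlContext.jsonFile("/tmp/people.json")
people.registerTempTable("people")

// In Spark 1.2+ both of these build the in-memory columnar representation:
sqlContext.cacheTable("people")   // cache by registered table name
people.cache()                    // overridden SchemaRDD.cache() does the same

// Drop the columnar buffers when finished.
sqlContext.uncacheTable("people")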