Hi Michael,

Would you help me understand the apparent discrepancy here?
The Spark 1.2.1 programming guide says:

"Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will *not* be cached using the in-memory columnar format, and therefore sqlContext.cacheTable(...) is strongly recommended for this use case."

Yet the API doc for SchemaRDD says the opposite:

def cache(): SchemaRDD.this.type
Overridden cache function will always use the in-memory columnar caching.
https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html

Links:
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD

Thanks,
Sincerely,
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust <mich...@databricks.com> wrote:

>> From Zhan Zhang's reply, yes I still get the parquet advantages.
>
> You will need to at least use SQL or the DataFrame API (coming in Spark
> 1.3) to specify the columns that you want in order to get the parquet
> benefits. The rest of your operations can be standard Spark.
>
>> My next question is, if I operate on a SchemaRDD, will I get the
>> advantage of Spark SQL's in-memory columnar store when I cache the
>> table using cacheTable()?
>
> Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable and
> .cache() since Spark 1.2+.
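
For reference, a minimal sketch (not from the thread) of the two caching paths and the column-pruning point being discussed, assuming Spark 1.2 with an existing SparkContext named sc; the Parquet path and table name are made up for illustration:

import org.apache.spark.sql.SQLContext

// sc is an existing SparkContext (assumed)
val sqlContext = new SQLContext(sc)

// Load a Parquet file into a SchemaRDD; the path is hypothetical.
val people = sqlContext.parquetFile("hdfs:///data/people.parquet")
people.registerTempTable("people")

// Per Michael's reply, in Spark 1.2+ either call caches the data in the
// in-memory columnar format:
sqlContext.cacheTable("people")   // cache by registered table name
// people.cache()                 // or cache the SchemaRDD directly

// Parquet column pruning only applies when the columns are named via SQL
// (or the DataFrame API in 1.3); a plain RDD map over all fields would
// read every column.
val names = sqlContext.sql("SELECT name FROM people WHERE age > 21")
names.collect().foreach(println)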