If you are running on Spark 1.1 or earlier, you'll want to use
rdd.registerTempTable(<tableName>) followed by
sqlContext.cacheTable(<tableName>) and then query that table.  rdd.cache()
does not use the optimized in-memory columnar format and thus puts a lot of
pressure on the GC.  This is fixed in Spark 1.2, where .cache() should do
what you want.

I'll also note that caching in SQL will actually make things slower if the
data does not fit in memory.  So you should look at the Storage tab of the
Spark web UI and make sure all of the partitions actually fit.

On Sun, Nov 2, 2014 at 8:47 PM, Shailesh Birari <sbir...@wynyardgroup.com>
wrote:

> Hello,
>
> I have written a Spark SQL application which reads data from HDFS and
> queries it.
> The data size is around 2 GB (30 million records). The schema and query I am
> running are below.
> The query takes around 5+ seconds to execute.
> I tried by adding
>        rdd.persist(StorageLevel.MEMORY_AND_DISK)
> and
>        rdd.cache()
> but in both cases it takes extra time, even when I run the query below a
> second time (assuming Spark would have cached the data after the first
> query).
>
> case class EventDataTbl(ID: String,
>                 ONum: String,
>                 RNum: String,
>                 Timestamp: String,
>                 Duration: String,
>                 Type: String,
>                 Source: String,
>                 OName: String,
>                 RName: String)
>
> sql("SELECT COUNT(*) AS Frequency,ONum,OName,RNum,RName FROM EventDataTbl
> GROUP BY ONum,OName,RNum,RName ORDER BY Frequency DESC LIMIT
> 10").collect().foreach(println)
>
> Can you let me know if I am missing anything?
>
> Thanks,
>   Shailesh
>
