Sun,

   When the executor does not have enough memory and still tries to cache the
data, it spends a lot of time on GC, and the job becomes slow. Either,

     1. Allocate enough executor memory to cache the whole RDD, so the job
completes fast,
Or 2. Don't cache when there is not enough executor memory (see the sketch
after this list).
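
If caching is still wanted but the heap is too small for MEMORY_ONLY, one
option (not mentioned in this thread, just a hedged illustration; the input
path is a placeholder) is to persist with a storage level that spills to
local disk:

    import org.apache.spark.storage.StorageLevel

    // cache() is shorthand for persist(MEMORY_ONLY); with a small heap the
    // executor churns on GC trying to keep everything on-heap.
    // MEMORY_AND_DISK lets partitions that don't fit spill to local disk.
    val rdd = sc.textFile("hdfs:///path/to/input")   // placeholder path
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()   // first action materializes the cached partitions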

  To check the GC time, pass  --conf
"spark.executor.extraJavaOptions=-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps" while submitting the job; the executor stdout under
SPARK_WORKER_DIR will then contain the GC log.
The stdout will show many "Full GC" events when cache is used and the
executor does not have enough heap.
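
For example, a full submit command might look like the following (the
application class, master URL, memory size and jar name are placeholders,
not from this thread):

    # class, master, memory and jar below are only illustrative
    ./bin/spark-submit \
      --class com.example.MyApp \
      --master spark://master:7077 \
      --executor-memory 4G \
      --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      my-app.jar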


Thanks,
Prabhu Joseph

On Thu, Feb 4, 2016 at 11:25 AM, fightf...@163.com <fightf...@163.com>
wrote:

> Hi,
>
> I want to make sure that caching a table would indeed accelerate SQL
> queries. Here is one of my use cases :
>   impala table size : 24.59GB,  no partitions, with about 1 billion+ rows.
> I use sqlContext.sql to run queries over this table and tried the cache
> and uncache commands to see if there
> is any performance disparity. I ran the following query :
> select * from video1203 where id > 10 and id < 20 and added_year != 1989
> I can see the following results :
>
> 1  If I did not cache the table and just ran sqlContext.sql(), the above
> query ran in about 25 seconds.
> 2  If I first ran sqlContext.cacheTable("video1203"), the query ran
> super slow and caused a driver OOM exception, but I could
> get the final results after running for about 9 minutes.
>
> Could any expert explain this for me ? I can see that cacheTable causes
> the OOM just because the in-memory columnar storage
> cannot hold the 24.59GB+ table in memory. But why is the performance
> so different and even so bad ?
>
> Best,
> Sun.
>
> ------------------------------
> fightf...@163.com
>