Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-27 Thread Kazuaki Ishizaki
: user Date: 2016/10/25 17:33 Subject: Re: Spark SQL is slower when DataFrame is cache in Memory Hi Kazuaki, I print a debug log right before I call collect, and compare it against the job start log (which is available when debug logging is turned on). Anyway, I test that in

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-25 Thread Chin Wei Low
> Best Regards, > Kazuaki Ishizaki > > > > From: Chin Wei Low > To: Kazuaki Ishizaki/Japan/IBM@IBMJP > Cc: user@spark.apache.org > Date: 2016/10/10 11:33 > > Subject: Re: Spark SQL is slower when DataFrame is cache in Memory >

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-24 Thread Kazuaki Ishizaki
:Re: Spark SQL is slower when DataFrame is cache in Memory Hi Ishizaki san, Thanks for the reply. So, when I pre-cache the DataFrame, the cache is used during job execution. Actually there are 3 events: 1. call res.collect 2. job started 3. job completed I am concerning

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-09 Thread Chin Wei Low
;) > res.explain(true) > res.collect() > > Am I misunderstanding something? > > Best Regards, > Kazuaki Ishizaki > > > > From: Chin Wei Low > To: Kazuaki Ishizaki/Japan/IBM@IBMJP > Cc: user@spark.apache.org > Date:

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-07 Thread Kazuaki Ishizaki
e.org Date: 2016/10/07 20:06 Subject: Re: Spark SQL is slower when DataFrame is cache in Memory Hi Ishizaki san, So there is a gap between calling res.collect and the point where I see this log: spark.SparkContext: Starting job: collect at :26 Do you mean that, during this time, Spark has already started to

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-07 Thread Chin Wei Low
Hi Ishizaki san, So there is a gap between calling res.collect and the point where I see this log: spark.SparkContext: Starting job: collect at :26 Do you mean that, during this time, Spark has already started to get data from the cache? Shouldn't it only get the data after the job has started and the tasks are distributed?
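
One way to measure the gap described above, without relying on debug logs, is to register a SparkListener and compare the scheduler's job start and end timestamps against the time collect was called. This is only a minimal sketch under assumptions: the object name, the app name, the local master and the generated data are all made up for illustration, and the real query in this thread is not shown in the archive.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

// Hypothetical driver program; names and data are illustrative only.
object CollectGapTiming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collect-gap-timing") // assumed app name
      .master("local[*]")            // assumed local run
      .getOrCreate()

    // Events 2 and 3: print the timestamps the scheduler attaches to each job.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println(s"job ${jobStart.jobId} started at ${jobStart.time} ms")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"job ${jobEnd.jobId} ended at ${jobEnd.time} ms")
    })

    // Stand-in data: pre-cache a DataFrame and materialize the cache.
    val df = spark.range(0L, 1000000L).toDF("id")
    df.cache()
    df.count()

    val res = df.filter("id % 2 = 0")

    // Event 1: the moment collect is called on the driver.
    val callTime = System.currentTimeMillis()
    println(s"collect called at $callTime ms")
    val rows = res.collect()
    val returnTime = System.currentTimeMillis()
    println(s"collect returned ${rows.length} rows at $returnTime ms")

    spark.stop()
  }
}
```

The difference between "collect called" and the following "job started" line is the time spent on the driver before the job is submitted. Listener callbacks are delivered asynchronously, so the printed lines may appear out of order even though the timestamps themselves are comparable.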

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-07 Thread Kazuaki Ishizaki
Hi, I think that the result looks correct. The current Spark spends extra time getting data from a cache, for two reasons. One is the complicated code path for getting the data. The other is decompression in the case of a primitive type. The new implementation (https://github.com/apache/spar
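
To check how much of the extra time comes from decompression, one could compare the same query with and without the cache while in-memory columnar compression is switched off. The sketch below is an assumption-laden illustration, not the setup used in this thread: the object name, the timed helper, the generated data and the filter are all hypothetical. The option spark.sql.inMemoryColumnarStorage.compressed is an existing Spark SQL setting (default true).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical comparison harness; names and data are illustrative only.
object CachedVsUncached {
  // Run a block and return its result together with elapsed milliseconds.
  def timed[A](body: => A): (A, Double) = {
    val t0 = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - t0) / 1e6)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-vs-uncached") // assumed app name
      .master("local[*]")            // assumed local run
      // Disable compression of cached columns to probe the decompression cost.
      .config("spark.sql.inMemoryColumnarStorage.compressed", "false")
      .getOrCreate()

    // Stand-in data with primitive (long) columns.
    val df = spark.range(0L, 5000000L).selectExpr("id", "id % 100 AS bucket")

    // Query without the cache.
    val (_, uncachedMs) = timed(df.filter("bucket = 7").collect())

    // Pre-cache the DataFrame, then run the same query against the cache.
    df.cache()
    df.count()
    val (_, cachedMs) = timed(df.filter("bucket = 7").collect())

    println(f"uncached: $uncachedMs%.1f ms, cached: $cachedMs%.1f ms")
    spark.stop()
  }
}
```

Running this once with the option set to "false" and once with "true" gives a rough idea of the decompression share. The first query also pays JVM warm-up cost, so each measurement should be repeated several times before drawing conclusions.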