To: user@spark.apache.org
Date: 2016/10/25 17:33
Subject: Re: Spark SQL is slower when DataFrame is cache in Memory
Hi Kazuaki,
I print a debug log right before I call collect, and compare it against the
job-start log (which is available when debug logging is turned on).
Anyway, I test that in
> Best Regards,
> Kazuaki Ishizaki
>
>
>
> From: Chin Wei Low
> To: Kazuaki Ishizaki/Japan/IBM@IBMJP
> Cc: user@spark.apache.org
> Date: 2016/10/10 11:33
>
> Subject: Re: Spark SQL is slower when DataFrame is cache in Memory
>
Hi Ishizaki san,
Thanks for the reply.
So, when I pre-cache the DataFrame, the cache is used during job
execution.
Actually there are 3 events:
1. call res.collect
2. job started
3. job completed
I am concerned
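One way to quantify the gap between event 1 (calling collect) and event 2 (job started) is to register a SparkListener and compare timestamps. The sketch below is only illustrative, not from this thread; it assumes a running spark-shell session where `spark` is the SparkSession and `res` is the DataFrame discussed above:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Sketch only: capture Spark's own job-start timestamp via a listener
// and compare it with the wall-clock time just before collect() is called.
@volatile var jobStartMillis = 0L

spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    jobStartMillis = jobStart.time  // epoch millis reported by Spark
  }
})

val callMillis = System.currentTimeMillis()
res.collect()
println(s"Gap between collect() call and job start: ${jobStartMillis - callMillis} ms")
```

This measures the same interval as the debug-log comparison described earlier in the thread, without depending on log timestamps.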
> res.explain(true)
> res.collect()
>
> Have I misunderstood something?
>
> Best Regards,
> Kazuaki Ishizaki
>
>
>
> From: Chin Wei Low
> To: Kazuaki Ishizaki/Japan/IBM@IBMJP
> Cc: user@spark.apache.org
> Date: 2016/10/07 20:06
> Subject: Re: Spark SQL is slower when DataFrame is cache in Memory
Hi Ishizaki san,
So there is a gap between calling res.collect and when I see this log:
spark.SparkContext: Starting job: collect at <console>:26
What you mean is, during this time Spark has already started to get data
from the cache? Shouldn't it only get the data after the job has started and
the tasks are distributed?
Hi,
I think the result looks correct. The current Spark spends extra time
getting data from a cache, for two reasons. One is the complicated path to
get the data. The other is decompression in the case of primitive types.
The new implementation (https://github.com/apache/spar
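The extra cost described above can be observed with a rough timing experiment. The sketch below is illustrative only; it assumes an active SparkSession named `spark`, uses a made-up single-column dataset of primitive longs, and the actual numbers will vary with data, codec, and environment:

```scala
// Illustrative sketch: compare collect() latency on the same query
// before and after caching a DataFrame of primitive (Long) values.
def timed[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val out = body
  println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
  out
}

val df = spark.range(0L, 10000000L).toDF("id")  // hypothetical data
df.createOrReplaceTempView("t")

timed("uncached")(spark.sql("SELECT SUM(id) FROM t").collect())

df.cache()
df.count()  // force materialization of the in-memory columnar cache

// Re-plan the query so the cached relation is substituted in.
// Depending on the data, decompressing the cached primitive column
// can offset part of the expected speed-up from caching.
timed("cached")(spark.sql("SELECT SUM(id) FROM t").collect())
```

Because the in-memory columnar cache is compressed by default, the cached run is not guaranteed to be faster for cheap scans over primitive columns, which matches the behavior reported in this thread.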