Thanks everybody for the advice on this.
I attached YourKit and found that the CPU time split was about 70% in
Parquet/LZO reading and 30% applying the filter predicate. I guess those
are reasonable things for it to be spending time on, and so it really could
just be a case of needing more hardware.
64GB in Parquet could be many billions of rows because of the columnar
compression. And count distinct by itself is an expensive operation. This
is not just on Spark; even on Presto/Impala you would see performance dip
with count distincts. And the cluster is not that powerful either.
The one issue ...
Thanks for the suggestions everyone, appreciate the advice.
I tried replacing the DISTINCT with the nested GROUP BY, running on 1.4 instead
of 1.3, replacing the date casts with a "between" operation on the
corresponding long constants, and changing COUNT(*) to COUNT(1).
None of these seem to have ...
Hi James,
Maybe it's the DISTINCT causing the issue.
I rewrote the query as follows. Maybe this one can finish faster.
select
  sum(cnt) as uses,
  count(id) as users
from (
  select
    count(*) as cnt,
    cast(id as string) as id
  from usage_events
  where
    from_unixtime(cast(timestamp_millis/1000 as bigint))
      between '2015-06-09' and '2015-06-16'
  group by cast(id as string)
) t
Generally (not only Spark SQL specific) you should not cast in the WHERE
part of a SQL query. It is also not necessary in your case. Getting rid of
casts in the whole query will also be beneficial.
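To illustrate, a cast-free version of the query could compare the raw long column against precomputed bounds (a sketch only; the constants below assume the bounds are the UTC midnights of 2015-06-09 and 2015-06-16, which may not match the timezone behaviour of the original from_unixtime comparison):

select
  count(*) as uses,
  count(distinct id) as users
from usage_events
where
  timestamp_millis between 1433808000000 and 1434412800000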
On Mon, 22 Jun 2015 at 17:29, James Aley wrote:
> Hello,
>
> A colleague of mine ran the following Spark SQL query: ...
Have you tested this on a smaller data set to verify that the query is correct?
On Mon, Jun 22, 2015 at 2:59 PM, ayan guha wrote:
> You may also want to change count(*) to a specific column.
> On 23 Jun 2015 01:29, "James Aley" wrote:
>
>> Hello,
>>
>> A colleague of mine ran the following Spark SQL query: ...
You may also want to change count(*) to a specific column.
On 23 Jun 2015 01:29, "James Aley" wrote:
> Hello,
>
> A colleague of mine ran the following Spark SQL query:
>
> select
> count(*) as uses,
> count (distinct cast(id as string)) as users
> from usage_events
> where
> from_unixtime(cast(timestamp_millis/1000 as bigint))
> between '2015-06-09' and '2015-06-16'
upgrading if you are not already on 1.4.
>
> Cheers,
>
> Matthew
>
> *From:* Lior Chaga [mailto:lio...@taboola.com]
> *Sent:* 22 June 2015 17:24
> *To:* James Aley
> *Cc:* user
> *Subject:* Re: Help optimising Spark SQL query
>
this issue, so might be worth upgrading if you are not already on 1.4.
Cheers,
Matthew
*From:* Lior Chaga [mailto:lio...@taboola.com]
*Sent:* 22 June 2015 17:24
*To:* James Aley
*Cc:* user
*Subject:* Re: Help optimising Spark SQL query
Hi James,
There are a few configurations that you can try: ...
Hi James,
There are a few configurations that you can try:
https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
From my experience, the codegen really boosts things up. Just run
sqlContext.sql("SET spark.sql.codegen=true") before you execute your query. But
keep in mind ...
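As a minimal sketch of applying that setting from Scala before running the query (this assumes a Spark 1.3/1.4-era SQLContext named sqlContext and uses the query from the original message below):

// Enable code generation for expression evaluation (a Spark SQL 1.3/1.4 option).
sqlContext.setConf("spark.sql.codegen", "true")
// Equivalent via SQL: sqlContext.sql("SET spark.sql.codegen=true")

// Run the original aggregation once the setting is in place.
val result = sqlContext.sql("""
  select
    count(*) as uses,
    count(distinct cast(id as string)) as users
  from usage_events
  where
    from_unixtime(cast(timestamp_millis/1000 as bigint))
      between '2015-06-09' and '2015-06-16'
""")
result.show()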
Hello,
A colleague of mine ran the following Spark SQL query:
select
  count(*) as uses,
  count (distinct cast(id as string)) as users
from usage_events
where
  from_unixtime(cast(timestamp_millis/1000 as bigint))
    between '2015-06-09' and '2015-06-16'
The table contains billions of rows, but to ...
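For comparison, a rough DataFrame-API sketch of the same aggregation (assumes Spark 1.4, a SQLContext named sqlContext, the usage_events table already registered, and UTC-midnight long bounds in place of the from_unixtime comparison, which may shift results if the data is not keyed to UTC):

import org.apache.spark.sql.functions.{count, countDistinct, lit}

val df = sqlContext.table("usage_events")
val result = df
  // Filter on the raw epoch-millis column; bounds are 2015-06-09 and 2015-06-16 at 00:00 UTC.
  .filter(df("timestamp_millis").between(1433808000000L, 1434412800000L))
  // count(lit(1)) counts all surviving rows; countDistinct counts unique ids.
  // The cast(id as string) from the SQL is dropped; it is not needed just to count distinct values.
  .agg(count(lit(1)).as("uses"), countDistinct(df("id")).as("users"))
result.show()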