> It appears to me that Spark does not rely on statistics that are
>collected by Hive on say ORC tables.
> It seems that Spark uses its own optimization to query the Hive tables
>irrespective of what Hive has collected by way of statistics, etc.?

Spark does not have a cost-based optimizer yet - please follow this JIRA,
which suggests that one is planned for the future.

<https://issues.apache.org/jira/browse/SPARK-16026>


> CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256
>BUCKETS
...
> Table is sorted in the order of prod_id, cust_id,time_id, channel_id and
>promo_id. It has 22 million rows.

No, it is not.

Due to whatever backwards-compatibility brain-damage inherited from Hive-1,
CLUSTERED BY does *NOT* actually cluster the data at all.

Add at least

SORTED BY (PROD_ID)

if what you care about is scanning performance with the ORC indexes.
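As a sketch, the DDL would look something like this (table name and column
types are assumptions here, since the original statement only showed the
CLUSTERED BY clause):

```sql
-- Hypothetical DDL sketch: bucketing alone does not order rows within
-- buckets; SORTED BY is what makes the ORC min/max row-group indexes
-- selective for scans filtered on prod_id.
CREATE TABLE sales_sorted (
  prod_id    BIGINT,
  cust_id    BIGINT,
  time_id    TIMESTAMP,
  channel_id BIGINT,
  promo_id   BIGINT
)
CLUSTERED BY (prod_id, cust_id, time_id, channel_id, promo_id)
SORTED BY (prod_id) INTO 256 BUCKETS
STORED AS ORC;
```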


> And Hive on Spark returns the same 24 rows in 30 seconds

That sounds slow for 22 million rows. That should be a 5-6 second query in
Hive on a single 16-core box.

Is this a build from source? Does the build have log4j 1.x with INFO/DEBUG
logging enabled?

> Assuming that the time taken will be optimization time + query time then
>it appears that in most cases the optimization time does not really make
>that impact on the overall performance?

The optimizer's impact is most felt when you have 3+ joins - computing the
join order, filter transitivity, etc.

In this case, all the optimizer does is simplify predicates.
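To illustrate what filter transitivity means (a made-up example, not a
query from this thread):

```sql
-- With transitive predicate inference, the filter written on a.id can
-- also be inferred for b.id through the equi-join condition, so both
-- tables get filtered before the join:
SELECT *
FROM a JOIN b ON a.id = b.id
WHERE a.id = 42;
-- effectively planned as: (a WHERE id = 42) JOIN (b WHERE id = 42)
```

For a single-table scan like the one discussed here, there is nothing of
that sort to infer, which is why the optimization time barely matters.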

Cheers,
Gopal
