> It appears to me that Spark does not rely on statistics that are
> collected by Hive on, say, ORC tables.
> It seems that Spark uses its own optimization to query the Hive tables
> irrespective of what Hive has collected by way of statistics etc.?

Spark does not have a cost-based optimizer yet - please follow this JIRA,
which suggests that it is planned for the future:

<https://issues.apache.org/jira/browse/SPARK-16026>

> CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256
> BUCKETS
...
> Table is sorted in the order of prod_id, cust_id, time_id, channel_id
> and promo_id. It has 22 million rows.

No, it is not. Due to whatever backwards-compatibility brain-damage of
Hive-1, CLUSTERED BY does *not* CLUSTER at all.

Add at least SORTED BY (PROD_ID) if what you care about is scanning
performance with the ORC indexes.
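Something along these lines - a sketch only, since the table name, column
types and storage clause below are my guesses; only the column list and
bucket count come from your DDL:

    -- bucketed *and* sorted: SORTED BY is what actually orders the rows
    -- within each bucket, which is what the ORC min/max indexes need
    CREATE TABLE sales (
      prod_id     BIGINT,
      cust_id     BIGINT,
      time_id     TIMESTAMP,
      channel_id  BIGINT,
      promo_id    BIGINT
    )
    CLUSTERED BY (prod_id, cust_id, time_id, channel_id, promo_id)
    SORTED BY (prod_id)
    INTO 256 BUCKETS
    STORED AS ORC;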
> And Hive on Spark returns the same 24 rows in 30 seconds

That sounds slow for 22 million rows. That should be a 5-6 second query in
Hive on a single 16-core box. Is this a build from source? Has the build
got log4j 1.x with INFO/DEBUG logging enabled? (See the P.S. below for a
way to turn that down.)

> Assuming that the time taken will be optimization time + query time,
> then it appears that in most cases the optimization time does not really
> make that much impact on the overall performance?

The optimizer's impact is felt most when you have 3+ joins - computing
join order, filter transitivity etc. In this case, all the optimizer does
is simplify predicates (a tiny example of filter transitivity is in the
P.P.S. below).

Cheers,
Gopal
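P.S. On the logging question: if the build ships with log4j 1.x at INFO or
DEBUG, per-record logging alone can account for a large chunk of those 30
seconds. A sketch of the usual fix - I'm assuming a stock
hive-log4j.properties layout here, so treat the exact file and property
name as illustrative:

    # conf/hive-log4j.properties: quiet the root logger down to warnings
    hive.root.logger=WARN,console

or, for a single session without editing any files:

    hive --hiveconf hive.root.logger=WARN,console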
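P.P.S. A tiny example of the filter-transitivity point, with made-up table
and column names:

    SELECT s.amount
    FROM   sales s
    JOIN   customers c ON s.cust_id = c.cust_id
    WHERE  c.cust_id = 42;

    -- an optimizer can infer s.cust_id = 42 from the join condition and
    -- push that filter down into the sales scan as well; with 3+ joins it
    -- additionally has to pick a join order, which is where the
    -- optimization time starts to pay for itself.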