Hi,
I'm running a simple SQL query over a ~700 million row table of the form:
SELECT * FROM my_table WHERE id = '12345';
When I submit the query via beeline and the JDBC Thrift server, it returns in
35s.
When I submit the exact same query using sparkSQL from a pyspark shell
(sqlContext.sql("SELECT * FROM ....")), it returns in 3s.
Both times are obtained from the spark web UI. The query only returns 43
rows, a small amount of data.
The table was created by saving a sparkSQL dataframe as a parquet file and
then calling createExternalTable.
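A minimal sketch of that setup, assuming pyspark's DataFrame.write.parquet and
SQLContext.createExternalTable. The helper name and its arguments are
placeholders for illustration, not the actual values used here:

```python
def save_and_register(sql_context, df, table_name, path):
    """Write `df` as parquet under `path`, then register it as an external table.

    `sql_context` is assumed to be the SQLContext/HiveContext from the pyspark
    shell; `table_name` and `path` are hypothetical placeholders.
    """
    # Write the DataFrame out as parquet files under `path`.
    df.write.parquet(path)
    # Point an external table at the files that were just written.
    return sql_context.createExternalTable(table_name, path=path, source="parquet")
```

In the shell this would be called as something like
save_and_register(sqlContext, df, "my_table", "/some/hdfs/path").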
I have tried to ensure that all relevant cluster parameters are equivalent
across the two queries:
spark.executor.memory = 6g
spark.executor.instances = 100
no explicit caching (storage tab in web UI is empty)
spark version: 1.4.1
Hadoop v2.5.0-cdh5.3.0, running spark on top of YARN
jobs run on the same physical cluster (on-site hardware)
From the web UIs, I can see that the query plans are clearly different, and
I think this may be the source of the performance difference.
Thrift server job:
1 stage only, stage 1 (35s) map -> Filter -> mapPartitions
SparkSQL job:
2 stages, stage 1 (2s): map -> filter -> Project -> Aggregate -> Exchange,
stage 2 (0.4s): Exchange -> Aggregate -> mapPartitions
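The plans above are read off the web UI. A rough sketch of how the pyspark
side could be dumped as text for a closer comparison, assuming the shell's
sqlContext and DataFrame.explain; the helper name print_plans is hypothetical.
On the beeline side, prefixing the same statement with EXPLAIN EXTENDED should
give the corresponding output, though I have not confirmed that the two
formats line up exactly:

```python
def print_plans(sql_context, query):
    """Print the logical and physical plans Spark builds for `query`.

    `sql_context` is assumed to be the SQLContext/HiveContext from the
    pyspark shell. Returns the DataFrame so the query can still be run.
    """
    df = sql_context.sql(query)
    # explain(True) prints the extended plans (logical and physical) to stdout.
    df.explain(True)
    return df
```

For example: print_plans(sqlContext, "SELECT * FROM my_table WHERE id = '12345'").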
Is this a known issue? Is there anything I can do to get the Thrift server
to use the same query optimizer as the one used by sparkSQL? I'd love to
pick up a ~10x performance gain for my jobs submitted via the Thrift server.
Best regards,
Jeff