Hi,

I'm running a simple SQL query of the following form over a ~700-million-row table:

SELECT * FROM my_table WHERE id = '12345';

When I submit the query via beeline and the JDBC Thrift server, it returns
in 35s. When I submit the exact same query using Spark SQL from a pyspark
shell (sqlContext.sql("SELECT * FROM ....")), it returns in 3s.
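
For concreteness, here is roughly how I'm submitting the query in each case
(the host and port are placeholders, not my actual environment):

# via beeline / the JDBC Thrift server:
beeline -u jdbc:hive2://thrift-host:10000 -e "SELECT * FROM my_table WHERE id = '12345';"

# via the pyspark shell:
sqlContext.sql("SELECT * FROM my_table WHERE id = '12345'").collect()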

Both times are taken from the Spark web UI. The query returns only 43 rows,
a small amount of data.

The table was created by saving a Spark SQL DataFrame as a Parquet file and
then calling createExternalTable.
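
In sketch form, the creation steps looked like the following (the paths and
the toy DataFrame are placeholders, not the real ~700M-row data):

from pyspark import SparkContext
from pyspark.sql import HiveContext, Row

sc = SparkContext(appName="create-my-table")
sqlContext = HiveContext(sc)

# Stand-in for the real DataFrame; the actual one has ~700M rows.
df = sqlContext.createDataFrame([Row(id='12345', value=1)])

# Save as Parquet, then register it as an external table so it is
# visible to both the Thrift server and other shells.
df.write.parquet("hdfs:///data/my_table")
sqlContext.createExternalTable("my_table", "hdfs:///data/my_table", "parquet")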

I have tried to ensure that all relevant cluster parameters are equivalent
across the two queries:
spark.executor.memory = 6g
spark.executor.instances = 100
no explicit caching (storage tab in web UI is empty)
Spark version: 1.4.1
Hadoop v2.5.0-cdh5.3.0, running Spark on top of YARN
jobs run on the same physical cluster (on-site hardware)
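
For reference, I start the Thrift server roughly like this (reconstructed
from memory, not the exact command from my launch script):

./sbin/start-thriftserver.sh \
  --master yarn-client \
  --executor-memory 6g \
  --num-executors 100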

From the web UIs, I can see that the query plans are clearly different, and
I think this may be the source of the performance difference.

Thrift server job:
1 stage only, stage 1 (35s): map -> Filter -> mapPartitions

Spark SQL job:
2 stages, stage 1 (2s): map -> filter -> Project -> Aggregate -> Exchange,
stage 2 (0.4s): Exchange -> Aggregate -> mapPartitions
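
In case it helps anyone reproduce this, the plans can also be dumped without
the web UI (I've omitted the output here for brevity):

-- from beeline
EXPLAIN SELECT * FROM my_table WHERE id = '12345';

# from pyspark
sqlContext.sql("SELECT * FROM my_table WHERE id = '12345'").explain()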

Is this a known issue? Is there anything I can do to get the Thrift server
to use the same query optimizer as the one used by Spark SQL? I'd love to
pick up a ~10x performance gain for my jobs submitted via the Thrift server.

Best regards,

Jeff
