Hi, I'm running a simple SQL query over a ~700 million row table of the form:

    SELECT * FROM my_table WHERE id = '12345';

When I submit the query via beeline and the JDBC Thrift server, it returns in 35s. When I submit the exact same query using Spark SQL from a pyspark shell (sqlContext.sql("SELECT * FROM ...")), it returns in 3s. Both times are taken from the Spark web UI. The query returns only 43 rows, a small amount of data.

The table was created by saving a Spark SQL dataframe as a parquet file and then calling createExternalTable.

I have tried to ensure that all relevant cluster parameters are equivalent across the two queries:

- spark.executor.memory = 6g
- spark.executor.instances = 100
- no explicit caching (the Storage tab in the web UI is empty)
- Spark version: 1.4.1
- Hadoop v2.5.0-cdh5.3.0, running Spark on top of YARN
- jobs run on the same physical cluster (on-site hardware)

From the web UIs, I can see that the query plans are clearly different, and I think this may be the source of the performance difference:

Thrift server job, 1 stage only:
- stage 1 (35s): map -> Filter -> mapPartitions

Spark SQL job, 2 stages:
- stage 1 (2s): map -> filter -> Project -> Aggregate -> Exchange
- stage 2 (0.4s): Exchange -> Aggregate -> mapPartitions

Is this a known issue? Is there anything I can do to get the Thrift server to use the same query optimizer as the one used by Spark SQL? I'd love to pick up a ~10x performance gain for my jobs submitted via the Thrift server.

Best regards,
Jeff
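
P.S. In case a repro sketch helps, this is roughly how the table was created and how the fast (3s) query is submitted from the pyspark shell. The HDFS path here is a placeholder, not the real one:

    # pyspark shell, Spark 1.4.1 (sqlContext is created automatically)
    # "df" is the dataframe described above; the parquet path is a placeholder
    df.write.parquet("hdfs:///path/to/my_table")
    sqlContext.createExternalTable("my_table", "hdfs:///path/to/my_table")

    # the query that runs in ~3s:
    rows = sqlContext.sql("SELECT * FROM my_table WHERE id = '12345'").collect()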
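And the slow (35s) path, connecting beeline to the JDBC Thrift server (host and port stand in for our cluster's actual values):

    beeline -u jdbc:hive2://thrift-server-host:10000
    0: jdbc:hive2://thrift-server-host:10000> SELECT * FROM my_table WHERE id = '12345';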