The option "spark.sql.hive.metastorePartitionPruning=true" will not work unless you have a partition-column predicate in your query. Your query "select * from temp.log" does not have one, so the slowdown appears to be due to the need to load all partition metadata.
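A minimal sketch of the difference (assumes a Spark 1.5 HiveContext bound to `sqlContext`, and the `temp.log` table from this thread with partition columns `dt` and `test_id`; the literal values in the where clause are made up for illustration):

```scala
// Enable metastore-side partition pruning (only helps when the query
// carries a predicate on the partition columns).
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

// No partition predicate: metadata for all 1825 partitions is loaded.
val all = sqlContext.sql("select * from temp.log")

// Predicate on the partition columns: the metastore lookup is pruned
// to just the matching partitions.
val pruned = sqlContext.sql(
  "select * from temp.log where dt = '2015-12-11' and test_id = 1")
```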
Have you also tried Michael's temp-table suggestion to cache the expensive
partition lookup? (re-quoted below)

"""
If you run sqlContext.table("...").registerTempTable("...") that temp table
will cache the lookup of partitions [the first time is slow, but subsequent
lookups will be faster].
"""
- X-Ref: Permalink
<https://issues.apache.org/jira/browse/SPARK-6910?focusedCommentId=14529666&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14529666>

Also, do you absolutely need to use "select * from temp.log"? Adding a where
clause with a partition condition will help Spark prune the request to just
the required partitions (vs. all of them, which is proving expensive).

On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan <nlip...@gmail.com> wrote:

> Hi Michael,
>
> We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5 since we are
> on Cloudera), and Parquet-formatted tables. I turned on
> spark.sql.hive.metastorePartitionPruning=true, but DataFrame creation
> still takes a long time.
> Is there any other configuration to consider?
>
> Thanks a lot for your help,
>
> Isabelle
>
> On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> If you run sqlContext.table("...").registerTempTable("...") that
>> temp table will cache the lookup of partitions.
>>
>> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> Thanks a lot for your reply.
>>>
>>> This table is stored as a text file with tab-delimited columns.
>>>
>>> You are correct, the problem is that my table has too many partitions
>>> (1825 in total). Since I am on Spark 1.4, I think I am hitting SPARK-6984
>>> <https://issues.apache.org/jira/browse/SPARK-6984>.
>>>
>>> Not sure when my company can move to 1.5. Would you know of a workaround
>>> for this bug?
>>> If I cannot find a workaround, I will have to change our schema design
>>> to reduce the number of partitions.
>>>
>>> Thanks,
>>>
>>> Isabelle
>>>
>>> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>> Also, do you mean two partitions or two partition columns? If there
>>>> are many partitions it can be much slower. In Spark 1.5 I'd consider
>>>> setting spark.sql.hive.metastorePartitionPruning=true if you have
>>>> predicates over the partition columns.
>>>>
>>>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com>
>>>> wrote:
>>>>
>>>>> What format is this table? For Parquet and other optimized formats we
>>>>> cache a bunch of file metadata on first access to make interactive
>>>>> queries faster.
>>>>>
>>>>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am using Spark SQL to query some Hive tables. Most of the time, when
>>>>>> I create a DataFrame using the sqlContext.sql("select * from table")
>>>>>> command, DataFrame creation takes less than 0.5 seconds.
>>>>>> But I have this one table with which it takes almost 12 seconds!
>>>>>>
>>>>>> scala> val start = scala.compat.Platform.currentTime; val logs =
>>>>>> sqlContext.sql("select * from temp.log"); val execution =
>>>>>> scala.compat.Platform.currentTime - start
>>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>>>>>> temp.log
>>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>>>>> start: Long = 1441336022731
>>>>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>>>>>> log_time: string, tag: string, dt: string, test_id: int]
>>>>>> execution: Long = *11567*
>>>>>>
>>>>>> This table has 3.6 B rows, and 2 partitions (on dt and test_id
>>>>>> columns).
>>>>>> I have created DataFrames on even larger tables and do not see such a
>>>>>> delay.
>>>>>> So my questions are:
>>>>>> - What can impact DataFrame creation time?
>>>>>> - Is it related to the table partitions?
>>>>>>
>>>>>> Thanks much for your help!
>>>>>>
>>>>>> Isabelle
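
Michael's temp-table workaround, as a sketch for readers following the thread (assumes a HiveContext bound to `sqlContext`; the temp-table name `log_cached` is made up for illustration):

```scala
// First access pays the partition-lookup cost once; subsequent queries
// against the registered temp table reuse the cached partition lookup
// instead of hitting the metastore again.
sqlContext.table("temp.log").registerTempTable("log_cached")

val logs = sqlContext.sql("select * from log_cached")  // slow the first time
val again = sqlContext.sql("select * from log_cached") // faster thereafter
```

Note this caches the partition *lookup*, not the data itself; it works around the slow DataFrame creation reported above on Spark 1.4/1.5.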