Setting "spark.sql.hive.metastorePartitionPruning=true" will have no effect
unless your query contains a predicate on a partition column. Your query
"select * from temp.log" does not. The slowdown appears to come from
loading the metadata for every partition.

Have you also tried to see if Michael's temp-table suggestion helps you
cache the expensive partition lookup? (re-quoted below)

"""
If you run sqlContext.table("...").registerTempTable("...") that temptable
will cache the lookup of partitions [the first time is slow, but subsequent
lookups will be faster].
""" - X-Ref: Permalink
<https://issues.apache.org/jira/browse/SPARK-6910?focusedCommentId=14529666&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14529666>
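In Spark 1.x that pattern looks roughly like this (a sketch, assuming a sqlContext backed by a HiveContext and the temp.log table from this thread; registerTempTable is the 1.x API, later renamed createOrReplaceTempView):

```scala
// First access loads and caches the partition metadata -- this call is
// slow once, on the initial lookup.
sqlContext.table("temp.log").registerTempTable("log_cached")

// Subsequent queries against the temp table reuse the cached lookup
// instead of hitting the metastore again.
val logs = sqlContext.sql("select * from log_cached where dt = '2015-09-04'")
```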

Also, do you absolutely need "select * from temp.log"? Adding a WHERE
clause with a partition condition will let Spark prune the request down to
just the required partitions (instead of all of them, which is proving
expensive).
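For example (a sketch against the temp.log table above; the dt value is hypothetical):

```scala
// A predicate on the partition column dt lets the metastore prune the
// lookup to the matching partition(s) instead of listing all ~1825.
val logs = sqlContext.sql(
  "select * from temp.log where dt = '2015-09-04' and test_id = 1")
```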

On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan <nlip...@gmail.com> wrote:

> Hi Michael,
>
> We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5 since we are
> on cloudera), and Parquet formatted tables. I turned on
> spark.sql.hive.metastorePartitionPruning=true, but DataFrame creation
> still takes a long time.
> Is there any other configuration to consider?
>
>
> Thanks a lot for your help,
>
> Isabelle
>
> On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> If you run sqlContext.table("...").registerTempTable("...") that
>> temptable will cache the lookup of partitions.
>>
>> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> Thanks a lot for your reply.
>>>
>>> This table is stored as text file with tab delimited columns.
>>>
>>> You are correct, the problem is that my table has too many partitions
>>> (1825 in total). Since I am on Spark 1.4, I think I am hitting bug 6984
>>> <https://issues.apache.org/jira/browse/SPARK-6984>.
>>>
>>> Not sure when my company can move to 1.5. Would you know some workaround
>>> for this bug?
>>> If I cannot find workaround for this, will have to change our schema
>>> design to reduce number of partitions.
>>>
>>>
>>> Thanks,
>>>
>>> Isabelle
>>>
>>>
>>>
>>> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Also, do you mean two partitions or two partition columns?  If there
>>>> are many partitions it can be much slower.  In Spark 1.5 I'd consider
>>>> setting spark.sql.hive.metastorePartitionPruning=true if you have
>>>> predicates over the partition columns.
>>>>
>>>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> What format is this table.  For parquet and other optimized formats we
>>>>> cache a bunch of file metadata on first access to make interactive queries
>>>>> faster.
>>>>>
>>>>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am using SparkSQL to query some Hive tables. Most of the time, when
>>>>>> I create a DataFrame using sqlContext.sql("select * from table") command,
>>>>>> DataFrame creation is less than 0.5 second.
>>>>>> But I have this one table with which it takes almost 12 seconds!
>>>>>>
>>>>>> scala>  val start = scala.compat.Platform.currentTime; val logs =
>>>>>> sqlContext.sql("select * from temp.log"); val execution =
>>>>>> scala.compat.Platform.currentTime - start
>>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>>>>>> temp.log
>>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>>>>> start: Long = 1441336022731
>>>>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>>>>>> log_time: string, tag: string, dt: string, test_id: int]
>>>>>> execution: Long = 11567
>>>>>>
>>>>>> This table has 3.6 B rows, and 2 partitions (on dt and test_id
>>>>>> columns).
>>>>>> I have created DataFrames on even larger tables and do not see such
>>>>>> delay.
>>>>>> So my questions are:
>>>>>> - What can impact DataFrame creation time?
>>>>>> - Is it related to the table partitions?
>>>>>>
>>>>>>
>>>>>> Thanks much for your help!
>>>>>>
>>>>>> Isabelle
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
