[jira] [Commented] (HIVE-6157) Fetching column stats slower than the 101 during rush hour

Sergey Shelukhin (JIRA) Fri, 17 Jan 2014 12:21:00 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875227#comment-13875227
 ]


Sergey Shelukhin commented on HIVE-6157:
----------------------------------------

OK, this took rather longer than expected. Initially I tried to make stat 
fetching part of partition pruning; that can be added later as an extra 
optimization if necessary, but it requires too many API changes all over the 
place.
The alternative is simple: the calls that get stats are now all batched. The 
new APIs on thrift use a req/resp pattern; requests contain the db, table, 
column list, and partition list (for partitions). The response contains 
whatever stats could be found (rather than the full list with some nulls, like 
the old APIs that built lists using individual calls to the metastore). The 
code then uses this. 
On the metastore side there are both a JDO and a SQL path, for speed.
I also cleaned up some stuff in StatOptimizer and StatsUtil that was generally 
suboptimal.
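
A rough sketch of the difference between one call per column per partition and 
the batched req/resp shape described above. All names here (StatsRequest, 
StatsResponse, FakeMetastore, getOne, getBatch) are made up for illustration 
and are not the actual thrift API from the patch; an in-memory map stands in 
for the metastore, and a single long stands in for a column's stats.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchedStatsSketch {

    // Hypothetical request: db, table, column list, partition list.
    static final class StatsRequest {
        final String db, table;
        final List<String> columns, partitions;
        StatsRequest(String db, String table, List<String> columns, List<String> partitions) {
            this.db = db; this.table = table; this.columns = columns; this.partitions = partitions;
        }
    }

    // Hypothetical response: partition -> (column -> stat), holding only what was found.
    static final class StatsResponse {
        final Map<String, Map<String, Long>> found = new HashMap<>();
    }

    // In-memory stand-in for the metastore; counts round trips.
    static final class FakeMetastore {
        private final Map<String, Long> stored = new HashMap<>();
        int roundTrips = 0;

        void put(String partition, String column, long value) {
            stored.put(partition + "/" + column, value);
        }

        // Per-column shape: one call per (partition, column); null when nothing is stored.
        Long getOne(String partition, String column) {
            roundTrips++;
            return stored.get(partition + "/" + column);
        }

        // Batched shape: one call per request; missing entries are simply omitted.
        // (db and table are ignored here because the fake store holds a single table.)
        StatsResponse getBatch(StatsRequest req) {
            roundTrips++;
            StatsResponse resp = new StatsResponse();
            for (String p : req.partitions) {
                for (String c : req.columns) {
                    Long v = stored.get(p + "/" + c);
                    if (v != null) {
                        resp.found.computeIfAbsent(p, k -> new HashMap<>()).put(c, v);
                    }
                }
            }
            return resp;
        }
    }

    public static void main(String[] args) {
        FakeMetastore ms = new FakeMetastore();
        ms.put("ds=1", "a", 10L);
        ms.put("ds=1", "b", 20L);
        ms.put("ds=2", "a", 30L); // ds=2/b deliberately has no stats

        List<String> parts = List.of("ds=1", "ds=2");
        List<String> cols = List.of("a", "b");

        // Per-column shape: the caller builds the list itself, nulls included.
        List<Long> perColumn = new ArrayList<>();
        for (String p : parts) {
            for (String c : cols) {
                perColumn.add(ms.getOne(p, c));
            }
        }
        System.out.println("per-column results: " + perColumn);
        System.out.println("per-column round trips: " + ms.roundTrips);

        ms.roundTrips = 0;
        StatsResponse resp = ms.getBatch(new StatsRequest("default", "t", cols, parts));
        System.out.println("batched results: " + resp.found);
        System.out.println("batched round trips: " + ms.roundTrips);
    }
}
```

With the setup from the description (4000 partitions, 24 columns), one call per 
column per partition would come to 4000 x 24 = 96,000 metastore calls, which is 
what the batching is meant to avoid.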

> Fetching column stats slower than the 101 during rush hour
> ----------------------------------------------------------
>
>                 Key: HIVE-6157
>                 URL: https://issues.apache.org/jira/browse/HIVE-6157
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Gunther Hagleitner
>            Assignee: Sergey Shelukhin
>
> "hive.stats.fetch.column.stats" controls whether the column stats for a table 
> are fetched during explain (in Tez: during query planning). On my setup (1 
> table, 4000 partitions, 24 columns) the time spent in semantic analysis goes 
> from ~1 second to ~66 seconds when turning the flag on. 65 seconds spent 
> fetching column stats...
> The reason is probably that the APIs force you to make separate metastore 
> calls for each column in each partition. That's probably the first thing that 
> has to change. The question is if in addition to that we need to cache this 
> in the client or store the stats as a single blob in the database to further 
> cut down on the time. However, the way it stands right now column stats seem 
> unusable.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
