Thanks for the tip, Gopal. I documented hive.limit.pushdown.memory.usage <https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.limit.pushdown.memory.usage> in the Configuration Properties wiki but had a couple of questions about the description (see the comment on HIVE-3562 <https://issues.apache.org/jira/browse/HIVE-3562?focusedCommentId=14392243&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14392243> ).
-- Lefty

On Mon, Mar 30, 2015 at 12:42 AM, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> Hi,
>
> >Been experimenting a little with vectorized execution in hive 0.13 and
> >found that group-by is super slow on string columns. This simple query is
> >13x slower when vectorization is enabled (c_customer_id is string). Don't
> >see this problem with int types.
>
> I think the performance issue is due to the row-count triggers for
> flushing the in-memory aggregations.
>
> This shouldn't happen to you in the hive-1.0 branch, but for 0.13 there is
> a fairly easy workaround to the performance issue.
>
> >select c_customer_id from customer group by c_customer_id limit 10;
>
> A very odd query that one, since it is one of the few patterns which
> speeds up with an extra ORDER BY.
>
> select c_customer_id from customer group by c_customer_id order by
> c_customer_id limit 10;
>
> tends to run faster than regular group-by + fetch limit as it shuffles
> less data (10 keys per map task).
>
> Try the same with
>
> set hive.vectorized.groupby.checkinterval=1024;
> set hive.vectorized.groupby.flush.percent=0.8;
> set hive.limit.pushdown.memory.usage=0.04;
>
> set hive.optimize.reducededuplication.min.reducer=1;
> # above only if you're on MRv2, in Tez the default (4) is the faster option
>
> That combination of operators should be triggering the fastest codepath.
>
> @lefty: the limit pushdown seems to be missing in docs as the Top-N memory
> size.
>
> Cheers,
> Gopal
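
For anyone finding this thread in the archive, here is a rough sketch of Gopal's suggestion collected into one Hive 0.13 session. The settings and the query are the ones quoted above. The line that turns vectorization on and the EXPLAIN at the end are my own assumptions about how one might set up and check the run, not part of his advice.

```sql
-- Assumption: vectorized execution is enabled, as in the original report.
set hive.vectorized.execution.enabled=true;

-- Settings quoted from Gopal's message.
set hive.vectorized.groupby.checkinterval=1024;
set hive.vectorized.groupby.flush.percent=0.8;
set hive.limit.pushdown.memory.usage=0.04;

-- Only on MRv2; on Tez, leave this at its default (4).
set hive.optimize.reducededuplication.min.reducer=1;

-- Assumption: inspect the plan first to see whether the limit is pushed down.
explain
select c_customer_id from customer group by c_customer_id
order by c_customer_id limit 10;

-- The ORDER BY variant, which shuffles at most 10 keys per map task.
select c_customer_id from customer group by c_customer_id
order by c_customer_id limit 10;
```

If the pushdown applies, the plan should mention the Top-N memory usage on the reduce output side, though the exact wording may differ between releases.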