[ https://issues.apache.org/jira/browse/HIVE-17677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Sherman reassigned HIVE-17677: ------------------------------------- Assignee: (was: Andrew Sherman) > Investigate using hive statistics information to optimize HoS parallel order > by > ------------------------------------------------------------------------------- > > Key: HIVE-17677 > URL: https://issues.apache.org/jira/browse/HIVE-17677 > Project: Hive > Issue Type: Improvement > Affects Versions: 3.0.0 > Reporter: Andrew Sherman > Priority: Major > > I think Spark's native parallel order by works in a similar way to what we do > for Hive-on-MR. That is, it scans the RDD once and sample the data to > determine what ranges the data should be partitioned into, and then scans the > RDD again to do the actual order by (with multiple reducers). > One optimization suggested by [~stakiar] is that if we have column stats > about the col we are ordering by, then the first scan on the RDD is not > necessary. If we have histogram data about the RDD, we already know what the > ranges of the order by should be. This should work when running parallel > order by on simple tables, will be harder when we run it on derived datasets > (although not impossible). > To do his we would have to understand more about the internals of > JavaPairRDD. -- This message was sent by Atlassian JIRA (v7.6.3#76005)