[ https://issues.apache.org/jira/browse/HIVE-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257922#comment-14257922 ]
Navis commented on HIVE-7661: ----------------------------- [~vishal.kamath] Booked an issue for split sampling described above. Wish it's helpful. > Observed performance issues while sorting using Hive's Parallel Order by > clause while retaining pre-existing sort order. > ------------------------------------------------------------------------------------------------------------------------ > > Key: HIVE-7661 > URL: https://issues.apache.org/jira/browse/HIVE-7661 > Project: Hive > Issue Type: Bug > Components: Logical Optimizer > Affects Versions: 0.12.0 > Environment: Cloudera 5.0 > hive-0.12.0-cdh5.0.0 > Red Hat Linux > Reporter: Vishal Kamath > Labels: performance > Fix For: 0.12.1 > > > Improve Hive's sampling logic to accommodate use cases that require to retain > the pre existing sort in the underlying source table. > In order to support Parallel order by clause, Hive Samples the source table > based on values provided to hive.optimize.sampling.orderby.number and > hive.optimize.sampling.orderby.percent. > This does work with reasonable performance when sorting is performed on a > columns having random distribution of data but has severe performance issues > when retaining the sort order. > Let us try to understand this with an example. > insert overwrite table lineitem_temp_report > select > l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, > l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, > l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment > from > lineitem > order by l_orderkey, l_partkey, l_suppkey; > Sample data set for lineitem table. The first column represents the > l_orderKey and is sorted. > > l_orderkey|l_partkey|l_suppkey|l_linenumber|l_quantity|l_extendedprice|l_discount|l_tax|l_returnflag|l_linestatus|l_shipdate|l_commitdate|l_receiptdate|l_shipinstruct|l_shipmode|l_comment > 197|1771022|96040|2|8|8743.52|0.09|0.02|A|F|1995-04-17|1995-07-01|1995-0 > 197|1771022|96040|2|8|4-27|DELIVER IN PERSON|SHIP|y blithely even > 197|1771022|96040|2|8|deposits. blithely fina| > 197|1558290|83306|3|17|22919.74|0.06|0.02|N|O|1995-08-02|1995-06-23|1995 > 197|1558290|83306|3|17|-08-03|COLLECT COD|REG AIR|ts. careful| > 197|179355|29358|4|25|35858.75|0.04|0.01|N|F|1995-06-13|1995-05-23|1995- > 197|179355|29358|4|25|06-24|TAKE BACK RETURN|FOB|s-- quickly final > 197|179355|29358|4|25|accounts| > 197|414653|39658|5|14|21946.82|0.09|0.01|R|F|1995-05-08|1995-05-24|1995- > 197|414653|39658|5|14|05-12|TAKE BACK RETURN|RAIL|use slyly slyly silent > 197|414653|39658|5|14|depo| > 197|1058800|8821|6|1|1758.75|0.07|0.05|N|O|1995-07-15|1995-06-21|1995-08 > 197|1058800|8821|6|1|-11|COLLECT COD|RAIL| even, thin dependencies sno| > 198|560609|60610|1|33|55096.14|0.07|0.02|N|O|1998-01-05|1998-03-20|1998- > 198|560609|60610|1|33|01-10|TAKE BACK RETURN|TRUCK|carefully caref| > 198|152287|77289|2|20|26785.60|0.03|0.00|N|O|1998-01-15|1998-03-31|1998- > 198|152287|77289|2|20|01-25|DELIVER IN PERSON|FOB|carefully final > 198|152287|77289|2|20|escapades a| > 224|1899665|74720|3|41|68247.37|0.07|0.04|A|F|1994-09-01|1994-09-15|1994 > 224|1899665|74720|3|41|-09-02|TAKE BACK RETURN|SHIP|after the furiou| > When we try to either sort on a presorted column or do a multi-column sort > while trying to retain the sort order on the source table, > Source table "lineitem" has 600 million rows. > We don't see equal distribution of data to the reducers. Out of 100 reducers, > 99 complete in less than 40 seconds. The last reducer is doing the bulk of > the work processing nearly 570 million rows. > So, let us understand what is going wrong here .. > on a table having 600 million records with orderkey column sorted, i created > temp table with 10% sampling. > insert overwrite table sampTempTbl (select * from lineitem tablesample (10 > percent) t); > select min(l_orderkey), max(l_orderkey) from sampTempTbl ; > 12306309, 142321700 > where as on the source table, the orderkey range (select min(l_orderkey), > max(l_orderkey) from lineitem) is 1 and 600000000 > So naturally bulk of the records will be directed towards single reducer. > One way to work around this problem is to increase the > hive.optimize.sampling.orderby.number to a larger value (as close as the # > rows in the input source table). But then we will have to provide higher heap > (hive-env.sh) for hive, otherwise it will fail while creating the Sampling > Data. With larger data volume, it is not practical to sample the entire data > set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)