Re: Review Request 24688: parallel order by clause on a string column fails with IOException: Split points are out of order

Navis Ryu Mon, 25 Aug 2014 19:32:07 -0700


> On Aug. 25, 2014, 7:08 p.m., Szehon Ho wrote:
> > Looks like an important bug to fix, but I dont know too much about this 
> > code, can you explain what is the bug in the getPartitionKey algorithm, and 
> > what is the fix?  Like why we need to alter the stepSize as we iterate.  Is 
> > there a test we can add for this as well to illustrate and validate the fix?
> > 
> > Also my confusion is if the other fixes on the patch are related?
> > 
> > 1.  Adding setConf on the HiveTotalOrderPartitioner is related to the bug?
> > 2.  What is the use of the new HiveConf "..min.reducer"?  My guess is you 
> > found the algorithm not generating enough partitionKey sometimes, can you 
> > explain?


If sampled data is like "10 10 10 10 10 20" with 3 reducers, stepSize=2 (6/3), 
which makes datum[2], datum[4] as partition key. But beacuse these two are same 
value(10), TotalOrderPartitioner throws exception. I'll try to make a proper 
test case for this.

The code part 
```
while (last >= k && C.compare(sorted[last], sorted[k]) == 0) {
  ++;
}
```      
Intended to increase index if previous key is same with current key to make a 
difference, but it does not work because "last >= k" is always true when i > 1.

bq. Adding setConf on the HiveTotalOrderPartitioner is related to the bug?

As commented, it's temporary fix for TEZ-1403, which is a bug not providing 
JobConf instance to Partitioners.

bq. What is the use of the new HiveConf "..min.reducer"?

If sampled data does not contain more unique values than the number of reducer, 
it's possible to return smaller number of partitions than expected (in current 
implementation as you thought). It can be bad if someone wants 100 reducer 
running but it's running only 4 reducer.


> On Aug. 25, 2014, 7:08 p.m., Szehon Ho wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 1038
> > <https://reviews.apache.org/r/24688/diff/1/?file=660082#file660082line1038>
> >
> >     If this needs to be exposed, should be worded better.  Something like:
> >     
> >     name = "hive.optimize.sampling.orderby.min.reducer.ratio"
> >     
> >     "If sampling is enabled, this is the minimum ratio allowed of reducers 
> > calculated by sampling to expected number of reducers".
> >     
> >     Its might be confusing to user in my opinion, as the user has little 
> > control of what the expected reducer is, right?

Good!


- Navis


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24688/#review51426
-----------------------------------------------------------


On Aug. 14, 2014, 2:29 a.m., Navis Ryu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24688/
> -----------------------------------------------------------
> 
> (Updated Aug. 14, 2014, 2:29 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-7669
>     https://issues.apache.org/jira/browse/HIVE-7669
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The source table has 600 Million rows and it has a String column 
> "l_shipinstruct" which has 4 unique values. (Ie. these 4 values are repeated 
> across the 600 million rows)
> 
> We are sorting it based on this string column "l_shipinstruct" as shown in 
> the below HiveQL with the following parameters. 
> {code:sql}
> set hive.optimize.sampling.orderby=true;
> set hive.optimize.sampling.orderby.number=10000000;
> set hive.optimize.sampling.orderby.percent=0.1f;
> 
> insert overwrite table lineitem_temp_report 
> select 
>   l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, 
> l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, 
> l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
> from 
>   lineitem
> order by l_shipinstruct;
> {code}
> Stack Trace
> Diagnostic Messages for this Task:
> {noformat}
> Error: java.lang.RuntimeException: Error in configuring object
>         at 
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
>         at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
>         at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>         at 
> org.apache.hadoop.mapred.MapTask$OldOutputCollector.<init>(MapTask.java:569)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at 
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
>         ... 10 more
> Caused by: java.lang.IllegalArgumentException: Can't read partitions file
>         at 
> org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)
>         at 
> org.apache.hadoop.mapred.lib.TotalOrderPartitioner.configure(TotalOrderPartitioner.java:42)
>         at 
> org.apache.hadoop.hive.ql.exec.HiveTotalOrderPartitioner.configure(HiveTotalOrderPartitioner.java:37)
>         ... 15 more
> Caused by: java.io.IOException: Split points are out of order
>         at 
> org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:96)
>         ... 17 more
> {noformat}
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java af9e198 
>   common/src/java/org/apache/hadoop/hive/conf/Validator.java cea9c41 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/HiveTotalOrderPartitioner.java 
> 6c22362 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java 166461a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecDriver.java ef72039 
> 
> Diff: https://reviews.apache.org/r/24688/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Navis Ryu
> 
>

Re: Review Request 24688: parallel order by clause on a string column fails with IOException: Split points are out of order

Reply via email to