[ https://issues.apache.org/jira/browse/HIVE-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874880#comment-15874880 ]
Chao Sun commented on HIVE-15796:
---------------------------------

[~xuefuz]

bq. Above there is a TestSparkCliDriver test failure and I'm wondering if it's related.

The test failure is not related to this patch. I tested it locally without the patch and it still failed. I'll create a follow-up JIRA to fix it.

bq. Secondly, it looks like we added one more path when setting reducer parallelism. Please explain a bit. (I remember we didn't have that in a previous patch that I reviewed.) Using two passes to avoid plan change, as you noted previously?

Yes, the latest patch uses a separate pass to process reducer parallelism. Using a single pass caused quite a lot of plan changes, and some of the changes are non-trivial (e.g., a join operator was moved from one work to another). So I added the separate pass just to be safe. Let me know if you think the previous approach is better.

> HoS: poor reducer parallelism when operator stats are not accurate
> ------------------------------------------------------------------
>
>                 Key: HIVE-15796
>                 URL: https://issues.apache.org/jira/browse/HIVE-15796
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>   Affects Versions: 2.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-15796.1.patch, HIVE-15796.2.patch,
> HIVE-15796.3.patch, HIVE-15796.4.patch, HIVE-15796.5.patch,
> HIVE-15796.6.patch, HIVE-15796.wip.1.patch, HIVE-15796.wip.2.patch,
> HIVE-15796.wip.patch
>
>
> In HoS we currently use operator stats to determine reducer parallelism.
> However, operator stats are often inaccurate, especially when column stats
> are not available. This sometimes produces extremely poor reducer
> parallelism and causes HoS queries to run forever.
> This JIRA tries to offer an alternative way to compute reducer parallelism,
> similar to how MR does it. Here's the approach we are suggesting:
> 1. when computing the parallelism for a MapWork, use stats associated with
> the TableScan operator;
> 2. when computing the parallelism for a ReduceWork, use the *maximum*
> parallelism from all its parents.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
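
For readers following the issue, the two rules in the description can be sketched roughly as below. This is only an illustration, not the code in the attached patches: the `Work`, `MapWork`, and `ReduceWork` classes here are simplified stand-ins, `table_scan_bytes`, `bytes_per_reducer`, and `max_reducers` are hypothetical names with placeholder values, and the "input size divided by bytes per reducer" formula is an assumption about what "similar to how MR does" means.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    """Simplified stand-in for a unit of work in the plan (hypothetical)."""
    name: str
    parents: List["Work"] = field(default_factory=list)


@dataclass
class MapWork(Work):
    # Assumed: data size in bytes reported by the TableScan operator's stats.
    table_scan_bytes: int = 0


@dataclass
class ReduceWork(Work):
    pass


def parallelism(work: Work,
                bytes_per_reducer: int = 256_000_000,  # placeholder value
                max_reducers: int = 1009) -> int:      # placeholder value
    """Estimate parallelism following the two rules in the description."""
    if isinstance(work, MapWork):
        # Rule 1: size the MapWork from TableScan stats only
        # (assumed MR-style formula: input bytes / bytes per reducer).
        estimate = math.ceil(work.table_scan_bytes / bytes_per_reducer)
        return max(1, min(max_reducers, estimate))
    # Rule 2: a ReduceWork takes the maximum parallelism of its parents.
    return max(
        (parallelism(p, bytes_per_reducer, max_reducers) for p in work.parents),
        default=1,
    )


# Two scans feeding a join, followed by an aggregation.
map_1 = MapWork("Map 1", table_scan_bytes=10_000)
map_2 = MapWork("Map 2", table_scan_bytes=2_500)
reducer_3 = ReduceWork("Reducer 3", parents=[map_1, map_2])
reducer_4 = ReduceWork("Reducer 4", parents=[reducer_3])

for w in (map_1, map_2, reducer_3, reducer_4):
    print(w.name, parallelism(w, bytes_per_reducer=1_000))
```

Because a ReduceWork only looks at its parents, a poor intermediate operator estimate (for example, a join whose output size is badly underestimated) never lowers the parallelism of later stages in this sketch; the estimate is anchored to the TableScan sizes.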