[ https://issues.apache.org/jira/browse/HIVE-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880122#comment-15880122 ]
Lefty Leverenz commented on HIVE-15796:
---------------------------------------

By the way, the multi-line description of *hive.spark.use.op.stats* should have included newlines (\n) to avoid having a single-line description in the generated template file hive-default.xml.template. (Several hive.spark.* parameters make the same mistake in their descriptions, but *hive.spark.dynamic.partition.pruning* gets it right.)

> HoS: poor reducer parallelism when operator stats are not accurate
> ------------------------------------------------------------------
>
> Key: HIVE-15796
> URL: https://issues.apache.org/jira/browse/HIVE-15796
> Project: Hive
> Issue Type: Improvement
> Components: Statistics
> Affects Versions: 2.2.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Labels: TODOC2.2
> Fix For: 2.2.0
>
> Attachments: HIVE-15796.1.patch, HIVE-15796.2.patch,
> HIVE-15796.3.patch, HIVE-15796.4.patch, HIVE-15796.5.patch,
> HIVE-15796.6.patch, HIVE-15796.wip.1.patch, HIVE-15796.wip.2.patch,
> HIVE-15796.wip.patch
>
>
> In HoS we currently use operator stats to determine reducer parallelism.
> However, operator stats are often inaccurate, especially if column stats
> are not available. This sometimes generates extremely poor reducer
> parallelism and causes an HoS query to run forever.
> This JIRA tries to offer an alternative way to compute reducer parallelism,
> similar to what MR does. Here's the approach we are suggesting:
> 1. when computing the parallelism for a MapWork, use stats associated with
> the TableScan operator;
> 2. when computing the parallelism for a ReduceWork, use the *maximum*
> parallelism from all its parents.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
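
The two-step approach in the issue description can be sketched roughly as follows. This is an illustrative Python model, not Hive code: the Work class, estimate_parallelism, and the bytes_per_reducer and max_reducers parameters are hypothetical names, and dividing the TableScan data size by a bytes-per-reducer target is only an assumption about how a map-side estimate might be derived from those stats.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    """Hypothetical stand-in for a MapWork or ReduceWork node in the plan."""
    name: str
    parents: List["Work"] = field(default_factory=list)
    # Data size reported by the TableScan operator; only used when the
    # node has no parents (i.e. it models a MapWork).
    table_scan_bytes: int = 0


def estimate_parallelism(work: Work, bytes_per_reducer: int, max_reducers: int) -> int:
    """Estimate parallelism following the two rules in the description.

    1. MapWork (no parents): derive it from the TableScan data size.
    2. ReduceWork: take the maximum parallelism over all parents.
    """
    if not work.parents:
        # Integer ceiling division, clamped to the range [1, max_reducers].
        wanted = -(-work.table_scan_bytes // bytes_per_reducer)
        return max(1, min(max_reducers, wanted))
    return max(
        estimate_parallelism(parent, bytes_per_reducer, max_reducers)
        for parent in work.parents
    )


if __name__ == "__main__":
    big_scan = Work("map-1", table_scan_bytes=1000)
    small_scan = Work("map-2", table_scan_bytes=250)
    join = Work("reducer-1", parents=[big_scan, small_scan])
    print(estimate_parallelism(join, bytes_per_reducer=100, max_reducers=1009))
```

With the made-up sizes above, the two map-side estimates are 10 and 3, so the reducer inherits 10. The point of the sketch is only that the reducer count no longer depends on per-operator estimates downstream of the TableScan, which are the ones the description calls unreliable.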