[ https://issues.apache.org/jira/browse/HIVE-17095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089205#comment-16089205 ]
Hive QA commented on HIVE-17095:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12877358/HIVE-17095.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 13 failed/errored test(s), 11067 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=143)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning_2] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_explainuser_1] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_op_stats] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_ts_stats_for_mapjoin] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=99)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation (batchId=178)
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testHttpRetryOnServerIdleTimeout (batchId=227)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6054/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6054/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6054/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 13 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12877358 - PreCommit-HIVE-Build

> Long chain repl loads do not complete in a timely fashion
> ---------------------------------------------------------
>
>                 Key: HIVE-17095
>                 URL: https://issues.apache.org/jira/browse/HIVE-17095
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning, repl
>            Reporter: sapin amin
>            Assignee: Sushanth Sowmyan
>         Attachments: HIVE-17095.2.patch, HIVE-17095.patch
>
>
> Per performance testing done by [~sapinamin] (thus, I'm setting him as
> reporter), we were able to discover an important bug affecting replication.
> It has the potential to affect other large DAGs of Tasks that Hive generates
> as well, if those DAGs have multiple paths to child Task nodes.
> Basically, we find that incremental REPL LOAD does not finish in a timely
> fashion. The test in this case was to add 400 partitions and replicate them.
> Associated with each partition, there was an ADD PTN and an ALTER PTN. For
> each of the ADD PTN tasks, we'd generate a DDLTask, a CopyTask and a
> MoveTask. For each ALTER PTN, there'd be a single DDLTask. And order of
> execution is important, so it would chain in dependency collection tasks
> between phases.
> Trying to root-cause this shows that it stalls seemingly forever at Driver
> instantiation time, and it almost looks like the thread never proceeds past
> that point.
> Looking at logs, the way this is written, it walks all tasks in the subtree
> of every node without checking for duplicates, and it does this simply to
> get the number of execution tasks! Thus, the task visitor will visit every
> subtree of every node, which is fine for graphs that look like open trees,
> but is terrible for us, since we have dependency collection tasks between
> each phase. Effectively, this is what's happening:
> We have a DAG, say, like this:
> 4 tasks in parallel -> DEP col -> 4 tasks in parallel -> DEP col -> ...
> This means that for each of the 4 root tasks, we do a full traversal of the
> entire subgraph past the DEP col (not just a visit to each node), and this
> happens recursively, leading to exponential growth in the number of task
> visits as the length and breadth of the graph increase. In our case, we had
> about 800 tasks in the graph, with a width of roughly 2-3 and about 200
> stages, with a dep collection before and after each stage. That meant the
> leaf nodes of this DAG could be reached by something like 2^200 to 3^200
> distinct paths, and we would visit them once per path. And all this simply
> to count the number of tasks to schedule - and we revisit this function
> several more times, once per hook, once for the MapReduceCompiler and once
> for the TaskCompiler.
> We have not been sending such large DAGs to the Driver, so this has not been
> a problem until now, and there are upcoming changes to reduce the number of
> tasks replication generates (as part of a memory addressing issue), but we
> should still fix the way we do Task traversal so that a large DAG cannot
> cripple us.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
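To illustrate the traversal behaviour described in the issue, here is a minimal, self-contained Java sketch. It does not use Hive's Task classes; {{Node}}, {{buildLayeredDag}}, {{naiveVisits}}, and {{dedupCount}} are hypothetical names introduced only for this example. It builds a DAG of the shape described above (parallel tasks funnelled through a dependency-collection node per stage) and compares a naive recursive count, which re-walks every subtree once per incoming path, with a count that tracks visited nodes.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TraversalSketch {

    // Hypothetical stand-in for a task node with child tasks (not Hive's Task class).
    static class Node {
        final List<Node> children = new ArrayList<>();
    }

    // Naive count: every node is counted once per distinct path that reaches it,
    // so the work grows exponentially with the number of dep-collection stages.
    static long naiveVisits(Node n) {
        long visits = 1;
        for (Node child : n.children) {
            visits += naiveVisits(child);
        }
        return visits;
    }

    // Deduplicated count: each node is visited at most once, so the work is
    // linear in the number of nodes and edges.
    static long dedupCount(Node n, Set<Node> visited) {
        if (!visited.add(n)) {
            return 0;
        }
        long count = 1;
        for (Node child : n.children) {
            count += dedupCount(child, visited);
        }
        return count;
    }

    // Builds a DAG shaped like the one in the issue:
    // root -> width parallel tasks -> dep col -> width parallel tasks -> dep col -> ...
    static Node buildLayeredDag(int stages, int width) {
        Node root = new Node();
        List<Node> frontier = Collections.singletonList(root);
        for (int s = 0; s < stages; s++) {
            List<Node> layer = new ArrayList<>();
            for (int i = 0; i < width; i++) {
                layer.add(new Node());
            }
            for (Node parent : frontier) {
                parent.children.addAll(layer);
            }
            Node depCol = new Node(); // dependency-collection task between stages
            for (Node task : layer) {
                task.children.add(depCol);
            }
            frontier = Collections.singletonList(depCol);
        }
        return root;
    }

    public static void main(String[] args) {
        // 20 stages of width 2 is already enough to show the blow-up; the
        // 200-stage, width 2-3 DAG from the issue would never finish naively.
        Node root = buildLayeredDag(20, 2);
        System.out.println("actual tasks:  " + dedupCount(root, new HashSet<>()));
        System.out.println("naive visits:  " + naiveVisits(root));
    }
}
{code}

With a visited set (or memoized per-node counts) the count is linear in the size of the graph: 20 stages of width 2 contain 61 nodes, yet the naive walk already performs millions of visits, which is the exponential behaviour the issue attributes to counting tasks without duplicate detection.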