[jira] [Commented] (HIVE-7190) WebHCat launcher task failure can cause two concurrent user jobs to run

Hive QA (JIRA) Tue, 17 Jun 2014 05:27:19 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033726#comment-14033726
 ]


Hive QA commented on HIVE-7190:
-------------------------------



{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12649960/HIVE-7190.3.patch

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 5537 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_columnar
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.ql.exec.tez.TestTezTask.testSubmit
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadDataPrimitiveTypes
org.apache.hive.jdbc.miniHS2.TestHiveServer2.testConnection
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/489/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/489/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-489/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12649960

> WebHCat launcher task failure can cause two concurrent user jobs to run
> ----------------------------------------------------------------------
>
>                 Key: HIVE-7190
>                 URL: https://issues.apache.org/jira/browse/HIVE-7190
>             Project: Hive
>          Issue Type: Bug
>          Components: WebHCat
>    Affects Versions: 0.13.0
>            Reporter: Ivan Mitic
>            Assignee: Ivan Mitic
>         Attachments: HIVE-7190.2.patch, HIVE-7190.3.patch, HIVE-7190.patch
>
>
> Templeton uses launcher jobs to launch the actual user jobs. Launcher jobs 
> are 1-map jobs (single-task jobs) that kick off the actual user job and 
> monitor it until it finishes. Because the launcher is a task like any 
> other MR task, it has a retry policy in case it fails (due to a task crash, 
> tasktracker/nodemanager crash, machine-level outage, etc.). When the 
> launcher task is retried, it launches the same user job again, *however* 
> the user job from the previous attempt is still running. This means that we 
> can have two identical user jobs running in parallel. 
> In the case of MRv2, there are an MRAppMaster and the launcher task, both 
> of which are subject to failure. If either of the two fails, another 
> instance of the user job will be launched in parallel. 
> The above situation is already a bug.
> Going further to RM HA: on failover/restart, the RM kills all containers 
> and restarts all applications. This means that if our customer had 10 jobs 
> on the cluster (that is, 10 launcher jobs and 10 user jobs), then on RM 
> failover all 20 jobs will be restarted, and the launcher jobs will queue 
> the user jobs again. There are two issues with this design:
> 1. Job outputs may be corrupted (it would be useful to analyze this 
> scenario further and confirm this statement).
> 2. Cluster resources are spent on redundant jobs.
> To address the issue at least on YARN (Hadoop 2.0) clusters, WebHCat should 
> do the same thing Oozie does in this scenario: tag all its child jobs with 
> an id, and kill those jobs on task restart before they are kicked off 
> again.
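
The tag-and-kill approach proposed in the description can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not the patch's implementation: `FakeCluster`, `Job`, `launcher_attempt`, and the `kill_stale` flag are hypothetical names that stand in for the resource manager and the launcher task, and no real Hadoop or WebHCat API is used.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Job:
    job_id: int
    tag: str                # id of the launcher that started this job
    state: str = "RUNNING"


@dataclass
class FakeCluster:
    """In-memory stand-in for a resource manager (hypothetical, not a YARN API)."""
    jobs: list = field(default_factory=list)
    _ids: count = field(default_factory=count)

    def submit(self, tag):
        job = Job(job_id=next(self._ids), tag=tag)
        self.jobs.append(job)
        return job

    def running_with_tag(self, tag):
        return [j for j in self.jobs if j.tag == tag and j.state == "RUNNING"]

    def kill(self, job):
        job.state = "KILLED"


def launcher_attempt(cluster, launcher_tag, kill_stale=True):
    """One attempt of the launcher task: clean up, then launch the user job."""
    if kill_stale:
        # Kill child jobs left behind by an earlier attempt of this launcher.
        for stale in cluster.running_with_tag(launcher_tag):
            cluster.kill(stale)
    return cluster.submit(launcher_tag)


if __name__ == "__main__":
    # Without cleanup, a retried launcher leaves two identical user jobs running.
    naive = FakeCluster()
    launcher_attempt(naive, "launcher-1", kill_stale=False)
    launcher_attempt(naive, "launcher-1", kill_stale=False)
    print(len(naive.running_with_tag("launcher-1")))   # 2

    # With tagging, the retry kills the stale child before relaunching.
    tagged = FakeCluster()
    launcher_attempt(tagged, "launcher-1")
    launcher_attempt(tagged, "launcher-1")
    print(len(tagged.running_with_tag("launcher-1")))  # 1
```

The sketch kills only jobs that carry the retried launcher's own tag, so jobs started by other launchers are left alone. It does not model the RM failover case, where the launcher and the user job are both restarted by the RM.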



--
This message was sent by Atlassian JIRA
(v6.2#6252)
