[ https://issues.apache.org/jira/browse/HIVE-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair updated HIVE-7190:
--------------------------------

    Assignee: Ivan Mitic

> WebHCat launcher task failure can cause two concurrent user jobs to run
> ----------------------------------------------------------------------
>
>                 Key: HIVE-7190
>                 URL: https://issues.apache.org/jira/browse/HIVE-7190
>             Project: Hive
>          Issue Type: Bug
>          Components: WebHCat
>    Affects Versions: 0.13.0
>            Reporter: Ivan Mitic
>            Assignee: Ivan Mitic
>         Attachments: HIVE-7190.2.patch, HIVE-7190.3.patch, HIVE-7190.patch
>
>
> Templeton uses launcher jobs to launch the actual user jobs. Launcher jobs
> are 1-map (single-task) jobs that kick off the actual user job and monitor
> it until it finishes. Because the launcher is a task like any other MR
> task, it has a retry policy in case it fails (due to a task crash, a
> tasktracker/nodemanager crash, a machine-level outage, etc.). When the
> launcher task is retried, it launches the same user job again, *however*
> the user job from the previous attempt is already running. This means we
> can end up with two identical user jobs running in parallel.
> In the case of MRv2, both the MRAppMaster and the launcher task are
> subject to failure. If either of the two fails, another instance of the
> user job is launched in parallel.
> The above situation is already a bug.
> Going further to RM HA: on failover/restart, the RM kills all containers
> and restarts all applications. This means that if a customer had 10 jobs
> on the cluster (that is, 10 launcher jobs and 10 user jobs), on RM
> failover all 20 jobs will be restarted, and the launcher jobs will queue
> the user jobs again. There are two issues with this design:
> 1. There is a *possible* chance of corruption of job outputs (it would be
> useful to analyze this scenario further and confirm this statement).
> 2. Cluster resources are spent redundantly on duplicate jobs.
> To address the issue, at least on YARN (Hadoop 2.0) clusters, WebHCat
> should do what Oozie does in this scenario: tag all of its child jobs
> with an id, and kill the jobs carrying that tag on task restart, before
> they are kicked off again.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
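The Oozie-style tag-and-kill approach proposed above can be sketched as follows. This is a minimal Python simulation of the control flow only, not the actual WebHCat/YARN code: the `ResourceManager` class, the `kill_jobs_with_tag` method, and the `templeton-launcher-...` tag format are hypothetical stand-ins for the real YARN application-tag and kill-application APIs.

```python
# Simulation of the tag-and-kill idea: every child job a launcher submits
# is tagged with the launcher's id; when a launcher attempt (re)starts, it
# first kills any still-running jobs carrying its tag, so a task retry can
# never leave two copies of the same user job running.

class ResourceManager:
    """Hypothetical stand-in for the YARN ResourceManager."""

    def __init__(self):
        self.running = {}   # job_id -> set of application tags
        self._next_id = 0

    def submit(self, tags):
        # Register a new running job carrying the given tags.
        job_id = f"job_{self._next_id}"
        self._next_id += 1
        self.running[job_id] = set(tags)
        return job_id

    def kill_jobs_with_tag(self, tag):
        # Kill every running job whose tag set contains `tag`.
        victims = [j for j, tags in self.running.items() if tag in tags]
        for j in victims:
            del self.running[j]
        return victims

def run_launcher(rm, launcher_id):
    """One launcher task attempt: kill leftovers from any previous
    attempt, then submit the user job tagged with this launcher's id."""
    tag = f"templeton-launcher-{launcher_id}"
    rm.kill_jobs_with_tag(tag)   # no-op on the first attempt
    return rm.submit(tags={tag})

rm = ResourceManager()
first = run_launcher(rm, launcher_id="g1")   # first launcher attempt
# The launcher task fails and is retried; without the kill step the
# first user job would keep running alongside the new one.
second = run_launcher(rm, launcher_id="g1")  # retried launcher attempt
assert first not in rm.running and second in rm.running
assert len(rm.running) == 1   # only one copy of the user job remains
```

The same logic covers the RM-HA restart case: each restarted launcher kills its own tagged children before resubmitting, so duplicates never accumulate.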