[ https://issues.apache.org/jira/browse/HIVE-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sushanth Sowmyan updated HIVE-7803:
-----------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

(Closing as duplicate without committing, since this functionality is subsumed and improved by HIVE-8394)

> Enable Hadoop speculative execution may cause corrupt output directory (dynamic partition)
> ------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7803
>                 URL: https://issues.apache.org/jira/browse/HIVE-7803
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>    Affects Versions: 0.13.1
>        Environment:
>           Reporter: Selina Zhang
>           Assignee: Selina Zhang
>           Priority: Critical
>        Attachments: HIVE-7803.1.patch, HIVE-7803.2.patch
>
> One of our users reports intermittent failures caused by attempt directories left in the input paths. We found that with speculative execution turned on, two mappers tried to commit their task at the same time using the same committed-task path, which corrupted the output directory.
>
> The original Pig script:
> {code}
> STORE AdvertiserDataParsedClean INTO '$DB_NAME.$ADVERTISER_META_TABLE_NAME'
>     USING org.apache.hcatalog.pig.HCatStorer();
> {code}
>
> Two mappers:
> attempt_1405021984947_5394024_m_000523_0: KILLED
> attempt_1405021984947_5394024_m_000523_1: SUCCEEDED
>
> attempt_1405021984947_5394024_m_000523_0 was killed right after the commit. As a result, it created a corrupt directory
> /projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/task_1405021984947_5394024_m_000523/
> containing
> part-m-00523 (from attempt_1405021984947_5394024_m_000523_0)
> and
> attempt_1405021984947_5394024_m_000523_1/part-m-00523
>
> Namenode Audit log
> ==========================
> 1. 2014-08-05 05:04:36,811 INFO FSNamesystem.audit: ugi=* ip=ipaddress1 cmd=create src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_0/part-m-00523 dst=null perm=user:group:rw-r-----
> 2. 2014-08-05 05:04:53,112 INFO FSNamesystem.audit: ugi=* ip=ipaddress2 cmd=create src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_1/part-m-00523 dst=null perm=user:group:rw-r-----
> 3. 2014-08-05 05:05:13,001 INFO FSNamesystem.audit: ugi=* ip=ipaddress1 cmd=rename src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_0 dst=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/task_1405021984947_5394024_m_000523 perm=user:group:rwxr-x---
> 4. 2014-08-05 05:05:13,004 INFO FSNamesystem.audit: ugi=* ip=ipaddress2 cmd=rename src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_1 dst=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/task_1405021984947_5394024_m_000523 perm=user:group:rwxr-x---
>
> After consulting our Hadoop core team, we were told that some HCat code does not participate in the two-phase commit protocol; for example, in FileRecordWriterContainer.close():
> {code}
> for (Map.Entry<String, org.apache.hadoop.mapred.OutputCommitter> entry
>         : baseDynamicCommitters.entrySet()) {
>     org.apache.hadoop.mapred.TaskAttemptContext currContext =
>         dynamicContexts.get(entry.getKey());
>     OutputCommitter baseOutputCommitter = entry.getValue();
>     if (baseOutputCommitter.needsTaskCommit(currContext)) {
>         baseOutputCommitter.commitTask(currContext);
>     }
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
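The race described in the report can be sketched with a minimal, hypothetical simulation (this is not Hive or Hadoop code; all class and method names are invented for illustration). In the buggy pattern, each speculative attempt commits unilaterally in close(), so both renames can land on the same task path; in the two-phase protocol, the ApplicationMaster arbitrates and grants commit permission to exactly one attempt, modeled here with a single compare-and-set:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class SpeculativeCommitSketch {
    // Stand-in for the AM's commit arbiter: the first attempt to ask wins.
    static final AtomicBoolean commitGranted = new AtomicBoolean(false);
    // Counts how many attempts performed the "rename into the task path".
    static final AtomicInteger commits = new AtomicInteger(0);

    // Buggy pattern (as in the close() snippet): each attempt commits
    // unilaterally, so two speculative attempts can both rename.
    static void closeAndCommitUnilaterally() {
        commits.incrementAndGet(); // both renames land -> corrupt task dir
    }

    // Two-phase pattern: ask the arbiter first; only one attempt proceeds,
    // the loser aborts and its attempt directory is cleaned up.
    static void commitViaTwoPhaseProtocol() {
        if (commitGranted.compareAndSet(false, true)) {
            commits.incrementAndGet(); // exactly one rename happens
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Unilateral commit: both speculative attempts "succeed".
        Thread a0 = new Thread(SpeculativeCommitSketch::closeAndCommitUnilaterally);
        Thread a1 = new Thread(SpeculativeCommitSketch::closeAndCommitUnilaterally);
        a0.start(); a1.start(); a0.join(); a1.join();
        System.out.println("unilateral commits: " + commits.get());

        // Coordinated commit: exactly one attempt wins the arbitration.
        commits.set(0);
        Thread b0 = new Thread(SpeculativeCommitSketch::commitViaTwoPhaseProtocol);
        Thread b1 = new Thread(SpeculativeCommitSketch::commitViaTwoPhaseProtocol);
        b0.start(); b1.start(); b0.join(); b1.join();
        System.out.println("coordinated commits: " + commits.get());
    }
}
```

This matches the audit log above: entries 3 and 4 are two renames from different attempt directories into the same task path, which is only possible when the commit is not arbitrated.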