[ https://issues.apache.org/jira/browse/HIVE-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194101#comment-14194101 ]
Sushanth Sowmyan commented on HIVE-8394:
----------------------------------------

Good to know - I'll go ahead and commit it in both branches.

> HIVE-7803 doesn't handle Pig MultiQuery, can cause data-loss.
> -------------------------------------------------------------
>
>                 Key: HIVE-8394
>                 URL: https://issues.apache.org/jira/browse/HIVE-8394
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>    Affects Versions: 0.12.0, 0.14.0, 0.13.1
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>            Priority: Critical
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8394.1.patch, HIVE-8394.2.patch, HIVE-8394.3.patch,
>                      HIVE-8394.4.patch
>
>
> We've found situations in production where Pig queries using {{HCatStorer}},
> dynamic partitioning and {{opt.multiquery=true}} produce partitions in the
> output table, but the corresponding directories have no data files (in spite
> of Pig reporting non-zero records written to HDFS). I don't yet have a
> distilled test-case for this.
>
> Here's the code from FileOutputCommitterContainer after HIVE-7803:
>
> {code:java|title=FileOutputCommitterContainer.java|borderStyle=dashed|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
>   @Override
>   public void commitTask(TaskAttemptContext context) throws IOException {
>     String jobInfoStr = context.getConfiguration().get(FileRecordWriterContainer.DYN_JOBINFO);
>     if (!dynamicPartitioningUsed) {
>       //See HCATALOG-499
>       FileOutputFormatContainer.setWorkOutputPath(context);
>       getBaseOutputCommitter().commitTask(HCatMapRedUtil.createTaskAttemptContext(context));
>     } else if (jobInfoStr != null) {
>       ArrayList<String> jobInfoList = (ArrayList<String>)HCatUtil.deserialize(jobInfoStr);
>       org.apache.hadoop.mapred.TaskAttemptContext currTaskContext = HCatMapRedUtil.createTaskAttemptContext(context);
>       for (String jobStr : jobInfoList) {
>         OutputJobInfo localJobInfo = (OutputJobInfo)HCatUtil.deserialize(jobStr);
>         FileOutputCommitter committer = new FileOutputCommitter(new Path(localJobInfo.getLocation()), currTaskContext);
>         committer.commitTask(currTaskContext);
>       }
>     }
>   }
> {code}
>
> The serialized jobInfoList can't be retrieved, and hence the commit never
> completes. This is because Pig's MapReducePOStoreImpl deliberately clones
> both the TaskAttemptContext and the contained Configuration instance, thus
> separating the Configuration instances passed to
> {{FileOutputCommitterContainer::commitTask()}} and
> {{FileRecordWriterContainer::close()}}. Anything set by the RecordWriter is
> unavailable to the Committer.
>
> One approach would have been to store state in the FileOutputFormatContainer.
> But that won't work since this is constructed via reflection in
> HCatOutputFormat (itself constructed via reflection by PigOutputFormat via
> HCatStorer). There's no guarantee that the instance is preserved.
>
> My only recourse seems to be to use a Singleton to store shared state. I'm
> loath to indulge in this brand of shenanigans. (Statics and container-reuse
> in Tez might not play well together, for instance.) It might work if we're
> careful about tearing down the singleton.
>
> Any other ideas?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
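
For anyone skimming the thread, here is a rough, self-contained sketch of the failure mode described in the issue: a value set on one configuration object is invisible through a copy that was cloned earlier. This is not Hive or Pig code. FakeConf is a hypothetical stand-in for Hadoop's Configuration (which the issue says Pig clones), and the key string and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ClonedConfDemo {

    /** Hypothetical stand-in for a Hadoop Configuration: a key/value bag with a copy constructor. */
    static final class FakeConf {
        private final Map<String, String> props = new HashMap<>();

        FakeConf() {}

        /** Copies the entries as they are now; later writes to 'other' are not seen. */
        FakeConf(FakeConf other) {
            props.putAll(other.props);
        }

        void set(String key, String value) {
            props.put(key, value);
        }

        String get(String key) {
            return props.get(key);
        }
    }

    // Hypothetical key; the real constant is FileRecordWriterContainer.DYN_JOBINFO.
    static final String DYN_JOBINFO = "demo.dyn.jobinfo";

    /**
     * Returns what the "committer" reads after the "record writer" has stored
     * its job info, with or without a clone separating the two objects.
     */
    static String committerViewAfterWriterClose(boolean cloneForCommitter) {
        FakeConf writerConf = new FakeConf();
        // The clone is taken before the writer stores anything.
        FakeConf committerConf = cloneForCommitter ? new FakeConf(writerConf) : writerConf;

        // Stands in for what the record writer does on close().
        writerConf.set(DYN_JOBINFO, "serialized-job-info");

        // Stands in for the lookup at the top of commitTask().
        return committerConf.get(DYN_JOBINFO);
    }

    public static void main(String[] args) {
        System.out.println("shared conf: " + committerViewAfterWriterClose(false));
        System.out.println("cloned conf: " + committerViewAfterWriterClose(true));
    }
}
```

In the cloned case the lookup returns null, which corresponds to the path in commitTask() where jobInfoStr is null and neither branch commits anything.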