[ https://issues.apache.org/jira/browse/HIVE-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187361#comment-14187361 ]
Sushanth Sowmyan commented on HIVE-8394:
----------------------------------------

I'll admit to the same distaste for using a Singleton to store state like this; we've had similar problems with HCatContext in the past. Still, I agree with your assertion that it seems to be the only real way to handle this issue.

That said, the attached file HIVE-8394.1.patch includes a TaskCommitterContextRegistry.discardCleanupFor but never calls it. I assume that's what you meant by your comment about needing to add a finally block? Also, yes, the patch in its current form has an issue with multiple HCatStorers. Do you have an updated patch with both of these issues resolved?

> HIVE-7803 doesn't handle Pig MultiQuery, can cause data-loss.
> -------------------------------------------------------------
>
> Key: HIVE-8394
> URL: https://issues.apache.org/jira/browse/HIVE-8394
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Affects Versions: 0.12.0, 0.14.0, 0.13.1
> Reporter: Mithun Radhakrishnan
> Assignee: Mithun Radhakrishnan
> Priority: Critical
> Attachments: HIVE-8394.1.patch
>
>
> We've found situations in production where Pig queries using {{HCatStorer}},
> dynamic partitioning and {{opt.multiquery=true}} produce partitions in
> the output table, but the corresponding directories have no data files (in
> spite of Pig reporting non-zero records written to HDFS). I don't yet have a
> distilled test-case for this.
> Here's the code from FileOutputCommitterContainer after HIVE-7803:
> {code:java|title=FileOutputCommitterContainer.java|borderStyle=dashed|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
>   @Override
>   public void commitTask(TaskAttemptContext context) throws IOException {
>     String jobInfoStr = context.getConfiguration().get(FileRecordWriterContainer.DYN_JOBINFO);
>     if (!dynamicPartitioningUsed) {
>       //See HCATALOG-499
>       FileOutputFormatContainer.setWorkOutputPath(context);
>       getBaseOutputCommitter().commitTask(HCatMapRedUtil.createTaskAttemptContext(context));
>     } else if (jobInfoStr != null) {
>       ArrayList<String> jobInfoList = (ArrayList<String>)HCatUtil.deserialize(jobInfoStr);
>       org.apache.hadoop.mapred.TaskAttemptContext currTaskContext = HCatMapRedUtil.createTaskAttemptContext(context);
>       for (String jobStr : jobInfoList) {
>         OutputJobInfo localJobInfo = (OutputJobInfo)HCatUtil.deserialize(jobStr);
>         FileOutputCommitter committer = new FileOutputCommitter(new Path(localJobInfo.getLocation()), currTaskContext);
>         committer.commitTask(currTaskContext);
>       }
>     }
>   }
> {code}
> The serialized jobInfoList can't be retrieved, and hence the commit never
> completes. This is because Pig's MapReducePOStoreImpl deliberately clones
> both the TaskAttemptContext and the contained Configuration instance, thus
> separating the Configuration instances passed to
> {{FileOutputCommitterContainer::commitTask()}} and
> {{FileRecordWriterContainer::close()}}. Anything set by the RecordWriter is
> unavailable to the Committer.
> One approach would have been to store state in the FileOutputFormatContainer.
> But that won't work since this is constructed via reflection in
> HCatOutputFormat (itself constructed via reflection by PigOutputFormat via
> HCatStorer). There's no guarantee that the instance is preserved.
> My only recourse seems to be to use a Singleton to store shared state. I'm
> loath to indulge in this brand of shenanigans.
> (Statics and container-reuse in Tez might not play well together, for
> instance.) It might work if we're careful about tearing down the singleton.
> Any other ideas?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
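
The registry idea discussed in this thread, together with the two open issues raised in the comment (cleanup that is never invoked, and collisions between multiple HCatStorers), could be sketched roughly as follows. This is not the code in HIVE-8394.1.patch: the class `CommitRegistrySketch`, its `CommitAction` callback, and the key format are all hypothetical, and plain strings stand in for Hadoop's task-attempt and path types so the example compiles without Hadoop on the classpath. It assumes that keying on task attempt plus output location is enough to keep storers apart.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a process-wide commit registry; not the HIVE-8394 patch. */
final class CommitRegistrySketch {

    /** Work the record writer hands over for the committer to run later. */
    interface CommitAction {
        void commit() throws IOException;
    }

    private static final CommitRegistrySketch INSTANCE = new CommitRegistrySketch();

    // Assumption: one entry per (task attempt, output location), so that two
    // storers running inside the same task attempt get separate entries.
    private final Map<String, CommitAction> actions = new ConcurrentHashMap<>();

    private CommitRegistrySketch() {}

    static CommitRegistrySketch getInstance() {
        return INSTANCE;
    }

    /** Builds the registry key from stand-in string identifiers. */
    static String key(String taskAttemptId, String outputLocation) {
        return taskAttemptId + "@" + outputLocation;
    }

    /** Called by the writer side; rejects a second registration for the same key. */
    void register(String key, CommitAction action) {
        if (actions.putIfAbsent(key, action) != null) {
            throw new IllegalStateException("Already registered: " + key);
        }
    }

    /**
     * Called by the committer side. Runs the registered action if there is one.
     * The entry is removed in a finally block, so it is discarded even when the
     * commit throws and cannot leak into a later task in a reused JVM.
     */
    boolean commitAndDiscard(String key) throws IOException {
        try {
            CommitAction action = actions.get(key);
            if (action == null) {
                return false; // nothing was registered for this key
            }
            action.commit();
            return true;
        } finally {
            actions.remove(key);
        }
    }

    /** Called on the abort path: drops the entry without running it. */
    void discard(String key) {
        actions.remove(key);
    }

    int size() {
        return actions.size();
    }
}
```

In this sketch a record writer would call `register(...)` when it closes, and the committer would call `commitAndDiscard(...)` from its commit path or `discard(...)` from its abort path. Whether that teardown is sufficient under Tez container reuse is the open question from the thread, not something the sketch settles.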