[ https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388925#comment-14388925 ]
Mithun Radhakrishnan commented on HIVE-9845:
--------------------------------------------

Bah, finally. Unrelated test-failures.

> HCatSplit repeats information making input split data size huge
> ---------------------------------------------------------------
>
>                 Key: HIVE-9845
>                 URL: https://issues.apache.org/jira/browse/HIVE-9845
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>            Reporter: Rohini Palaniswamy
>            Assignee: Mithun Radhakrishnan
>         Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch
>
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which
> has even triple the number of splits(100K+ splits and tasks) does not hit
> that issue.
> {code}
> HCatBaseInputFormat.java:
>       //Call getSplit on the InputFormat, create an
>       //HCatSplit for each underlying split
>       //NumSplits is 0 for our purposes
>       org.apache.hadoop.mapred.InputSplit[] baseSplits =
>         inputFormat.getSplits(jobConf, 0);
>       for(org.apache.hadoop.mapred.InputSplit split : baseSplits) {
>         splits.add(new HCatSplit(
>             partitionInfo,
>             split,allCols));
>       }
> {code}
> Each hcatSplit duplicates partition schema and table schema.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
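
For readers skimming the issue, here is a rough sketch of why repeating the schemas in every split inflates the serialized split data. It is not the HIVE-9845 patch and does not use any HCatalog classes: `SplitSizeSketch`, `fakeSchema`, `duplicatedSize` and `sharedSize` are hypothetical names, and the schema is just a made-up string. It only compares writing both schemas once per split against writing them once in total.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; none of these names exist in Hive or HCatalog.
final class SplitSizeSketch {

    // Builds a made-up ASCII schema string with the given number of columns.
    static String fakeSchema(int columns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns; i++) {
            sb.append("col").append(i).append(":string,");
        }
        return sb.toString();
    }

    // Every split carries its own copy of both schemas (the pattern the issue describes).
    static int duplicatedSize(int numSplits, String tableSchema, String partitionSchema) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < numSplits; i++) {
                out.writeUTF("split-" + i);
                out.writeUTF(tableSchema);
                out.writeUTF(partitionSchema);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Both schemas are written once, followed by the per-split data only.
    static int sharedSize(int numSplits, String tableSchema, String partitionSchema) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(tableSchema);
            out.writeUTF(partitionSchema);
            for (int i = 0; i < numSplits; i++) {
                out.writeUTF("split-" + i);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        String schema = fakeSchema(100);
        int numSplits = 100_000;
        System.out.println("duplicated bytes: " + duplicatedSize(numSplits, schema, schema));
        System.out.println("shared bytes:     " + sharedSize(numSplits, schema, schema));
    }
}
```

In this toy model the overhead grows linearly with the number of splits, so with 100K+ splits even a modest schema is repeated 100K+ times. How the actual fix avoids the repetition is in the attached patches, not in this sketch.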