[ https://issues.apache.org/jira/browse/HIVE-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mostafa Mokhtar updated HIVE-8291:
----------------------------------
Description:
Reading from bucketed partitioned tables has significantly higher overhead compared to non-bucketed, non-partitioned files.

50% of the time is spent in these two lines of code in OrcInputFormat.getReader():
{code}
String txnString = conf.get(ValidTxnList.VALID_TXNS_KEY, Long.MAX_VALUE + ":");
ValidTxnList validTxnList = new ValidTxnListImpl(txnString);
{code}
{code}
Stack Trace                                                                                    Sample Count   Percentage(%)
hive.ql.exec.tez.MapRecordSource.pushRecord()                                                  2,981          87.215
org.apache.tez.mapreduce.lib.MRReaderMapred.next()                                             2,002          58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(Object, Object)     2,002          58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader()   1,984          58.046
hive.ql.io.HiveInputFormat.getRecordReader(InputSplit, JobConf, Reporter)                      1,983          58.016
hive.ql.io.orc.OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter)                   1,891          55.325
hive.ql.io.orc.OrcInputFormat.getReader(InputSplit, AcidInputFormat$Options)                   1,723          50.41
hive.common.ValidTxnListImpl.<init>(String)                                                    934            27.326
conf.Configuration.get(String, String)                                                         621            18.169
{code}

was:
Reading from bucketed partitioned tables has significantly higher overhead compared to non-bucketed, non-partitioned files.
50% of the time is spent in these two lines of code in OrcInputFormat.getReader():
{code}
String txnString = conf.get(ValidTxnList.VALID_TXNS_KEY, Long.MAX_VALUE + ":");
ValidTxnList validTxnList = new ValidTxnListImpl(txnString);
{code}
{code}
Stack Trace                                                                                    Sample Count   Percentage(%)
hive.ql.exec.tez.MapRecordSource.pushRecord()                                                  2,981          87.215
org.apache.tez.mapreduce.lib.MRReaderMapred.next()                                             2,002          58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(Object, Object)     2,002          58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader()   1,984          58.046
hive.ql.io.HiveInputFormat.getRecordReader(InputSplit, JobConf, Reporter)                      1,983          58.016
hive.ql.io.orc.OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter)                   1,891          55.325
hive.ql.io.orc.OrcInputFormat.getReader(InputSplit, AcidInputFormat$Options)                   1,723          50.41
hive.common.ValidTxnListImpl.<init>(String)                                                    934            27.326
conf.Configuration.get(String, String)                                                         621            18.169
{code}

Another 20% of the profile is spent in MapOperator.cleanUpInputFileChangedOp: 5% of the CPU in
{code}
Path onepath = normalizePath(onefile);
{code}
and 15% of the CPU in
{code}
onepath.toUri().relativize(fpath.toUri()).equals(fpath.toUri());
{code}
From the profiler:
{code}
Stack Trace                                                                  Sample Count   Percentage(%)
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(Object)        978            28.613
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(Writable)    978            28.613
org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged()            866            25.336
org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp()       866            25.336
java.net.URI.relativize(URI)                                                 655            19.163
java.net.URI.relativize(URI, URI)                                            655            19.163
java.net.URI.normalize(String)                                               517            15.126
java.net.URI.needsNormalization(String)                                      372            10.884
java.lang.String.charAt(int)                                                 235            6.875
java.net.URI.equal(String, String)                                           27             0.79
java.lang.StringBuilder.toString()                                           1              0.029
java.lang.StringBuilder.<init>()                                             1              0.029
java.lang.StringBuilder.append(String)                                       1              0.029
org.apache.hadoop.hive.ql.exec.MapOperator.normalizePath(String)             167            4.886
org.apache.hadoop.fs.Path.<init>(String)                                     162            4.74
org.apache.hadoop.fs.Path.initialize(String, String, String, String)         162            4.74
org.apache.hadoop.fs.Path.normalizePath(String, String)                      97             2.838
org.apache.commons.lang.StringUtils.replace(String, String, String)          97             2.838
org.apache.commons.lang.StringUtils.replace(String, String, String, int)     97             2.838
java.lang.String.indexOf(String, int)                                        97             2.838
java.net.URI.<init>(String, String, String, String, String)                  65             1.902
{code}


> ACID : Reading from partitioned bucketed tables has high overhead, 50% of time is spent in OrcInputFormat.getReader
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8291
>                 URL: https://issues.apache.org/jira/browse/HIVE-8291
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.14.0
>         Environment: cn105
>            Reporter: Mostafa Mokhtar
>            Assignee: Owen O'Malley
>             Fix For: 0.14.0
>
>         Attachments: 2014_09_28_16_48_48.jfr
>
>
> Reading from bucketed partitioned tables has significantly higher overhead compared to non-bucketed, non-partitioned files.
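The profile above shows that the two hot lines in getReader() re-read the configuration and re-parse the transaction-list string for every split. One possible mitigation is to parse each distinct string only once and reuse the result across splits. Below is a minimal, self-contained sketch of that idea; `CachedTxnList` and its `parse` stub are hypothetical stand-ins (not Hive APIs), where `parse` models only the high-watermark portion of the `"<highWatermark>:<exceptions>"` format:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache the parsed transaction list keyed by its
// serialized form, so repeated reader creation parses it only once.
public class CachedTxnList {
    // txnString -> parsed high-watermark (stands in for ValidTxnListImpl)
    private static final Map<String, Long> CACHE = new ConcurrentHashMap<>();

    // Stand-in for the ValidTxnListImpl(String) constructor: parse the
    // leading "<highWatermark>:" prefix of the serialized form.
    static long parse(String txnString) {
        int colon = txnString.indexOf(':');
        return Long.parseLong(txnString.substring(0, colon));
    }

    // Parse once per distinct txnString instead of once per split.
    public static long get(String txnString) {
        return CACHE.computeIfAbsent(txnString, CachedTxnList::parse);
    }

    public static void main(String[] args) {
        String txnString = Long.MAX_VALUE + ":"; // same default as the hot line
        System.out.println(get(txnString));      // first call parses
        System.out.println(get(txnString));      // second call is a cache hit
    }
}
```

This trades a single concurrent-map lookup per split for the `Configuration.get` plus constructor cost that dominates the profile; the actual fix in Hive may differ.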
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
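The second profile shows that most of the MapOperator.cleanUpInputFileChangedOp cost comes from java.net.URI.relativize, which re-normalizes both URIs on every input-file change. A minimal sketch of the cheaper pattern that cost structure suggests: normalize each configured input directory once up front, then do a plain string-prefix containment check per file. The class `PathMatchSketch` and its method names are hypothetical, not Hive code:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: replace a per-file URI.relativize(...) containment
// test with one-time normalization plus a cheap string-prefix comparison.
public class PathMatchSketch {
    // Each configured input directory, normalized exactly once.
    private final Map<String, String> normalized = new HashMap<>();

    void addInputDir(String dir) {
        // URI.normalize() collapses "." and ".." segments here, a single
        // time, instead of on every input-file change.
        String n = URI.create(dir).normalize().toString();
        normalized.put(dir, n.endsWith("/") ? n : n + "/");
    }

    // Per-file check: no URI allocation, no normalization, just startsWith.
    boolean covers(String dir, String filePath) {
        String prefix = normalized.get(dir);
        return prefix != null && filePath.startsWith(prefix);
    }

    public static void main(String[] args) {
        PathMatchSketch m = new PathMatchSketch();
        m.addInputDir("hdfs://nn/warehouse/t1/");
        System.out.println(m.covers("hdfs://nn/warehouse/t1/",
                "hdfs://nn/warehouse/t1/part=1/bucket_00000")); // true
    }
}
```

This assumes the candidate file paths are already in normalized form (as split paths from the metastore typically are); if they are not, the prefix check can give a different answer than the original relativize-based test.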