[ https://issues.apache.org/jira/browse/HIVE-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mostafa Mokhtar updated HIVE-8291:
----------------------------------
Description:
Reading from bucketed partitioned tables has significantly higher overhead compared to non-bucketed, non-partitioned files.

50% of the time is spent in these two lines of code in OrcInputFormat.getReader():
{code}
String txnString = conf.get(ValidTxnList.VALID_TXNS_KEY, Long.MAX_VALUE + ":");
ValidTxnList validTxnList = new ValidTxnListImpl(txnString);
{code}
{code}
Stack Trace                                                                                    Sample Count   Percentage(%)
hive.ql.exec.tez.MapRecordSource.pushRecord()                                                  2,981          87.215
org.apache.tez.mapreduce.lib.MRReaderMapred.next()                                             2,002          58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(Object, Object)     2,002          58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader()   1,984          58.046
hive.ql.io.HiveInputFormat.getRecordReader(InputSplit, JobConf, Reporter)                      1,983          58.016
hive.ql.io.orc.OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter)                   1,891          55.325
hive.ql.io.orc.OrcInputFormat.getReader(InputSplit, AcidInputFormat$Options)                   1,723          50.41
hive.common.ValidTxnListImpl.<init>(String)                                                    934            27.326
conf.Configuration.get(String, String)                                                         621            18.169
{code}

was:
Reading from bucketed partitioned tables has significantly higher overhead compared to non-bucketed, non-partitioned files.
50% of the time is spent in these two lines of code in OrcInputFormat.getReader():
{code}
String txnString = conf.get(ValidTxnList.VALID_TXNS_KEY, Long.MAX_VALUE + ":");
ValidTxnList validTxnList = new ValidTxnListImpl(txnString);
{code}
{code}
Stack Trace                                                                                    Sample Count   Percentage(%)
hive.ql.exec.tez.MapRecordSource.pushRecord()                                                  2,981          87.215
org.apache.tez.mapreduce.lib.MRReaderMapred.next()                                             2,002          58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(Object, Object)     2,002          58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader()   1,984          58.046
hive.ql.io.HiveInputFormat.getRecordReader(InputSplit, JobConf, Reporter)                      1,983          58.016
hive.ql.io.orc.OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter)                   1,891          55.325
hive.ql.io.orc.OrcInputFormat.getReader(InputSplit, AcidInputFormat$Options)                   1,723          50.41
hive.common.ValidTxnListImpl.<init>(String)                                                    934            27.326
conf.Configuration.get(String, String)                                                         621            18.169
{code}

Another 20% of the profile is spent in MapOperator.cleanUpInputFileChangedOp: 5% of the CPU in
{code}
Path onepath = normalizePath(onefile);
{code}
and 15% of the CPU in
{code}
onepath.toUri().relativize(fpath.toUri()).equals(fpath.toUri());
{code}
From the profiler:
{code}
Stack Trace                                                                  Sample Count   Percentage(%)
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(Object)        978            28.613
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(Writable)    978            28.613
org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged()            866            25.336
org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp()       866            25.336
java.net.URI.relativize(URI)                                                 655            19.163
java.net.URI.relativize(URI, URI)                                            655            19.163
java.net.URI.normalize(String)                                               517            15.126
java.net.URI.needsNormalization(String)                                      372            10.884
java.lang.String.charAt(int)                                                 235            6.875
java.net.URI.equal(String, String)                                           27             0.79
java.lang.StringBuilder.toString()                                           1              0.029
java.lang.StringBuilder.<init>()                                             1              0.029
java.lang.StringBuilder.append(String)                                       1              0.029
org.apache.hadoop.hive.ql.exec.MapOperator.normalizePath(String)             167            4.886
org.apache.hadoop.fs.Path.<init>(String)                                     162            4.74
org.apache.hadoop.fs.Path.initialize(String, String, String, String)         162            4.74
org.apache.hadoop.fs.Path.normalizePath(String, String)                      97             2.838
org.apache.commons.lang.StringUtils.replace(String, String, String)          97             2.838
org.apache.commons.lang.StringUtils.replace(String, String, String, int)     97             2.838
java.lang.String.indexOf(String, int)                                        97             2.838
java.net.URI.<init>(String, String, String, String, String)                  65             1.902
{code}


> ACID : Reading from partitioned bucketed tables has high overhead, 50% of time is spent in OrcInputFormat.getReader
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8291
>                 URL: https://issues.apache.org/jira/browse/HIVE-8291
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.14.0
>         Environment: cn105
>            Reporter: Mostafa Mokhtar
>            Assignee: Owen O'Malley
>             Fix For: 0.14.0
>
>         Attachments: 2014_09_28_16_48_48.jfr
>
>
> Reading from bucketed partitioned tables has significantly higher overhead compared to non-bucketed, non-partitioned files.
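The profile above shows that the two hot lines in getReader() re-read the configuration and re-parse the transaction-list string for every split. One possible mitigation is to parse each distinct string only once and reuse the result across splits. Below is a minimal, self-contained sketch of that idea; `CachedTxnList` and its `parse` stub are hypothetical stand-ins (not Hive APIs), where `parse` models only the high-watermark portion of the `"<highWatermark>:<exceptions>"` format:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache the parsed transaction list keyed by its
// serialized form, so repeated reader creation parses it only once.
public class CachedTxnList {
    // txnString -> parsed high-watermark (stands in for ValidTxnListImpl)
    private static final Map<String, Long> CACHE = new ConcurrentHashMap<>();

    // Stand-in for the ValidTxnListImpl(String) constructor: parse the
    // leading "<highWatermark>:" prefix of the serialized form.
    static long parse(String txnString) {
        int colon = txnString.indexOf(':');
        return Long.parseLong(txnString.substring(0, colon));
    }

    // Parse once per distinct txnString instead of once per split.
    public static long get(String txnString) {
        return CACHE.computeIfAbsent(txnString, CachedTxnList::parse);
    }

    public static void main(String[] args) {
        String txnString = Long.MAX_VALUE + ":"; // same default as the hot line
        System.out.println(get(txnString));      // first call parses
        System.out.println(get(txnString));      // second call is a cache hit
    }
}
```

This trades a single concurrent-map lookup per split for the `Configuration.get` plus constructor cost that dominates the profile; the actual fix in Hive may differ.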
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
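The second profile shows that most of the MapOperator.cleanUpInputFileChangedOp cost comes from java.net.URI.relativize, which re-normalizes both URIs on every input-file change. A minimal sketch of the cheaper pattern that cost structure suggests: normalize each configured input directory once up front, then do a plain string-prefix containment check per file. The class `PathMatchSketch` and its method names are hypothetical, not Hive code:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: replace a per-file URI.relativize(...) containment
// test with one-time normalization plus a cheap string-prefix comparison.
public class PathMatchSketch {
    // Each configured input directory, normalized exactly once.
    private final Map<String, String> normalized = new HashMap<>();

    void addInputDir(String dir) {
        // URI.normalize() collapses "." and ".." segments here, a single
        // time, instead of on every input-file change.
        String n = URI.create(dir).normalize().toString();
        normalized.put(dir, n.endsWith("/") ? n : n + "/");
    }

    // Per-file check: no URI allocation, no normalization, just startsWith.
    boolean covers(String dir, String filePath) {
        String prefix = normalized.get(dir);
        return prefix != null && filePath.startsWith(prefix);
    }

    public static void main(String[] args) {
        PathMatchSketch m = new PathMatchSketch();
        m.addInputDir("hdfs://nn/warehouse/t1/");
        System.out.println(m.covers("hdfs://nn/warehouse/t1/",
                "hdfs://nn/warehouse/t1/part=1/bucket_00000")); // true
    }
}
```

This assumes the candidate file paths are already in normalized form (as split paths from the metastore typically are); if they are not, the prefix check can give a different answer than the original relativize-based test.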