[ 
https://issues.apache.org/jira/browse/HIVE-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058005#comment-17058005
 ] 

Panagiotis Garefalakis commented on HIVE-23014:
-----------------------------------------------

Thanks for the extra details, [~petertoth].
I have a feeling that the included columns option is not properly set for the 
OrcReader, so it ends up reading the whole dataset.
For instance, with 200 columns the runtime is 2x that with 100 columns, 
and similarly with 300 columns it is 3x (while it should read just 1 
column each time).
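
To illustrate the suspicion above, here is a minimal, self-contained sketch (not Hive's actual code). It assumes a flat schema where index 0 of the include mask is the root struct and top-level column i maps to index i + 1, and it assumes that a missing (null) mask means "read every column". The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class IncludeSketch {
    // Hypothetical helper: builds an include mask for a flat schema.
    // Index 0 is the root struct; top-level column i maps to index i + 1.
    public static boolean[] buildInclude(int numColumns, List<Integer> projected) {
        boolean[] include = new boolean[numColumns + 1];
        include[0] = true; // the root struct is always read
        for (int col : projected) {
            include[col + 1] = true;
        }
        return include;
    }

    // Counts the data columns a reader would decode.
    // Assumption of this sketch: a null mask means "read all columns".
    public static int columnsRead(int numColumns, boolean[] include) {
        if (include == null) {
            return numColumns;
        }
        int count = 0;
        for (int i = 1; i < include.length; i++) {
            if (include[i]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Project only the first column from tables of increasing width.
        for (int n : new int[] {100, 200, 300}) {
            boolean[] mask = buildInclude(n, Arrays.asList(0));
            System.out.println(n + " columns: with mask=" + columnsRead(n, mask)
                + ", without mask=" + columnsRead(n, null));
        }
    }
}
```

If the mask were dropped somewhere on the way to the reader, the number of columns decoded would grow linearly with the table width, which would match the 2x and 3x runtimes above.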

I can also see that there are some major changes in the getIncludedColumns 
method in 2.3.6: 
[https://github.com/apache/hive/blob/2c2fdd524e8783f6e1f3ef15281cc2d5ed08728f/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L396]

cc: [~gopalv] [~ashutoshc] [~omalley]

> ORC reading performance
> -----------------------
>
>                 Key: HIVE-23014
>                 URL: https://issues.apache.org/jira/browse/HIVE-23014
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 2.3.6
>            Reporter: Peter Toth
>            Priority: Major
>         Attachments: OrcReadBenchmark-results.txt.hive-1.2.1, 
> OrcReadBenchmark-results.txt.hive-2.3.6
>
>
> Spark 3 adds support for using Hive 2.3.6 besides the old Hive 1.2.1 version. 
> Some of the ORC reading benchmarks show that there is a huge performance 
> difference in ORC reading between the 2 versions. I measured that 
> {{org.apache.hadoop.hive.ql.io.orc.ReaderImpl}} in hive-exec-2.3.6-core.jar 
> is ~3-5 times slower than in hive-exec-1.2.1.spark2.jar.
> I'm not sure if more recent Hive versions still suffer from this performance 
> regression.
> Please see some details here: SPARK-30565



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
