> On April 24, 2013, 11:01 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java, line 97
> > <https://reviews.apache.org/r/10712/diff/2/?file=284237#file284237line97>
> >
> > If there are no nulls in a stripe or split for a column, we should be able to do a fast code path that doesn't need this check and if-else.
> >
> > I haven't seen noNulls get set anywhere. What is the plan for setting noNulls as an optimization? That has a big performance impact in QE (about 30% time savings for filters and arithmetic).

This is being set in the parent class TreeReader::nextVector; a rough sketch of the idea is shown below.
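A minimal illustration of the idea, using hypothetical names (PresentBitReader, NoNullsSketch, and readPresentBits are stand-ins for this note, not classes or methods from the patch): the parent fills isNull for the batch from the present (null) bits and derives noNulls in one pass, so a child reader such as RunLengthIntegerReader can check noNulls once and skip the per-row null test.

    import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;

    public class NoNullsSketch {

      /** Hypothetical stand-in for ORC's present-bit (null) stream reader. */
      interface PresentBitReader {
        /** Returns true if the next value is present, i.e. not null. */
        boolean next();
      }

      /**
       * Fill isNull for batchSize rows and derive noNulls in one pass, so a
       * child reader can take the no-null fast path when the flag is set.
       */
      static void readPresentBits(PresentBitReader present, ColumnVector result,
                                  int batchSize) {
        boolean anyNull = false;
        for (int i = 0; i < batchSize; i++) {
          boolean isPresent = present.next();
          result.isNull[i] = !isPresent;
          anyNull |= !isPresent;
        }
        // One flag per batch instead of one branch per row.
        result.noNulls = !anyNull;
      }
    }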
> On April 24, 2013, 11:01 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 1486
> > <https://reviews.apache.org/r/10712/diff/2/?file=284236#file284236line1486>
> >
> > I don't understand this. map and struct are not supported yet, so I think this should be unimplemented.

A table is represented as a struct in ORC, so this is required.


> On April 24, 2013, 11:01 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 1029
> > <https://reviews.apache.org/r/10712/diff/2/?file=284236#file284236line1029>
> >
> > The plan was to not support struct yet, but later, to support a field of a struct just like it was a regular column. Struct field access would just be a naming convention.
> >
> > A query might not access every field of a struct. This reads every field of the struct.
> >
> > I think probably we should leave this unimplemented and then come back and do it later using the naming-convention technique.

A table is represented as a struct in ORC, so this is required. We are not reading all the columns of the table/struct; the ORC record reader reads only the columns that are required. RecordReaderImpl::readStripe() is the method in ORC that does this.


> On April 24, 2013, 11:01 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 173
> > <https://reviews.apache.org/r/10712/diff/2/?file=284236#file284236line173>
> >
> > I recommend this method take and return a ColumnVector instead of an Object, since I don't think it would ever make sense to not take a ColumnVector subtype.
> >
> > This applies to all nextVector methods.

The reason this method returns an Object is that for struct tree readers the return value is a ColumnVector[], not a ColumnVector. Similarly, each of the complex data type readers can opt to return a different object type; a hypothetical sketch of this shape follows.
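To make the Object-vs-ColumnVector point concrete, here is a simplified, hypothetical reader hierarchy (not the patch's actual TreeReader classes): a scalar reader fills and returns a single ColumnVector, while a struct reader fans out to its child readers and returns a ColumnVector[].

    import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;

    /** Hypothetical hierarchy illustrating why nextVector returns Object. */
    abstract class SketchTreeReader {
      /** Returns a ColumnVector for scalar readers, a ColumnVector[] for struct readers. */
      abstract Object nextVector(Object previous, long batchSize);
    }

    class SketchLongReader extends SketchTreeReader {
      @Override
      Object nextVector(Object previous, long batchSize) {
        LongColumnVector result = (previous == null)
            ? new LongColumnVector((int) batchSize)
            : (LongColumnVector) previous;
        // ... decode batchSize values into result.vector ...
        return result;                      // a single ColumnVector
      }
    }

    class SketchStructReader extends SketchTreeReader {
      private final SketchTreeReader[] fields;

      SketchStructReader(SketchTreeReader[] fields) {
        this.fields = fields;
      }

      @Override
      Object nextVector(Object previous, long batchSize) {
        ColumnVector[] result = (previous == null)
            ? new ColumnVector[fields.length]
            : (ColumnVector[]) previous;
        for (int i = 0; i < fields.length; i++) {
          // Assumes scalar field readers here for brevity; each reuses its slot.
          result[i] = (ColumnVector) fields[i].nextVector(result[i], batchSize);
        }
        return result;                      // an array, not a single ColumnVector
      }
    }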
> On April 24, 2013, 11:01 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 1475
> > <https://reviews.apache.org/r/10712/diff/2/?file=284236#file284236line1475>
> >
> > Put a javadoc comment describing the method.

The javadoc for this method is at org.apache.hadoop.hive.ql.io.orc.RecordReader.


- Sarvesh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10712/#review19674
-----------------------------------------------------------


On April 24, 2013, 9:53 p.m., Sarvesh Sakalanaga wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10712/
> -----------------------------------------------------------
> 
> (Updated April 24, 2013, 9:53 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> The patch contains changes to the ORC reader to return a batch of rows instead of a single row. A new method called nextBatch() is added to the ORC reader and to the tree readers of ORC. Currently only int, long, short, double, float, string, and struct columns support batch processing.
> 
> 
> This addresses bug HIVE-4370.
>     https://issues.apache.org/jira/browse/HIVE-4370
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/vector/BytesColumnVector.java 246170d
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicByteArray.java fc4e53b
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java 05240ce
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java d044cd8
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java 2825c64
>   ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/10712/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sarvesh Sakalanaga
> 
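As a closing note on the nextBatch() API described in the quoted request: a caller pulls vectorized batches in a loop instead of individual rows. The sketch below is illustrative only; BatchedOrcReader is a hypothetical stand-in that mirrors the described method, and the real nextBatch() is the one the patch adds to org.apache.hadoop.hive.ql.io.orc.RecordReader, whose exact signature is defined there.

    import java.io.IOException;

    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

    public class NextBatchUsageSketch {

      /** Hypothetical stand-in mirroring the batched reader API described above. */
      interface BatchedOrcReader {
        boolean hasNext() throws IOException;
        /** Fills (or allocates) a batch of rows, reusing 'previous' when possible. */
        VectorizedRowBatch nextBatch(VectorizedRowBatch previous) throws IOException;
      }

      /** Sums column 0 across all batches, assuming it is a long column. */
      static long sumFirstColumn(BatchedOrcReader reader) throws IOException {
        long sum = 0;
        VectorizedRowBatch batch = null;
        while (reader.hasNext()) {
          batch = reader.nextBatch(batch);        // a batch of rows, not one row
          LongColumnVector col = (LongColumnVector) batch.cols[0];
          for (int r = 0; r < batch.size; r++) {
            if (col.noNulls || !col.isNull[r]) {  // noNulls enables the fast path
              sum += col.vector[r];
            }
          }
        }
        return sum;
      }
    }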