Re: Review Request: Change ORC tree readers to return batches of rows instead of a row

Sarvesh Sakalanaga Wed, 24 Apr 2013 14:53:32 -0700


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java, line 47
> > <https://reviews.apache.org/r/10712/diff/1/?file=283263#file283263line47>
> >
> >     the previous argument and the return value are always of class 
> > VectorizedRowBatch so why not declare them to be VectorizedRowBatch instead 
> > of Object?


I wanted to keep the interface same for both next and nextBatch, but it makes 
more sense to change it to VectorizedRowBatch. 


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 187
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line187>
> >
> >     if noNulls is false, that means there are nulls, so you can't set 
> > noNulls back to true, can you?

Change the logic a bit based on Scott's comments.It should look good now.


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 336
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line336>
> >
> >     You should never have to create a new column vector here because there 
> > should always be one passed down into this code as part of the batch.
> >     
> >     Consider assuming there will always be a vector. The code will just 
> > fail if there isn't one.

The code is flexible enough to work either ways. If a vector is passed in it 
use it else will create a new one. This is the pattern followed in ORC reader 
and would like to keep it as such.


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 342
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line342>
> >
> >     It appears you are replacing one isNull array (the one passed in) with 
> > another. We should be writing null information into the same Boolean array 
> > over and over. Am I missing something?

Changed a logic a bit and now the vector is being reused.


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 393
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line393>
> >
> >     this code for IntTreeReader appears identical to the ShortTreeReader 
> > case. Can the code be shared, e.g. by inheriting from super-class?
> >     
> >     Since this is a vector, consider calling it nextVector

Creating one more class to abstract out the code creates two layers of 
redirection. Given that nextBatch method in TreeReaders have a dependency on 
the super class I would rather leave it as such.  

I will change the function name from nextBatch to nextVector as it makes more 
sense.


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 452
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line452>
> >
> >     same comment -- can the code be shared for all int types?

As as above.


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 531
> > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line531>
> >
> >     I think this could incorrectly decide something isRepeating if there 
> > are nulls, the way the code is now. But using NaN for nulls should prevent 
> > this. 
> >     
> >     I think this isRepeating check is probably worth it for performance but 
> > it might not really pay off. It might be worth a performance test sometime 
> > in the future to determine this.

Yep we can do that once the end-end pipeline is build. I have fixed the NaN 
case and now isRepeating should be set correctly.


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java, 
> > line 105
> > <https://reviews.apache.org/r/10712/diff/1/?file=283266#file283266line105>
> >
> >     Should create the batch up here, not use a null batch. 
> >     
> >     Or at least have a second test that creates the batch outside and 
> > passes it down.

This is fine as the next call to nextBatch passes a created batch down.


> On April 23, 2013, 7:33 p.m., Eric Hanson wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java, 
> > line 125
> > <https://reviews.apache.org/r/10712/diff/1/?file=283266#file283266line125>
> >
> >     should add another test with some null values

Change the logic of the same test to pass in null.


- Sarvesh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10712/#review19568
-----------------------------------------------------------


On April 24, 2013, 9:53 p.m., Sarvesh Sakalanaga wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10712/
> -----------------------------------------------------------
> 
> (Updated April 24, 2013, 9:53 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> The patch contains changes to ORC reader to return a batch of rows instead of 
> a row. A new method called nextBatch() is added to ORC reader and tree 
> readers of ORC. Currently only int,long,short,double,float,string and struct 
> support batch processing.
> 
> 
> This addresses bug HIVE-4370.
>     https://issues.apache.org/jira/browse/HIVE-4370
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/vector/BytesColumnVector.java 
> 246170d 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicByteArray.java fc4e53b 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java 05240ce 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java d044cd8 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java 
> 2825c64 
>   ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/10712/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sarvesh Sakalanaga
> 
>

Re: Review Request: Change ORC tree readers to return batches of rows instead of a row

Reply via email to