> On April 23, 2013, 7:33 p.m., Eric Hanson wrote: > > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java, line 47 > > <https://reviews.apache.org/r/10712/diff/1/?file=283263#file283263line47> > > > > the previous argument and the return value are always of class > > VectorizedRowBatch so why not declare them to be VectorizedRowBatch instead > > of Object?
I wanted to keep the interface same for both next and nextBatch, but it makes more sense to change it to VectorizedRowBatch. > On April 23, 2013, 7:33 p.m., Eric Hanson wrote: > > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 187 > > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line187> > > > > if noNulls is false, that means there are nulls, so you can't set > > noNulls back to true, can you? Change the logic a bit based on Scott's comments.It should look good now. > On April 23, 2013, 7:33 p.m., Eric Hanson wrote: > > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 336 > > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line336> > > > > You should never have to create a new column vector here because there > > should always be one passed down into this code as part of the batch. > > > > Consider assuming there will always be a vector. The code will just > > fail if there isn't one. The code is flexible enough to work either ways. If a vector is passed in it use it else will create a new one. This is the pattern followed in ORC reader and would like to keep it as such. > On April 23, 2013, 7:33 p.m., Eric Hanson wrote: > > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 342 > > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line342> > > > > It appears you are replacing one isNull array (the one passed in) with > > another. We should be writing null information into the same Boolean array > > over and over. Am I missing something? Changed a logic a bit and now the vector is being reused. > On April 23, 2013, 7:33 p.m., Eric Hanson wrote: > > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 393 > > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line393> > > > > this code for IntTreeReader appears identical to the ShortTreeReader > > case. Can the code be shared, e.g. by inheriting from super-class? > > > > Since this is a vector, consider calling it nextVector Creating one more class to abstract out the code creates two layers of redirection. Given that nextBatch method in TreeReaders have a dependency on the super class I would rather leave it as such. I will change the function name from nextBatch to nextVector as it makes more sense. > On April 23, 2013, 7:33 p.m., Eric Hanson wrote: > > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 452 > > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line452> > > > > same comment -- can the code be shared for all int types? As as above. > On April 23, 2013, 7:33 p.m., Eric Hanson wrote: > > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java, line 531 > > <https://reviews.apache.org/r/10712/diff/1/?file=283264#file283264line531> > > > > I think this could incorrectly decide something isRepeating if there > > are nulls, the way the code is now. But using NaN for nulls should prevent > > this. > > > > I think this isRepeating check is probably worth it for performance but > > it might not really pay off. It might be worth a performance test sometime > > in the future to determine this. Yep we can do that once the end-end pipeline is build. I have fixed the NaN case and now isRepeating should be set correctly. > On April 23, 2013, 7:33 p.m., Eric Hanson wrote: > > ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java, > > line 105 > > <https://reviews.apache.org/r/10712/diff/1/?file=283266#file283266line105> > > > > Should create the batch up here, not use a null batch. > > > > Or at least have a second test that creates the batch outside and > > passes it down. This is fine as the next call to nextBatch passes a created batch down. > On April 23, 2013, 7:33 p.m., Eric Hanson wrote: > > ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java, > > line 125 > > <https://reviews.apache.org/r/10712/diff/1/?file=283266#file283266line125> > > > > should add another test with some null values Change the logic of the same test to pass in null. - Sarvesh ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/10712/#review19568 ----------------------------------------------------------- On April 24, 2013, 9:53 p.m., Sarvesh Sakalanaga wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/10712/ > ----------------------------------------------------------- > > (Updated April 24, 2013, 9:53 p.m.) > > > Review request for hive. > > > Description > ------- > > The patch contains changes to ORC reader to return a batch of rows instead of > a row. A new method called nextBatch() is added to ORC reader and tree > readers of ORC. Currently only int,long,short,double,float,string and struct > support batch processing. > > > This addresses bug HIVE-4370. > https://issues.apache.org/jira/browse/HIVE-4370 > > > Diffs > ----- > > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/BytesColumnVector.java > 246170d > ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicByteArray.java fc4e53b > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java 05240ce > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java d044cd8 > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java > 2825c64 > ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java > PRE-CREATION > > Diff: https://reviews.apache.org/r/10712/diff/ > > > Testing > ------- > > > Thanks, > > Sarvesh Sakalanaga > >