Re: Approaching Vectorized Reading in Iceberg ..

2019-07-31 Thread Gautam
> - `VectorizedSparkParquetReaders` contains the visitor implementations to map Parquet types to appropriate value readers. I implemented the struct visitor so that the root schem

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-31 Thread Anjali Norwood
> roup into a single Arrow Field Vector. This, I'd imagine, will require tuning for the right batch sizing, but I've gone with one batch per rowgroup for now. - A

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-31 Thread Gautam
> e with one batch per rowgroup for now. - Arrow Field Vectors are wrapped using `ArrowColumnVector`, which is Spark's ColumnVector implementation backed by

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-31 Thread Gautam
> nternalRowReader` which maps Structs to Columnar Batches. This allows us to have nested structs where each level of nesting would be a nested columnar batch.

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-30 Thread Samarth Jain
> 've added value readers for all supported primitive types listed in `AvroDataTest`. There's a corresponding test for the vectorized reader under `TestSparkParquetVectorizedReader`

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-26 Thread Gautam
> under `TestSparkParquetVectorizedReader` - I haven't fixed all the Checkstyle errors, so you will have to turn Checkstyle off in build.gradle. Also skip tests while building.. sorry! :-(

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-26 Thread Daniel Weeks
> not used. This was from my previous impl of Vectorization. I've kept it around to compare performance. Lemme know what folks think of the approach. I'm getting this working for ou

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-26 Thread Daniel Weeks
> [1] - https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP [2] - https://github.com/apache/incubator-iceberg/compare/master...prodeezy:iss

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-26 Thread Gautam
gt;>> [1] - >>> https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP >>> [2] - >>> https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP >>> [3] - >>> https://github.com/apache/incu

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-26 Thread Gautam
> [3] - https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java On Mon, Jul 22, 2019 at 2:33 PM Gautam wrote: > Will do. Doing a bit of ho

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-24 Thread Daniel Weeks
> or-iceberg/tree/issue-9-support-arrow-based-reading-WIP [2] - https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP [3] - https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbe

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-24 Thread Gautam
> /arrow/ArrowSchemaUtil.java On Mon, Jul 22, 2019 at 2:33 PM Gautam wrote: > Will do. Doing a bit of housekeeping on the code and also adding more primitive type support. On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-24 Thread Ryan Blue
> primitive type support. On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah wrote: > Would it be possible to put the work in progress code in open source? *From: *Gautam *Reply-To: *"dev@ice

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-23 Thread Gautam
> On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah wrote: > Would it be possible to put the work in progress code in open source? *From: *Gautam *Reply-To: *"dev@iceberg.apache.org" *Date: *Monday, July 22, 2019 at 9:46 AM

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-22 Thread Gautam
> rg.apache.org" *Date: *Monday, July 22, 2019 at 9:46 AM *To: *Daniel Weeks *Cc: *Ryan Blue, Iceberg Dev List <dev@iceberg.apache.org> *Subject: *Re: Approaching Vectorized Reading in Iceberg .. That would be great! On Mon, Jul

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-22 Thread Matt Cheah
Would it be possible to put the work in progress code in open source? From: Gautam Reply-To: "dev@iceberg.apache.org" Date: Monday, July 22, 2019 at 9:46 AM To: Daniel Weeks Cc: Ryan Blue, Iceberg Dev List Subject: Re: Approaching Vectorized Reading in Iceberg .. That woul

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-22 Thread Gautam
That would be great! On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks wrote: > Hey Gautam, We also have a couple people looking into vectorized reading (into Arrow memory). I think it would be good for us to get together and see if we can collaborate on a common approach for this. I'll

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-22 Thread Daniel Weeks
Hey Gautam, We also have a couple people looking into vectorized reading (into Arrow memory). I think it would be good for us to get together and see if we can collaborate on a common approach for this. I'll reach out directly and see if we can get together. -Dan On Sun, Jul 21, 2019 at 10:35

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-21 Thread Gautam
Figured this out. I'm returning the ColumnarBatch iterator directly, without projection, with the schema set appropriately in `readSchema()`. The empty result was due to valuesRead not being set correctly on FileIterator. Did that and things are working. Will circle back with numbers soon. On Fri, Jul 19,

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-19 Thread Gautam
Hey Guys, Sorry about the delay on this. Just got back to getting a basic working implementation in Iceberg for Vectorization on primitive types. *Here's what I have so far:* I have added `ParquetValueReader` implementations for some basic primitive types that build the respective Ar

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-21 Thread Ryan Blue
The vectorized reader in Spark is only used if the schema is flat. On Fri, Jun 14, 2019 at 5:45 PM Gautam wrote: > Agree with the approach of getting this working for primitive types only. I'll work on a prototype assuming just primitive types for now. > I don't think that you can mix regul

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-14 Thread Gautam
Agree with the approach of getting this working for primitive types only. I'll work on a prototype assuming just primitive types for now. > I don't think that you can mix regular columns and Arrow columns. It has to be all one or the other. I was just curious about this coz Vanilla Spark reader (

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-14 Thread Ryan Blue
Replies inline. On Fri, Jun 14, 2019 at 1:11 AM Gautam wrote: > Thanks for responding Ryan, Couple of follow up questions on ParquetValueReader for Arrow.. I'd like to start with testing Arrow out with readers for primitive type and incrementally add in Struct/Array support, also Arrow

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-14 Thread Gautam
Thanks for responding Ryan, Couple of follow up questions on ParquetValueReader for Arrow.. I'd like to start with testing Arrow out with readers for primitive type and incrementally add in Struct/Array support, also ArrowWriter [1] currently doesn't have converters for map type. How can I defaul

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-14 Thread Gautam
Hey Anton, Here's the code: https://github.com/prodeezy/incubator-iceberg/pull/2/files .. Mind you, it's just a proof of concept to get something going so please ignore the code design (or lack thereof :-) ). I've attached the flat data benchmark code as well. Lemme know what you think.

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-13 Thread Anton Okolnychyi
Gautam, could you also share the code for benchmarks and conversion? Thanks, Anton > On 13 Jun 2019, at 19:38, Ryan Blue wrote: > Sounds like a good start. I think the next step is to avoid using the ParquetReader.FileIterator and deserialize directly from TripleIterator

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-13 Thread Ryan Blue
Sounds like a good start. I think the next step is to avoid using the ParquetReader.FileIterator and deserialize directly from TripleIterator. I think the reason why this i

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-12 Thread Gautam
Hey Ryan and Anton, I wanted to circle back on some findings I had after taking a first stab at this .. > There’s already a wrapper to adapt Arrow to ColumnarBatch, as well as an iterator to read a ColumnarBatch as a sequence of InternalRow. That’s what we want to take advantage of.. This

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Ryan Blue
Correct. On Tue, May 28, 2019 at 3:13 PM Anton Okolnychyi wrote: > Alright, so we are talking about reading Parquet data into ArrowRecordBatches and then exposing them as ColumnarBatches in Spark, where Spark ColumnVectors actually wrap Arrow FieldVectors, correct? - Anton > On 28 Ma

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Anton Okolnychyi
Alright, so we are talking about reading Parquet data into ArrowRecordBatches and then exposing them as ColumnarBatches in Spark, where Spark ColumnVectors actually wrap Arrow FieldVectors, correct? - Anton > On 28 May 2019, at 21:24, Ryan Blue wrote: > From a performance viewpoint, this is
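
The layering described in this message (columnar vectors, wrapped by column-vector views, grouped into a batch that can also be read row by row) can be sketched with plain Python stand-ins. This is only an illustration under stated assumptions: `FieldVector`, `ColumnVectorWrapper`, `ColumnarBatch` and `batch_from_vectors` below are hypothetical names that mimic the roles of Arrow's FieldVector, Spark's `ArrowColumnVector` and Spark's ColumnarBatch; they are not the real APIs, and only integer columns are modeled.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Sequence, Tuple


@dataclass
class FieldVector:
    """Stand-in for an Arrow FieldVector: one column's values plus a validity mask."""
    name: str
    values: List[int]
    validity: List[bool]  # False marks a null slot


class ColumnVectorWrapper:
    """Stand-in for Spark's ArrowColumnVector: a read-only view, the data is not copied."""

    def __init__(self, vector: FieldVector) -> None:
        self._vector = vector

    def is_null_at(self, row: int) -> bool:
        return not self._vector.validity[row]

    def get_int(self, row: int) -> int:
        return self._vector.values[row]


class ColumnarBatch:
    """Stand-in for Spark's ColumnarBatch: a list of column views and a row count."""

    def __init__(self, columns: Sequence[ColumnVectorWrapper], num_rows: int) -> None:
        self.columns = list(columns)
        self.num_rows = num_rows

    def row_iterator(self) -> Iterator[Tuple[Optional[int], ...]]:
        # Rows are produced lazily from the column views, one tuple per row.
        for row in range(self.num_rows):
            yield tuple(
                None if col.is_null_at(row) else col.get_int(row)
                for col in self.columns
            )


def batch_from_vectors(vectors: Sequence[FieldVector]) -> ColumnarBatch:
    """Wrap each vector in a view and group the views into one batch."""
    num_rows = len(vectors[0].values) if vectors else 0
    return ColumnarBatch([ColumnVectorWrapper(v) for v in vectors], num_rows)
```

The point of the sketch is that the batch holds views over the existing vectors rather than copies, and that the row-wise view is derived on demand from the columns.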

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Ryan Blue
From a performance viewpoint, this isn’t a great solution. The row-by-row approach will substantially hurt performance compared to the vectorized reader. I’ve seen 30% or more speedup when removing row-by-row access. So putting a row-by-row adapter in the middle of two vectorized representations
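
To make the concern concrete, here is a minimal sketch of the two data paths, assuming columns are plain Python lists of equal length. The function names `copy_via_rows` and `copy_columnwise` are hypothetical and the code says nothing about actual Spark or Arrow timings; it only shows where the per-cell work comes from when a row adapter sits between two columnar representations.

```python
from typing import Dict, List

Columns = Dict[str, List[int]]


def copy_via_rows(source: Columns) -> Columns:
    """Columnar -> rows -> columnar: every cell is touched individually."""
    names = list(source)
    num_rows = len(source[names[0]]) if names else 0
    target: Columns = {name: [] for name in names}
    for row in range(num_rows):
        # Build one row from the source columns ...
        record = tuple(source[name][row] for name in names)
        # ... then take it apart again to fill the target columns.
        for name, value in zip(names, record):
            target[name].append(value)
    return target


def copy_columnwise(source: Columns) -> Columns:
    """Columnar -> columnar: one bulk copy per column, no intermediate rows."""
    return {name: list(values) for name, values in source.items()}
```

Both functions return the same columns; the row-based path performs one append per cell plus an intermediate tuple per row, while the column-wise path does one bulk operation per column.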

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Owen O'Malley
On Fri, May 24, 2019 at 8:28 PM Ryan Blue wrote: > if Iceberg Reader was to wrap Arrow or ColumnarBatch behind an Iterator[InternalRow] interface, it would still not work right? Coz it seems to me there is a lot more going on upstream in the operator execution path that would be needed to b

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Anton Okolnychyi
> You’re right that the first thing that Spark does is to get each row as InternalRow. But we still get a benefit from vectorizing the data materialization to Arrow itself. Spark execution is not vectorized, but that can be updated in Spark later (I think there’s a proposal). I am not

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-24 Thread Gautam
> There’s already a wrapper to adapt Arrow to ColumnarBatch, as well as an iterator to read a ColumnarBatch as a sequence of InternalRow. That’s what we want to take advantage of. You’re right that the first thing that Spark does is to get each row as InternalRow. But we still get a benefit from ve

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-24 Thread Ryan Blue
if Iceberg Reader was to wrap Arrow or ColumnarBatch behind an Iterator[InternalRow] interface, it would still not work right? Coz it seems to me there is a lot more going on upstream in the operator execution path that would be needed to be done here. There’s already a wrapper to adapt Arrow to C