Got it. Commented out that module and it works. I was just curious why it doesn't work on the master branch either.
On Fri, Jul 26, 2019 at 3:49 PM Daniel Weeks <dwe...@netflix.com> wrote:

Actually, it looks like the issue is right there in the error: the ErrorProne module is excluded from the compile stages of the sub-projects here:
https://github.com/apache/incubator-iceberg/blob/master/build.gradle#L152

However, it is still being applied to the jmh tasks. I'm not familiar with this module, but you can run the benchmarks by commenting it out here:
https://github.com/apache/incubator-iceberg/blob/master/build.gradle#L167

We'll need to fix the build to disable it for the jmh tasks.

-Dan

On Fri, Jul 26, 2019 at 3:34 PM Daniel Weeks <dwe...@netflix.com> wrote:

Gautam, you need to have the jmh-core libraries available to run. I validated that PR, so I'm guessing I had them configured in my environment.

I assume there's a way to make that available within gradle, so I'll take a look.

On Fri, Jul 26, 2019 at 2:52 PM Gautam <gautamkows...@gmail.com> wrote:

This fails on master too, btw. Just wondering if I'm doing something wrong trying to run this.

On Fri, Jul 26, 2019 at 2:24 PM Gautam <gautamkows...@gmail.com> wrote:

I've been trying to run the jmh benchmarks bundled with the project and have been running into issues. Have others hit this? Am I running these incorrectly?

    bash-3.2$ ./gradlew :iceberg-spark:jmh \
        -PjmhIncludeRegex=IcebergSourceFlatParquetDataFilterBenchmark \
        -PjmhOutputPath=benchmark/iceberg-source-flat-parquet-data-filter-benchmark-result.txt
    ...
    > Task :iceberg-spark:jmhCompileGeneratedClasses FAILED
    error: plug-in not found: ErrorProne

    FAILURE: Build failed with an exception.

Is there a config/plugin I need to add to build.gradle?

On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue <rb...@netflix.com> wrote:

Thanks, Gautam!

We'll start taking a look at your code. What do you think about creating a branch in the Iceberg repository where we can work on improving it together, before merging it into master?

Also, you mentioned performance comparisons. Do you have any early results to share?

rb

On Tue, Jul 23, 2019 at 3:40 PM Gautam <gautamkows...@gmail.com> wrote:

Hello Folks,

I have checked in a WIP branch [1] with a working version of vectorized reads for the Iceberg reader. Here's the diff [2].

*Implementation Notes:*
- Iceberg's Reader adds the `SupportsScanColumnarBatch` mixin to instruct DataSourceV2ScanExec to use `planBatchInputPartitions()` instead of the usual `planInputPartitions()`, returning instances of `ColumnarBatch` on each iteration.
- `ArrowSchemaUtil` contains the Iceberg-to-Arrow schema conversion. This was copied from [3], added by Daniel Weeks <dwe...@netflix.com>. Thanks for that!
- `VectorizedParquetValueReaders` contains the ParquetValueReaders used for reading/decoding the Parquet row groups (aka page stores, as they are referred to in the code).
- `VectorizedSparkParquetReaders` contains the visitor implementations that map Parquet types to the appropriate value readers. I implemented the struct visitor so that the root schema can be mapped properly. This has the added benefit of vectorization support for structs, so yay!
- For the initial version the value readers read an entire row group into a single Arrow FieldVector. I'd imagine this will require tuning to find the right batch size, but I've gone with one batch per row group for now.
- Arrow FieldVectors are wrapped using `ArrowColumnVector`, which is Spark's ColumnVector implementation backed by Arrow. This is the first contact point between the Spark and Arrow interfaces.
- ArrowColumnVectors are stitched together into a `ColumnarBatch` by `ColumnarBatchReader` (a sketch of this wiring follows this message). This is my replacement for `InternalRowReader`, which maps structs to columnar batches. It allows us to have nested structs, where each level of nesting is a nested columnar batch. Let me know what you think of this approach.
- I've added value readers for all the supported primitive types listed in `AvroDataTest`. There's a corresponding test for the vectorized reader under `TestSparkParquetVectorizedReader`.
- I haven't fixed all the Checkstyle errors, so you will have to turn Checkstyle off in build.gradle. Also skip tests while building.. sorry! :-(

*P.S.* There's some unused code under ArrowReader.java; ignore it. It's from my previous vectorization implementation, which I've kept around to compare performance.

Let me know what folks think of the approach. I'm getting this working for our scale-test benchmark and will report back with numbers. Feel free to run your own benchmarks and share.

Cheers,
-Gautam.

[1] - https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
[2] - https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP
[3] - https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java
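A minimal sketch of the wiring the notes above describe — not the branch's actual ColumnarBatchReader, just the shape of it — assuming Spark 2.4's ArrowColumnVector/ColumnarBatch and Arrow's Java vector API; the single `id` column and the `batchFor` helper are illustrative:

    // Sketch: wrap decoded Arrow FieldVectors in Spark's Arrow-backed
    // ColumnVector and stitch them into one ColumnarBatch per row group.
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class ColumnarBatchSketch {
      public static ColumnarBatch batchFor(int rowsInGroup, BufferAllocator allocator) {
        // One FieldVector per column; a single int column here for brevity.
        IntVector ids = new IntVector("id", allocator);
        ids.allocateNew(rowsInGroup);
        for (int i = 0; i < rowsInGroup; i++) {
          ids.set(i, i); // in the real reader, values come from the Parquet column iterator
        }
        ids.setValueCount(rowsInGroup);

        // First contact point between Spark and Arrow: wrap each Arrow
        // vector in Spark's ColumnVector implementation backed by Arrow.
        ColumnVector[] columns = new ColumnVector[] { new ArrowColumnVector(ids) };

        // Stitch the column vectors together into a batch.
        ColumnarBatch batch = new ColumnarBatch(columns);
        batch.setNumRows(rowsInGroup);
        return batch;
      }

      public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
          ColumnarBatch batch = batchFor(1024, allocator);
          System.out.println(batch.numRows()); // 1024
          batch.close(); // releases the wrapped Arrow buffers
        }
      }
    }

Closing the batch releases the underlying Arrow buffers, which is why the reader, not Spark, owns the allocator.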
On Mon, Jul 22, 2019 at 2:33 PM Gautam <gautamkows...@gmail.com> wrote:

Will do. Doing a bit of housekeeping on the code and also adding more primitive type support.

On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah <mch...@palantir.com> wrote:

Would it be possible to put the work-in-progress code in open source?

*From:* Gautam <gautamkows...@gmail.com>
*Reply-To:* "dev@iceberg.apache.org" <dev@iceberg.apache.org>
*Date:* Monday, July 22, 2019 at 9:46 AM
*To:* Daniel Weeks <dwe...@netflix.com>
*Cc:* Ryan Blue <rb...@netflix.com>, Iceberg Dev List <dev@iceberg.apache.org>
*Subject:* Re: Approaching Vectorized Reading in Iceberg ..

That would be great!

On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks <dwe...@netflix.com> wrote:

Hey Gautam,

We also have a couple of people looking into vectorized reading (into Arrow memory). I think it would be good for us to get together and see if we can collaborate on a common approach for this.

I'll reach out directly and see if we can get together.

-Dan
On Sun, Jul 21, 2019 at 10:35 PM Gautam <gautamkows...@gmail.com> wrote:

Figured this out. I'm returning the ColumnarBatch iterator directly, without the projection, with the schema set appropriately in `readSchema()`. The empty result was due to valuesRead not being set correctly on the FileIterator. Did that and things are working. Will circle back with numbers soon.

On Fri, Jul 19, 2019 at 5:22 PM Gautam <gautamkows...@gmail.com> wrote:

Hey Guys,

Sorry about the delay on this. Just got back to getting a basic working implementation in Iceberg of vectorization on primitive types.

*Here's what I have so far:*

I have added `ParquetValueReader` implementations for some basic primitive types that build the respective Arrow vector (`ValueVector`), viz. `IntVector` for int, `VarCharVector` for strings, and so on. Underneath each value vector reader there are column iterators that read from the Parquet page stores (row groups) in chunks. These ValueVectors are lined up as `ArrowColumnVector`s (Spark's ColumnVector wrapper backed by Arrow) and stitched together using a `ColumnarBatchReader` (which, as the name suggests, wraps ColumnarBatches in the iterator). I've verified that these pieces work properly with the underlying interfaces. I've also made changes to Iceberg's `Reader` to implement `planBatchInputPartitions()` (adding the `SupportsScanColumnarBatch` mixin to the reader), so the reader now produces ColumnarBatch instances instead of InternalRow. Query planning works fine with these changes.

It fails during query execution, though. The bit it's currently failing at is this line of code:
https://github.com/apache/incubator-iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/Reader.java#L414

This code, I think, tries to apply the iterator's schema projection to the InternalRow instances. It seems to be tightly coupled to InternalRow, as Spark's Catalyst expressions implement UnsafeProjection for InternalRow only. If I take this out and just return the `Iterator<ColumnarBatch>` I built, it returns an empty result on the client. I'm guessing this is because Spark is unaware of the iterator's schema? There's a TODO in the code that says "*remove the projection by reporting the iterator's schema back to Spark*". Is there a simple way to communicate that to Spark for my new iterator? Any pointers on how to get around this?

Thanks and Regards,
-Gautam.
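The TODO above is about the DataSourceV2 contract: with the `SupportsScanColumnarBatch` mixin, `readSchema()` is how the iterator's schema is reported back to Spark, so no row-level UnsafeProjection is needed. A hedged sketch of that shape under Spark 2.4's API — this is not Iceberg's actual Reader, and `lazySchema()` and `tasks()` are hypothetical stand-ins for the real projection and task planning:

    // Sketch: report the batch iterator's schema to Spark via readSchema().
    import java.util.List;
    import org.apache.spark.sql.sources.v2.reader.InputPartition;
    import org.apache.spark.sql.sources.v2.reader.SupportsScanColumnarBatch;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    abstract class BatchReaderSketch implements SupportsScanColumnarBatch {
      @Override
      public StructType readSchema() {
        // Spark trusts this schema for the ColumnarBatches the partitions
        // return, so batches must carry exactly these columns, in this order.
        return lazySchema();
      }

      @Override
      public List<InputPartition<ColumnarBatch>> planBatchInputPartitions() {
        // One InputPartition per scan task; each partition's reader iterates
        // ColumnarBatches (e.g., one per Parquet row group).
        return tasks();
      }

      // Hypothetical helpers standing in for the real projection/planning.
      abstract StructType lazySchema();
      abstract List<InputPartition<ColumnarBatch>> tasks();
    }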
On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue <rb...@netflix.com> wrote:

Replies inline.

On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com> wrote:

> Thanks for responding, Ryan.
>
> A couple of follow-up questions on ParquetValueReader for Arrow.
>
> I'd like to start with testing Arrow out with readers for primitive types and incrementally add in Struct/Array support; also, ArrowWriter [1] currently doesn't have converters for the map type. How can I default these types to regular materialization while supporting Arrow-based reads for primitives?

We should look at what Spark does to handle maps.

I think we should get the prototype working with test cases that don't have maps, structs, or lists. Just getting primitives working is a good start and just won't hit these problems.

> Lemme know if this makes sense...
>
> - I extend PrimitiveReader (for Arrow) to load primitive types into ArrowColumnVectors of corresponding column types by iterating over the underlying ColumnIterator *n times*, where n is the size of the batch.

Sounds good to me. I'm not sure about extending vs. wrapping because I'm not too familiar with the Arrow APIs.

> - Reader.newParquetIterable() maps primitive column types to the newly added ArrowParquetValueReader, but for other types (nested types, etc.) uses the current *InternalRow*-based ValueReaders.

Sounds good for primitives, but I would just leave the nested types un-implemented for now.

> - Stitch the column vectors together to create a ColumnarBatch (since the *SupportsScanColumnarBatch* mixin currently expects this) .. *although I'm a bit lost on how the stitching of columns happens currently*? .. and how could the ArrowColumnVectors be stitched alongside regular columns that don't have Arrow-based support?

I don't think that you can mix regular columns and Arrow columns. It has to be all one or the other. That's why it's easier to start with primitives, then add structs, then lists, and finally maps.

> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so that DataSourceV2ScanExec starts using ColumnarBatch scans.

We will probably need two paths: one for columnar batches and one for row-based reads. That doesn't need to be done right away, and what you already have in your working copy makes sense as a start.

> That's a lot of questions! :-) but hope I'm making sense.
>
> -Gautam.
> [1] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala

--
Ryan Blue
Software Engineer
Netflix
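For reference, a hedged sketch of the batching loop from the first bullet in the exchange above: a primitive reader drains an underlying column iterator n times (n = batch size) into an Arrow vector, which is then wrapped for Spark. `ColumnIterator` is approximated here by a plain java.util.Iterator, and none of these names come from the Iceberg branch:

    // Sketch: fill an Arrow IntVector with up to batchSize values per call.
    import java.util.Iterator;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;

    class BatchedIntReaderSketch {
      private final Iterator<Integer> columnIterator; // stand-in for the Parquet column iterator
      private final BufferAllocator allocator;
      private final int batchSize;

      BatchedIntReaderSketch(Iterator<Integer> columnIterator, BufferAllocator allocator, int batchSize) {
        this.columnIterator = columnIterator;
        this.allocator = allocator;
        this.batchSize = batchSize;
      }

      /** Reads up to batchSize values into a fresh vector; returns null when exhausted. */
      ArrowColumnVector nextBatch() {
        if (!columnIterator.hasNext()) {
          return null;
        }
        IntVector vector = new IntVector("col", allocator);
        vector.allocateNew(batchSize);
        int count = 0;
        while (count < batchSize && columnIterator.hasNext()) {
          Integer value = columnIterator.next();
          if (value == null) {
            vector.setNull(count); // nulls map to Arrow validity bits
          } else {
            vector.set(count, value);
          }
          count++;
        }
        vector.setValueCount(count);
        return new ArrowColumnVector(vector);
      }
    }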