Got it. Commented out that module and it works. I was just curious why it doesn't work on the master branch either.
On Fri, Jul 26, 2019 at 3:49 PM Daniel Weeks <dwe...@netflix.com> wrote:

Actually, it looks like the issue is right there in the error: the ErrorProne module is excluded from the compile stages of the sub-projects here:
https://github.com/apache/incubator-iceberg/blob/master/build.gradle#L152

However, it is still being applied to the jmh tasks. I'm not familiar with this module, but you can run the benchmarks by commenting it out here:
https://github.com/apache/incubator-iceberg/blob/master/build.gradle#L167

We'll need to fix the build to disable it for the jmh tasks.

-Dan

On Fri, Jul 26, 2019 at 3:34 PM Daniel Weeks <dwe...@netflix.com> wrote:

Gautam, you need to have the jmh-core libraries available to run. I validated that PR, so I'm guessing I had them configured in my environment.

I assume there's a way to make that available within gradle, so I'll take a look.

On Fri, Jul 26, 2019 at 2:52 PM Gautam <gautamkows...@gmail.com> wrote:

This fails on master too, btw. Just wondering if I'm doing something wrong trying to run this.

On Fri, Jul 26, 2019 at 2:24 PM Gautam <gautamkows...@gmail.com> wrote:

I've been trying to run the jmh benchmarks bundled with the project and have been running into issues. Have others hit this? Am I running these incorrectly?

    bash-3.2$ ./gradlew :iceberg-spark:jmh \
        -PjmhIncludeRegex=IcebergSourceFlatParquetDataFilterBenchmark \
        -PjmhOutputPath=benchmark/iceberg-source-flat-parquet-data-filter-benchmark-result.txt
    ...
    > Task :iceberg-spark:jmhCompileGeneratedClasses FAILED
    error: plug-in not found: ErrorProne

    FAILURE: Build failed with an exception.

Is there a config/plugin I need to add to build.gradle?

On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue <rb...@netflix.com> wrote:

Thanks, Gautam!

We'll start taking a look at your code. What do you think about creating a branch in the Iceberg repository where we can work on improving it together, before merging it into master?

Also, you mentioned performance comparisons. Do you have any early results to share?

rb

On Tue, Jul 23, 2019 at 3:40 PM Gautam <gautamkows...@gmail.com> wrote:

Hello Folks,

I have checked in a WIP branch [1] with a working version of vectorized reads for the Iceberg reader. Here's the diff [2].

*Implementation Notes:*
- Iceberg's Reader adds the `SupportsScanColumnarBatch` mixin to instruct DataSourceV2ScanExec to use `planBatchInputPartitions()` instead of the usual `planInputPartitions()`, returning instances of `ColumnarBatch` on each iteration.
- `ArrowSchemaUtil` contains the Iceberg-to-Arrow schema conversion. This was copied from [3], added by Daniel Weeks <dwe...@netflix.com>. Thanks for that!
- `VectorizedParquetValueReaders` contains the ParquetValueReaders used for reading/decoding the Parquet row groups (aka page stores, as they are referred to in the code).
- `VectorizedSparkParquetReaders` contains the visitor implementations that map Parquet types to the appropriate value readers. I implemented the struct visitor so that the root schema can be mapped properly. This has the added benefit of vectorization support for structs, so yay!
- For the initial version the value readers read an entire row group into a single Arrow FieldVector. I'd imagine this will require tuning to find the right batch size, but I've gone with one batch per row group for now.
- Arrow FieldVectors are wrapped using `ArrowColumnVector`, which is Spark's ColumnVector implementation backed by Arrow. This is the first contact point between the Spark and Arrow interfaces.
- ArrowColumnVectors are stitched together into a `ColumnarBatch` by `ColumnarBatchReader` (a sketch of this wiring follows this message). This is my replacement for `InternalRowReader`, which maps structs to columnar batches. It allows us to have nested structs, where each level of nesting is a nested columnar batch. Let me know what you think of this approach.
- I've added value readers for all the supported primitive types listed in `AvroDataTest`. There's a corresponding test for the vectorized reader under `TestSparkParquetVectorizedReader`.
- I haven't fixed all the Checkstyle errors, so you will have to turn Checkstyle off in build.gradle. Also skip tests while building.. sorry! :-(

*P.S.* There's some unused code under ArrowReader.java; ignore it. It's from my previous vectorization implementation, which I've kept around to compare performance.

Let me know what folks think of the approach. I'm getting this working for our scale-test benchmark and will report back with numbers. Feel free to run your own benchmarks and share.

Cheers,
-Gautam.

[1] - https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
[2] - https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP
[3] - https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java
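A minimal sketch of the wiring the notes above describe — not the branch's actual ColumnarBatchReader, just the shape of it — assuming Spark 2.4's ArrowColumnVector/ColumnarBatch and Arrow's Java vector API; the single `id` column and the `batchFor` helper are illustrative:

    // Sketch: wrap decoded Arrow FieldVectors in Spark's Arrow-backed
    // ColumnVector and stitch them into one ColumnarBatch per row group.
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class ColumnarBatchSketch {
      public static ColumnarBatch batchFor(int rowsInGroup, BufferAllocator allocator) {
        // One FieldVector per column; a single int column here for brevity.
        IntVector ids = new IntVector("id", allocator);
        ids.allocateNew(rowsInGroup);
        for (int i = 0; i < rowsInGroup; i++) {
          ids.set(i, i); // in the real reader, values come from the Parquet column iterator
        }
        ids.setValueCount(rowsInGroup);

        // First contact point between Spark and Arrow: wrap each Arrow
        // vector in Spark's ColumnVector implementation backed by Arrow.
        ColumnVector[] columns = new ColumnVector[] { new ArrowColumnVector(ids) };

        // Stitch the column vectors together into a batch.
        ColumnarBatch batch = new ColumnarBatch(columns);
        batch.setNumRows(rowsInGroup);
        return batch;
      }

      public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
          ColumnarBatch batch = batchFor(1024, allocator);
          System.out.println(batch.numRows()); // 1024
          batch.close(); // releases the wrapped Arrow buffers
        }
      }
    }

Closing the batch releases the underlying Arrow buffers, which is why the reader, not Spark, owns the allocator.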
On Mon, Jul 22, 2019 at 2:33 PM Gautam <gautamkows...@gmail.com> wrote:

Will do. Doing a bit of housekeeping on the code and also adding more primitive type support.

On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah <mch...@palantir.com> wrote:

Would it be possible to put the work-in-progress code in open source?

*From:* Gautam <gautamkows...@gmail.com>
*Reply-To:* "dev@iceberg.apache.org" <dev@iceberg.apache.org>
*Date:* Monday, July 22, 2019 at 9:46 AM
*To:* Daniel Weeks <dwe...@netflix.com>
*Cc:* Ryan Blue <rb...@netflix.com>, Iceberg Dev List <dev@iceberg.apache.org>
*Subject:* Re: Approaching Vectorized Reading in Iceberg ..

That would be great!

On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks <dwe...@netflix.com> wrote:

Hey Gautam,

We also have a couple of people looking into vectorized reading (into Arrow memory). I think it would be good for us to get together and see if we can collaborate on a common approach for this.

I'll reach out directly and see if we can get together.

-Dan
On Sun, Jul 21, 2019 at 10:35 PM Gautam <gautamkows...@gmail.com> wrote:

Figured this out. I'm returning the ColumnarBatch iterator directly, without the projection, with the schema set appropriately in `readSchema()`. The empty result was due to valuesRead not being set correctly on the FileIterator. Did that and things are working. Will circle back with numbers soon.

On Fri, Jul 19, 2019 at 5:22 PM Gautam <gautamkows...@gmail.com> wrote:

Hey Guys,

Sorry about the delay on this. Just got back to getting a basic working implementation in Iceberg of vectorization on primitive types.

*Here's what I have so far:*

I have added `ParquetValueReader` implementations for some basic primitive types that build the respective Arrow vector (`ValueVector`), viz. `IntVector` for int, `VarCharVector` for strings, and so on. Underneath each value vector reader there are column iterators that read from the Parquet page stores (row groups) in chunks. These ValueVectors are lined up as `ArrowColumnVector`s (Spark's ColumnVector wrapper backed by Arrow) and stitched together using a `ColumnarBatchReader` (which, as the name suggests, wraps ColumnarBatches in the iterator). I've verified that these pieces work properly with the underlying interfaces. I've also made changes to Iceberg's `Reader` to implement `planBatchInputPartitions()` (adding the `SupportsScanColumnarBatch` mixin to the reader), so the reader now produces ColumnarBatch instances instead of InternalRow. Query planning works fine with these changes.

It fails during query execution, though. The bit it's currently failing at is this line of code:
https://github.com/apache/incubator-iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/Reader.java#L414

This code, I think, tries to apply the iterator's schema projection to the InternalRow instances. It seems to be tightly coupled to InternalRow, as Spark's Catalyst expressions implement UnsafeProjection for InternalRow only. If I take this out and just return the `Iterator<ColumnarBatch>` I built, it returns an empty result on the client. I'm guessing this is because Spark is unaware of the iterator's schema? There's a TODO in the code that says "*remove the projection by reporting the iterator's schema back to Spark*". Is there a simple way to communicate that to Spark for my new iterator? Any pointers on how to get around this?

Thanks and Regards,
-Gautam.
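The TODO above is about the DataSourceV2 contract: with the `SupportsScanColumnarBatch` mixin, `readSchema()` is how the iterator's schema is reported back to Spark, so no row-level UnsafeProjection is needed. A hedged sketch of that shape under Spark 2.4's API — this is not Iceberg's actual Reader, and `lazySchema()` and `tasks()` are hypothetical stand-ins for the real projection and task planning:

    // Sketch: report the batch iterator's schema to Spark via readSchema().
    import java.util.List;
    import org.apache.spark.sql.sources.v2.reader.InputPartition;
    import org.apache.spark.sql.sources.v2.reader.SupportsScanColumnarBatch;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    abstract class BatchReaderSketch implements SupportsScanColumnarBatch {
      @Override
      public StructType readSchema() {
        // Spark trusts this schema for the ColumnarBatches the partitions
        // return, so batches must carry exactly these columns, in this order.
        return lazySchema();
      }

      @Override
      public List<InputPartition<ColumnarBatch>> planBatchInputPartitions() {
        // One InputPartition per scan task; each partition's reader iterates
        // ColumnarBatches (e.g., one per Parquet row group).
        return tasks();
      }

      // Hypothetical helpers standing in for the real projection/planning.
      abstract StructType lazySchema();
      abstract List<InputPartition<ColumnarBatch>> tasks();
    }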
On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue <rb...@netflix.com> wrote:

Replies inline.

On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com> wrote:

> Thanks for responding, Ryan.
>
> A couple of follow-up questions on ParquetValueReader for Arrow.
>
> I'd like to start with testing Arrow out with readers for primitive types and incrementally add in Struct/Array support; also, ArrowWriter [1] currently doesn't have converters for the map type. How can I default these types to regular materialization while supporting Arrow-based reads for primitives?

We should look at what Spark does to handle maps.

I think we should get the prototype working with test cases that don't have maps, structs, or lists. Just getting primitives working is a good start and just won't hit these problems.

> Lemme know if this makes sense...
>
> - I extend PrimitiveReader (for Arrow) to load primitive types into ArrowColumnVectors of corresponding column types by iterating over the underlying ColumnIterator *n times*, where n is the size of the batch.

Sounds good to me. I'm not sure about extending vs. wrapping because I'm not too familiar with the Arrow APIs.

> - Reader.newParquetIterable() maps primitive column types to the newly added ArrowParquetValueReader, but for other types (nested types, etc.) uses the current *InternalRow*-based ValueReaders.

Sounds good for primitives, but I would just leave the nested types un-implemented for now.

> - Stitch the column vectors together to create a ColumnarBatch (since the *SupportsScanColumnarBatch* mixin currently expects this) .. *although I'm a bit lost on how the stitching of columns happens currently*? .. and how could the ArrowColumnVectors be stitched alongside regular columns that don't have Arrow-based support?

I don't think that you can mix regular columns and Arrow columns. It has to be all one or the other. That's why it's easier to start with primitives, then add structs, then lists, and finally maps.

> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so that DataSourceV2ScanExec starts using ColumnarBatch scans.

We will probably need two paths: one for columnar batches and one for row-based reads. That doesn't need to be done right away, and what you already have in your working copy makes sense as a start.

> That's a lot of questions! :-) but hope I'm making sense.
>
> -Gautam.
> [1] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala

--
Ryan Blue
Software Engineer
Netflix
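For reference, a hedged sketch of the batching loop from the first bullet in the exchange above: a primitive reader drains an underlying column iterator n times (n = batch size) into an Arrow vector, which is then wrapped for Spark. `ColumnIterator` is approximated here by a plain java.util.Iterator, and none of these names come from the Iceberg branch:

    // Sketch: fill an Arrow IntVector with up to batchSize values per call.
    import java.util.Iterator;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;

    class BatchedIntReaderSketch {
      private final Iterator<Integer> columnIterator; // stand-in for the Parquet column iterator
      private final BufferAllocator allocator;
      private final int batchSize;

      BatchedIntReaderSketch(Iterator<Integer> columnIterator, BufferAllocator allocator, int batchSize) {
        this.columnIterator = columnIterator;
        this.allocator = allocator;
        this.batchSize = batchSize;
      }

      /** Reads up to batchSize values into a fresh vector; returns null when exhausted. */
      ArrowColumnVector nextBatch() {
        if (!columnIterator.hasNext()) {
          return null;
        }
        IntVector vector = new IntVector("col", allocator);
        vector.allocateNew(batchSize);
        int count = 0;
        while (count < batchSize && columnIterator.hasNext()) {
          Integer value = columnIterator.next();
          if (value == null) {
            vector.setNull(count); // nulls map to Arrow validity bits
          } else {
            vector.set(count, value);
          }
          count++;
        }
        vector.setValueCount(count);
        return new ArrowColumnVector(vector);
      }
    }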