Ah yes, I didn't send over the filter benchmarks ..

Num files: 500
Num rows per file: 10,000
*Benchmark                                                                           Mode  Cnt  Score    Error  Units*
IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceNonVectorized    ss     5  3.837  ± 0.424   s/op
IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceVectorized       ss     5  3.964  ± 1.891   s/op
IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIceberg                    ss     5  0.272  ± 0.039   s/op
IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergVect100k            ss     5  0.274  ± 0.013   s/op
IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergVect10k             ss     5  0.275  ± 0.040   s/op
IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergVect5k              ss     5  0.273  ± 0.031   s/op

On Wed, Jul 31, 2019 at 2:35 PM Anjali Norwood <anorw...@netflix.com.invalid> wrote:

Hi Gautam,

You wrote: "- The filters are not being applied in columnar fashion; they are being applied row by row, as in Iceberg each filter visitor is stateless and applied separately on each row's column." .. this should not be a problem for this particular benchmark, as IcebergSourceFlatParquetDataReadBenchmark does not apply filters.

-Anjali.

On Wed, Jul 31, 2019 at 1:44 PM Gautam <gautamkows...@gmail.com> wrote:

Hey Samarth,
Sorry about the delay. I ran into some bottlenecks for which I had to add more code to be able to run the benchmarks. I've checked in my latest changes to my fork's *vectorized-read* branch [0].

Here are the early numbers on the initial implementation...

*Benchmark Data:*
- 10 files
- 9 MB each
- 1 million rows (1 RowGroup)

Ran the benchmark using the jmh tool under incubator-iceberg/spark/src/jmh with different batch sizes and compared it to Spark's vectorized and non-vectorized readers.

*Command:*
./gradlew clean :iceberg-spark:jmh -PjmhIncludeRegex=IcebergSourceFlatParquetDataReadBenchmark -PjmhOutputPath=benchmark/iceberg-source-flat-parquet-data-read-benchmark-result.txt

*Benchmark                                                                           Mode  Cnt   Score    Error  Units*
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                ss     5  16.172  ± 0.750   s/op
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                   ss     5   6.430  ± 0.136   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                ss     5  15.287  ± 0.212   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k                  ss     5  18.310  ± 0.498   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k                   ss     5  18.020  ± 0.378   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k                    ss     5  17.769  ± 0.412   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized  ss     5   2.794  ± 0.141   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized     ss     5   1.063  ± 0.140   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                  ss     5   2.966  ± 0.133   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergVectorized100k    ss     5   2.015  ± 0.261   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergVectorized10k     ss     5   1.972  ± 0.105   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergVectorized5k      ss     5   2.065  ± 0.079   s/op

So it seems vectorization adds no improvement over non-vectorized reading. I'm currently trying to profile where the time is being spent.

*Here is my initial speculation of why this is slow:*
- There's too much overhead from creating the batches: I'm creating a new instance of ColumnarBatch on each read [1]. This should probably be re-used (see the sketch right after this list).
- Although I am re-using the *FieldVector* across batched reads [2], I wrap them in new *ArrowColumnVector*s [3] on each read call. I didn't think this would be a big deal, but maybe it is.
- The filters are not being applied in columnar fashion; they are applied row by row, since in Iceberg each filter visitor is stateless and applied separately on each row's column.
- I'm trying to re-use the BufferAllocator that Arrow provides [4]. Not sure if there are better strategies for using this; will look more into it.
- I'm batching until the rowgroup ends and restricting the last batch to the rowgroup boundary. I should probably spill over into the next rowgroup to fill that batch. Not sure this would help though, as from what I can tell *VectorizedParquetRecordReader* doesn't do this either.
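A minimal sketch of the re-use idea from the first two bullets: allocate the Arrow vectors, the ArrowColumnVector wrappers, and the ColumnarBatch once per task and only reset/refill them on each read. The holder class below is illustrative, not code from the branch; only the Arrow and Spark classes are real.

    import org.apache.arrow.vector.FieldVector;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    class ReusableBatchHolder {
      private final FieldVector[] vectors;        // allocated once per task
      private final ArrowColumnVector[] columns;  // wrappers created once, not on every read()
      private final ColumnarBatch batch;          // single batch instance, re-used

      ReusableBatchHolder(FieldVector[] vectors) {
        this.vectors = vectors;
        this.columns = new ArrowColumnVector[vectors.length];
        for (int i = 0; i < vectors.length; i++) {
          this.columns[i] = new ArrowColumnVector(vectors[i]);
        }
        this.batch = new ColumnarBatch(this.columns);
      }

      // Called once per read(): the vectors have been refilled, so just update the row count.
      ColumnarBatch batchFor(int rowsInBatch) {
        batch.setNumRows(rowsInBatch);
        return batch;
      }

      // Reset the underlying vectors before refilling them for the next batch.
      void resetForNextRead() {
        for (FieldVector vector : vectors) {
          vector.reset();
        }
      }
    }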
I'll try to provide more insights once I improve my code. But if there are other insights folks have on where we can improve things, I'd gladly try them.

Cheers,
- Gautam.

[0] - https://github.com/prodeezy/incubator-iceberg/tree/vectorized-read
[1] - https://github.com/prodeezy/incubator-iceberg/blob/vectorized-read/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L655
[2] - https://github.com/prodeezy/incubator-iceberg/blob/vectorized-read/spark/src/main/java/org/apache/iceberg/spark/data/vector/VectorizedParquetValueReaders.java#L108
[3] - https://github.com/prodeezy/incubator-iceberg/blob/vectorized-read/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L651
[4] - https://github.com/prodeezy/incubator-iceberg/blob/vectorized-read/spark/src/main/java/org/apache/iceberg/spark/data/vector/VectorizedSparkParquetReaders.java#L92

On Tue, Jul 30, 2019 at 5:13 PM Samarth Jain <samarth.j...@gmail.com> wrote:

Hey Gautam,

Wanted to check back with you and see if you had any success running the benchmark and if you have any numbers to share.

On Fri, Jul 26, 2019 at 4:34 PM Gautam <gautamkows...@gmail.com> wrote:

Got it. Commented out that module and it works. Was just curious why it doesn't work on the master branch either.

On Fri, Jul 26, 2019 at 3:49 PM Daniel Weeks <dwe...@netflix.com> wrote:

Actually, it looks like the issue is right there in the error . . . the ErrorProne module is being excluded from the compile stages of the sub-projects here:
https://github.com/apache/incubator-iceberg/blob/master/build.gradle#L152

However, it is still being applied to the jmh tasks. I'm not familiar with this module, but you can run the benchmarks by commenting it out here:
https://github.com/apache/incubator-iceberg/blob/master/build.gradle#L167

We'll need to fix the build to disable it for the jmh tasks.

-Dan
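One possible shape for that longer-term fix is to disable Error Prone only on the JMH-generated compile task instead of removing the plugin. This is an untested sketch and assumes the build uses a recent net.ltgt.errorprone Gradle plugin that exposes options.errorprone on JavaCompile tasks; if it doesn't, commenting the line out as described above remains the workaround.

    // build.gradle (sketch): keep Error Prone for regular compilation,
    // but skip it for the task that failed in the log below.
    tasks.matching { it.name == 'jmhCompileGeneratedClasses' }.configureEach {
        options.errorprone.enabled = false
    }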
On Fri, Jul 26, 2019 at 3:34 PM Daniel Weeks <dwe...@netflix.com> wrote:

Gautam, you need to have the jmh-core libraries available to run. I validated that PR, so I'm guessing I had it configured in my environment.

I assume there's a way to make that available within gradle, so I'll take a look.

On Fri, Jul 26, 2019 at 2:52 PM Gautam <gautamkows...@gmail.com> wrote:

This fails on master too, btw. Just wondering if I'm doing something wrong trying to run this.

On Fri, Jul 26, 2019 at 2:24 PM Gautam <gautamkows...@gmail.com> wrote:

I've been trying to run the jmh benchmarks bundled within the project. I've been running into issues with that .. have others hit this? Am I running these incorrectly?

bash-3.2$ ./gradlew :iceberg-spark:jmh -PjmhIncludeRegex=IcebergSourceFlatParquetDataFilterBenchmark -PjmhOutputPath=benchmark/iceberg-source-flat-parquet-data-filter-benchmark-result.txt
..
...
> Task :iceberg-spark:jmhCompileGeneratedClasses FAILED
error: plug-in not found: ErrorProne

FAILURE: Build failed with an exception.

Is there a config/plugin I need to add to build.gradle?

On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue <rb...@netflix.com> wrote:

Thanks Gautam!

We'll start taking a look at your code. What do you think about creating a branch in the Iceberg repository where we can work on improving it together, before merging it into master?

Also, you mentioned performance comparisons. Do you have any early results to share?

rb

On Tue, Jul 23, 2019 at 3:40 PM Gautam <gautamkows...@gmail.com> wrote:

Hello Folks,

I have checked in a WIP branch [1] with a working version of vectorized reads for the Iceberg reader. Here's the diff [2].

*Implementation Notes:*
- Iceberg's Reader adds a `SupportsScanColumnarBatch` mixin to instruct DataSourceV2ScanExec to use `planBatchPartitions()` instead of the usual `planInputPartitions()`. It returns instances of `ColumnarBatch` on each iteration.
- `ArrowSchemaUtil` contains the Iceberg-to-Arrow type conversion (a rough sketch of this mapping follows these notes). This was copied from [3]. Added by @Daniel Weeks <dwe...@netflix.com> . Thanks for that!
- `VectorizedParquetValueReaders` contains the ParquetValueReaders used for reading/decoding the Parquet rowgroups (aka pagestores, as they're referred to in the code).
- `VectorizedSparkParquetReaders` contains the visitor implementations that map Parquet types to the appropriate value readers. I implemented the struct visitor so that the root schema can be mapped properly. This has the added benefit of vectorization support for structs, so yay!
- For the initial version the value readers read an entire row group into a single Arrow FieldVector. This will likely need tuning to get the batch sizing right, but I've gone with one batch per rowgroup for now.
- Arrow FieldVectors are wrapped using `ArrowColumnVector`, which is Spark's ColumnVector implementation backed by Arrow. This is the first contact point between the Spark and Arrow interfaces.
- ArrowColumnVectors are stitched together into a `ColumnarBatch` by `ColumnarBatchReader`. This is my replacement for `InternalRowReader`, which maps structs to columnar batches. This allows us to have nested structs, where each level of nesting would be a nested columnar batch. Let me know what you think of this approach.
- I've added value readers for all supported primitive types listed in `AvroDataTest`. There's a corresponding test for the vectorized reader under `TestSparkParquetVectorizedReader`.
- I haven't fixed all the Checkstyle errors, so you will have to turn checkstyle off in build.gradle. Also skip tests while building.. sorry! :-(
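A rough sketch of the flat-schema mapping the ArrowSchemaUtil bullet describes, converting each top-level Iceberg primitive column to an Arrow Field. It is illustrative only and does not mirror the actual class; the Arrow and Iceberg APIs used are real, but the coverage of types is deliberately minimal.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.arrow.vector.types.FloatingPointPrecision;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    class FlatSchemaConversionSketch {
      // Map each top-level Iceberg primitive column to a nullable Arrow Field.
      static List<Field> toArrowFields(Schema icebergSchema) {
        List<Field> fields = new ArrayList<>();
        for (Types.NestedField col : icebergSchema.columns()) {
          ArrowType arrowType;
          switch (col.type().typeId()) {
            case INTEGER: arrowType = new ArrowType.Int(32, true); break;
            case LONG:    arrowType = new ArrowType.Int(64, true); break;
            case FLOAT:   arrowType = new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE); break;
            case DOUBLE:  arrowType = new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE); break;
            case STRING:  arrowType = new ArrowType.Utf8(); break;
            case BOOLEAN: arrowType = new ArrowType.Bool(); break;
            default: throw new UnsupportedOperationException("Not covered in this sketch: " + col.type());
          }
          fields.add(Field.nullable(col.name(), arrowType));
        }
        return fields;
      }
    }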
*P.S.* There's some unused code under ArrowReader.java. Ignore it; it's not used. It's from my previous implementation of vectorization, which I've kept around to compare performance.

Let me know what folks think of the approach. I'm getting this working for our scale test benchmark and will report back with numbers. Feel free to run your own benchmarks and share.

Cheers,
-Gautam.

[1] - https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
[2] - https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP
[3] - https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java

On Mon, Jul 22, 2019 at 2:33 PM Gautam <gautamkows...@gmail.com> wrote:

Will do. Doing a bit of housekeeping on the code and also adding more primitive type support.

On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah <mch...@palantir.com> wrote:

Would it be possible to put the work-in-progress code in open source?

*From:* Gautam <gautamkows...@gmail.com>
*Date:* Monday, July 22, 2019 at 9:46 AM
*Subject:* Re: Approaching Vectorized Reading in Iceberg ..

That would be great!

On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks <dwe...@netflix.com> wrote:

Hey Gautam,

We also have a couple of people looking into vectorized reading (into Arrow memory). I think it would be good for us to get together and see if we can collaborate on a common approach for this.

I'll reach out directly and see if we can get together.

-Dan

On Sun, Jul 21, 2019 at 10:35 PM Gautam <gautamkows...@gmail.com> wrote:

Figured this out. I'm returning the ColumnarBatch iterator directly, without the projection, with the schema set appropriately in `readSchema()` .. the empty result was due to valuesRead not being set correctly on FileIterator. Did that and things are working. Will circle back with numbers soon.
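A minimal sketch of the readSchema() part of that fix: report the projected schema back to Spark so the returned ColumnarBatches can be interpreted without a per-row projection. SparkSchemaUtil.convert is Iceberg's schema conversion helper; the wrapper class and method name here are illustrative, not the branch's code.

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.spark.SparkSchemaUtil;
    import org.apache.spark.sql.types.StructType;

    class ReadSchemaSketch {
      // What readSchema() hands back to Spark: the projected Iceberg schema converted
      // to a Spark StructType, so the batch columns line up with what Spark expects.
      static StructType reportedReadSchema(Schema projectedIcebergSchema) {
        return SparkSchemaUtil.convert(projectedIcebergSchema);
      }
    }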
On Fri, Jul 19, 2019 at 5:22 PM Gautam <gautamkows...@gmail.com> wrote:

Hey Guys,
Sorry about the delay on this. Just got back to getting a basic working implementation in Iceberg for vectorization on primitive types.

*Here's what I have so far:*

I have added `ParquetValueReader` implementations for some basic primitive types that build the respective Arrow vector (`ValueVector`), viz. `IntVector` for int, `VarCharVector` for strings, and so on. Underneath each value vector reader there are column iterators that read from the Parquet pagestores (rowgroups) in chunks. These `ValueVector`s are lined up as `ArrowColumnVector`s (which is Spark's ColumnVector wrapper backed by Arrow) and stitched together using a `ColumnarBatchReader` (which, as the name suggests, wraps ColumnarBatches in the iterator). I've verified that these pieces work properly with the underlying interfaces. I've also made changes to Iceberg's `Reader` to implement `planBatchPartitions()` (to add the `SupportsScanColumnarBatch` mixin to the reader). So the reader now expects ColumnarBatch instances (instead of InternalRow). The query planning runtime works fine with these changes.

It fails during query execution, though; the bit it's currently failing at is this line of code:
https://github.com/apache/incubator-iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/Reader.java#L414

This code, I think, tries to apply the iterator's schema projection on the InternalRow instances. This seems to be tightly coupled to InternalRow, as Spark's catalyst expressions have implemented UnsafeProjection for InternalRow only. If I take this out and just return the `Iterator<ColumnarBatch>` I built, it returns an empty result on the client. I'm guessing this is because Spark is unaware of the iterator's schema? There's a TODO in the code that says "*remove the projection by reporting the iterator's schema back to Spark*". Is there a simple way to communicate that to Spark for my new iterator? Any pointers on how to get around this?

Thanks and Regards,
-Gautam.
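A minimal sketch of the primitive-type path described above: filling an Arrow IntVector from a column iterator, one batch at a time. Only the Arrow classes are real; `ColumnSource` is a stand-in for the Iceberg/Parquet column iterator, not an actual class.

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.IntVector;

    class IntBatchReadSketch {
      // Stand-in for the column iterator that walks the Parquet pagestore; illustrative only.
      interface ColumnSource {
        boolean hasNext();
        boolean nextIsNull();
        int nextInt();
      }

      // Pull up to batchSize int values into an Arrow IntVector.
      static IntVector readIntBatch(ColumnSource column, BufferAllocator allocator, int batchSize) {
        IntVector vector = new IntVector("col", allocator);
        vector.allocateNew(batchSize);
        int rows = 0;
        while (rows < batchSize && column.hasNext()) {
          if (column.nextIsNull()) {
            vector.setNull(rows);
          } else {
            vector.setSafe(rows, column.nextInt());
          }
          rows++;
        }
        vector.setValueCount(rows);  // the per-batch row count Spark will see
        return vector;
      }
    }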
On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue <rb...@netflix.com> wrote:

Replies inline.

On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com> wrote:

> Thanks for responding Ryan,
>
> A couple of follow-up questions on ParquetValueReader for Arrow..
>
> I'd like to start with testing Arrow out with readers for primitive types and incrementally add Struct/Array support; also, ArrowWriter [1] currently doesn't have converters for the map type. How can I default these types to regular materialization whilst supporting Arrow-based reads for primitives?

We should look at what Spark does to handle maps.

I think we should get the prototype working with test cases that don't have maps, structs, or lists. Just getting primitives working is a good start and just won't hit these problems.

> Lemme know if this makes sense...
>
> - I extend PrimitiveReader (for Arrow) that loads primitive types into ArrowColumnVectors of the corresponding column types by iterating over the underlying ColumnIterator *n times*, where n is the size of the batch.

Sounds good to me. I'm not sure about extending vs. wrapping because I'm not too familiar with the Arrow APIs.

> - Reader.newParquetIterable() maps primitive column types to the newly added ArrowParquetValueReader, but for other types (nested types, etc.) uses the current *InternalRow*-based ValueReaders.

Sounds good for primitives, but I would just leave the nested types un-implemented for now.

> - Stitch the column vectors together to create a ColumnarBatch (since the *SupportsScanColumnarBatch* mixin currently expects this) .. *although I'm a bit lost on how the stitching of columns happens currently*? .. and how could the ArrowColumnVectors be stitched alongside regular columns that don't have Arrow-based support?

I don't think that you can mix regular columns and Arrow columns. It has to be all one or the other. That's why it's easier to start with primitives, then add structs, then lists, and finally maps.
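A tiny sketch of that all-or-nothing check: decide up front whether the projected schema can take the Arrow/columnar path at all, and otherwise fall back to the existing row-based readers. The Iceberg Schema/Types APIs are real; the wrapper class and method name are illustrative.

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    class ColumnarPathCheckSketch {
      // Use the Arrow/ColumnarBatch path only when every projected column is a primitive;
      // structs, lists, and maps stay on the InternalRow path for now.
      static boolean useColumnarPath(Schema projectedSchema) {
        for (Types.NestedField field : projectedSchema.columns()) {
          if (!field.type().isPrimitiveType()) {
            return false;
          }
        }
        return true;
      }
    }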
> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so that DataSourceV2ScanExec starts using ColumnarBatch scans.

We will probably need two paths: one for columnar batches and one for row-based reads. That doesn't need to be done right away, and what you already have in your working copy makes sense as a start.

> That's a lot of questions! :-) but hope I'm making sense.
>
> -Gautam.
>
> [1] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala

--
Ryan Blue
Software Engineer
Netflix