Open issues against Vectorized Iceberg Read milestone

Anjali Norwood Tue, 08 Oct 2019 10:28:46 -0700

Thank you Gautam for the summary of the discussion.

Hello Devs,


The follow-up tasks for vectorized Iceberg reads are now captured as issues
against the milestone. They are listed below for convenience.
https://github.com/apache/incubator-iceberg/issues/518
https://github.com/apache/incubator-iceberg/issues/519
https://github.com/apache/incubator-iceberg/issues/520
https://github.com/apache/incubator-iceberg/issues/521
https://github.com/apache/incubator-iceberg/issues/522

thanks,
Anjali.


On Mon, Oct 7, 2019 at 12:41 PM Gautam <gautamkows...@gmail.com> wrote:

> Hello Devs,
> We met to discuss progress and next steps on the vectorized read path in
> Iceberg. Here are my notes from the sync. Feel free to reply with
> clarifications in case I misquoted or missed anything.
>
> *Attendees*:
>
> Anjali Norwood
> Padma Pennumarthy
> Ryan Blue
> Samarth Jain
> Gautam Kowshik
>
> *Topics*:
> - Progress on Arrow-based vectorized reads
> - Features being worked on and possible improvements
> - Pending bottlenecks
> - Identify things to collaborate on going forward.
> - Next steps
>
> Arrow Vectorized Reader
>
>   Samarth/Anjali:
>
>    - Working on Arrow-based vectorization [1]
>    - At performance parity between Spark and Iceberg on primitive types
>    except strings.
>    - Planning to do dictionary encoding on strings
>    - The new Arrow version gives a boost in performance and fixes issues
>    - Vectorized, batched reading of definition levels improves performance
>    - Some checks had to be turned off in Arrow to push performance
>    further, viz. null check, unsafe memory access
>    - Implemented prefetching of Parquet pages; this improves perf on
>    primitives beyond vanilla Spark
>
>
>    Ryan:
>
>
>    - Arrow version should not be tied to Spark and should have an
>    Iceberg-specific implementation binding so it will work with any reader,
>    not just Spark.
>    - Add DatasourceV2Strategy to handle nested pruning into Spark
>    upstream. Will coordinate with Apple folks to add their work into Spark.
>    - Need the ability to fall back to row-based reads for cases where
>    columnar isn't possible. A config option maybe.
>    - Can add options where columnar batches are read into InternalRow and
>    returned to the Datasource.
>
>   Padma:
>
>    - Possibly contribute work on Arrow back to the Arrow project. (Can punt
>    on this for now to move forward faster on current work.)
>    - Was looking into complex type support for Arrow-based reads.
>
>
> V1 Vectorized Read Path [2]
>
> Gautam:
>
>    - Been working on the V1 vectorized short-circuit read path [3]. (This is
>    probably not as useful once we have full-featured support on Arrow-based
>    reads.)
>    - Will work on getting schema evolution parts working with this reader
>    by getting Projection unit/integration tests working. (This can be
>    contributed back into the Iceberg repo to unblock this path if we want to
>    have that option until the Arrow-based read is fully capable.)
>
>
>
> *Next steps:*
>
>    - Unit tests for current Arrow-based work.
>    - Provide options to perform vectorized batch reads, row-oriented
>    reads, and InternalRow-over-batch reads.
>    - Separate Arrow work in Iceberg into its own sub-module.
>    - Dictionary encoding support for strings in Arrow.
>    - Complex type support for Arrow.
>    - File issues for the above and identify how to distribute work
>    between us.
>
>
>
>
> [1]  https://github.com/apache/incubator-iceberg/tree/vectorized-read
>
> [2]  https://github.com/apache/incubator-iceberg/pull/462
>
> [3]
> https://github.com/prodeezy/incubator-iceberg/commits/v1-vectorized-reader
>
>
>
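
For readers unfamiliar with the dictionary encoding mentioned for string
columns in the notes above, here is a minimal, generic sketch of the idea in
Python. It is not the Iceberg or Arrow implementation; the function names
dictionary_encode and dictionary_decode are hypothetical and chosen only for
illustration. Each distinct string is stored once, and the column itself
becomes a list of small integer indices into that dictionary.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, indices) with dictionary[indices[i]] == values[i].

    Hypothetical helper for illustration only; not an Iceberg or Arrow API.
    """
    dictionary: List[str] = []       # distinct values, in first-seen order
    positions: Dict[str, int] = {}   # value -> its index in `dictionary`
    indices: List[int] = []          # one small integer per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary: List[str], indices: List[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]
```

The benefit for a vectorized reader is that repeated strings are materialized
once per dictionary entry rather than once per row.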

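The three read options listed under the next steps (vectorized batch reads,
row-oriented reads, and rows served from a batch) can be pictured with a
small, generic sketch. This is an assumption-laden illustration in Python,
not Iceberg or Spark code: ReadMode, read_batches, rows_over_batch, and read
are hypothetical names, and a "batch" here is just a dict of column lists.

```python
from enum import Enum
from typing import Dict, Iterator, List, Tuple

# A batch is modeled as column name -> list of values (illustrative only).
Batch = Dict[str, List[object]]


class ReadMode(Enum):
    BATCH = "batch"                    # hand whole columnar batches to the caller
    ROW = "row"                        # plain row-by-row fallback
    ROW_OVER_BATCH = "row_over_batch"  # read in batches, expose rows


def read_batches(columns: Batch, batch_size: int) -> Iterator[Batch]:
    """Slice every column into chunks of at most batch_size rows."""
    num_rows = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, num_rows, batch_size):
        yield {name: col[start:start + batch_size]
               for name, col in columns.items()}


def rows_over_batch(batch: Batch) -> Iterator[Tuple[object, ...]]:
    """Expose a columnar batch as a sequence of row tuples."""
    return zip(*batch.values())


def read(columns: Batch, mode: ReadMode, batch_size: int = 2):
    """Dispatch on a config-style mode flag (hypothetical)."""
    if mode is ReadMode.BATCH:
        yield from read_batches(columns, batch_size)
    elif mode is ReadMode.ROW_OVER_BATCH:
        for batch in read_batches(columns, batch_size):
            yield from rows_over_batch(batch)
    else:
        # Row-based fallback for cases where columnar reads are not possible.
        yield from zip(*columns.values())
```

With this shape, a caller that cannot consume columnar batches still gets
rows, and a single mode setting selects which path is used.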