Thank you, Gautam, for the summary of the discussion.

Hello Devs,

The follow-up vectorized Iceberg tasks are now captured as issues against
the milestone, listed below for convenience.
https://github.com/apache/incubator-iceberg/issues/518
https://github.com/apache/incubator-iceberg/issues/519
https://github.com/apache/incubator-iceberg/issues/520
https://github.com/apache/incubator-iceberg/issues/521
https://github.com/apache/incubator-iceberg/issues/522

thanks,
Anjali.


On Mon, Oct 7, 2019 at 12:41 PM Gautam <gautamkows...@gmail.com> wrote:

> Hello Devs,
>                 We met to discuss progress and next steps on Vectorized
> read path in Iceberg. Here are my notes from the sync. Feel free to reply
> with clarifications in case I mis-quoted or missed anything.
>
> *Attendees*:
>
> Anjali Norwood
> Padma Pennumarthy
> Ryan Blue
> Samarth Jain
> Gautam Kowshik
>
> *Topics*
> - Progress on Arrow Based Vectorization Reads
> - Features being worked on and possible improvements
> - Pending bottlenecks
> - Identify things to collaborate on going forward.
> - Next steps
>
> Arrow Vectorized Reader
>
>   Samarth/Anjali:
>
>    - Working on Arrow-based vectorization [1]
>    - At performance parity with Spark on reads of primitive types,
>    except strings.
>    - Planning to add dictionary encoding for strings
>    - The new Arrow version gives a performance boost and fixes issues
>    - Vectorized, batched reading of definition levels improves
>    performance
>    - Some checks had to be turned off in Arrow to push performance
>    further, viz. the per-get null check and bounds checking for unsafe
>    memory access (see the sketch after this list)
>    - Implemented prefetching of Parquet pages, which improves
>    performance on primitives beyond vanilla Spark
>
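> These are most likely Arrow Java's null-check and bounds-check
> switches; a minimal sketch (assuming Arrow's current system property
> names, with an illustrative class name) of flipping them:
>
>    // Assumed Arrow Java system properties; they are read by static
>    // initializers, so set them (or pass the equivalent -D flags)
>    // before any Arrow vector classes are loaded.
>    public class ArrowFastPathSettings {
>      public static void main(String[] args) {
>        System.setProperty("arrow.enable_null_check_for_get", "false");  // skip per-get null checks
>        System.setProperty("arrow.enable_unsafe_memory_access", "true"); // skip bounds checking
>      }
>    }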
>
>    Ryan:
>
>
>    - The Arrow version should not be tied to Spark; Iceberg should
>    have its own implementation binding so it works with any reader,
>    not just Spark.
>    - Add a DataSourceV2Strategy that handles nested pruning to
>    upstream Spark. Will coordinate with the Apple folks to get their
>    work into Spark.
>    - Need the ability to fall back to row-based reads for cases where
>    columnar isn't possible, perhaps via a config option.
>    - Could add an option where columnar batches are read into
>    InternalRow and returned to the DataSource (see the sketch after
>    this list).
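>
> As a rough illustration of the last point (the adapter class and
> method names here are hypothetical, not Iceberg's actual reader
> classes), Spark's ColumnarBatch already exposes a row view that could
> be handed back to the DataSource:
>
>    import java.util.Iterator;
>    import org.apache.spark.sql.catalyst.InternalRow;
>    import org.apache.spark.sql.vectorized.ColumnarBatch;
>
>    // Hypothetical adapter: data is read vectorized, rows are returned.
>    class RowViewOverBatch {
>      // ColumnarBatch.rowIterator() walks the batch row by row without
>      // copying the underlying column vectors.
>      static Iterator<InternalRow> asRows(ColumnarBatch batch) {
>        return batch.rowIterator();
>      }
>    }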
>
>   Padma:
>
>    - Possibly contribute the Arrow-related work back to the Arrow
>    project (we can punt on this for now to move faster on the current
>    work).
>    - Has been looking into complex type support for Arrow-based reads.
>
>
> V1 Vectorized Read Path [2]
>
> Gautam:
>
>    - Has been working on the V1 vectorized short-circuit read path [3]
>    (this is probably less useful once we have full-featured Arrow-based
>    reads).
>    - Will work on getting the schema evolution parts working with this
>    reader by getting the projection unit/integration tests to pass
>    (this can be contributed back into the Iceberg repo to unblock this
>    path if we want that option until the Arrow-based read is fully
>    capable).
>
>
>
> *Next steps:*
>
>    - Unit tests for the current Arrow-based work.
>    - Provide options to perform vectorized batch reads, row-oriented
>    reads, and InternalRow-over-batch reads (a sketch of such an option
>    follows this list).
>    - Separate the Arrow work in Iceberg into its own sub-module.
>    - Dictionary encoding support for strings in Arrow.
>    - Complex type support for Arrow.
>    - File issues for the above and identify how to distribute work
>    between us.
>
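> To make the read-options item concrete, the switch could look roughly
> like the sketch below; the "read-mode" option name and its values are
> hypothetical, not an existing Iceberg option:
>
>    import org.apache.spark.sql.Dataset;
>    import org.apache.spark.sql.Row;
>    import org.apache.spark.sql.SparkSession;
>
>    // Hypothetical read-mode switch:
>    //   "batch"          -> Arrow/ColumnarBatch reads
>    //   "row"            -> plain row-based reads
>    //   "row-over-batch" -> vectorized read exposed as InternalRow
>    class ReadModeExample {
>      static Dataset<Row> load(SparkSession spark, String mode) {
>        return spark.read()
>            .format("iceberg")
>            .option("read-mode", mode)   // hypothetical option name
>            .load("db.table");           // illustrative table identifier
>      }
>    }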
>
>
>
> [1]  https://github.com/apache/incubator-iceberg/tree/vectorized-read
>
> [2]  https://github.com/apache/incubator-iceberg/pull/462
>
> [3]
> https://github.com/prodeezy/incubator-iceberg/commits/v1-vectorized-reader
>
>
>
