Thank you, Gautam, for the summary of the discussion.

Hello Devs,
The follow-up vectorized Iceberg tasks are now captured as issues against the milestone. They are listed below for convenience.

https://github.com/apache/incubator-iceberg/issues/518
https://github.com/apache/incubator-iceberg/issues/519
https://github.com/apache/incubator-iceberg/issues/520
https://github.com/apache/incubator-iceberg/issues/521
https://github.com/apache/incubator-iceberg/issues/522

Thanks,
Anjali.

On Mon, Oct 7, 2019 at 12:41 PM Gautam <gautamkows...@gmail.com> wrote:

> Hello Devs,
> We met to discuss progress and next steps on the vectorized read path in
> Iceberg. Here are my notes from the sync. Feel free to reply with
> clarifications in case I misquoted or missed anything.
>
> *Attendees*:
>
> Anjali Norwood
> Padma Pennumarthy
> Ryan Blue
> Samarth Jain
> Gautam Kowshik
>
> *Topics*
> - Progress on Arrow-based vectorized reads
> - Features being worked on and possible improvements
> - Pending bottlenecks
> - Identify things to collaborate on going forward
> - Next steps
>
> Arrow Vectorized Reader
>
> Samarth/Anjali:
>
> - Working on Arrow-based vectorization [1]
> - At performance parity between Spark and Iceberg on primitive types
>   except strings.
> - Planning to do dictionary encoding on strings
> - New Arrow version gives a boost in performance and fixes issues
> - Vectorized batched reading of definition levels improves performance
> - Some checks had to be turned off in Arrow to push performance
>   further, viz. null check, unsafe memory access
> - Implemented prefetching of Parquet pages; this improves perf on
>   primitives beyond vanilla Spark
>
> Ryan:
>
> - Arrow version should not be tied to Spark and should have an
>   Iceberg-specific implementation binding so it will work with any
>   reader, not just Spark.
> - Add DatasourceV2Strategy to handle nested pruning into Spark
>   upstream. Will coordinate with Apple folks to add their work into Spark.
> - Need the ability to fall back to row-based reads for cases where
>   columnar isn't possible. A config option maybe.
> - Can add options where columnar batches are read into InternalRow and
>   returned to the Datasource.
>
> Padma:
>
> - Possibly contribute work on Arrow back to the Arrow project. (can punt
>   on this for now to move forward faster on current work)
> - Was looking into complex type support for Arrow-based reads.
>
> V1 Vectorized Read Path [2]
>
> Gautam:
>
> - Been working on the V1 vectorized short-circuit read path [3]. (this is
>   probably not as useful once we have full-featured support on Arrow-based
>   reads)
> - Will work on getting the schema evolution parts working with this reader
>   by getting Projection unit/integration tests working. (this can be
>   contributed back into the Iceberg repo to unblock this path if we want to
>   have that option until the Arrow-based read is fully capable)
>
> *Next steps:*
>
> - Unit tests for the current Arrow-based work.
> - Provide options to perform vectorized batch reads, row-oriented
>   reads and InternalRow-over-batch reads.
> - Separate the Arrow work in Iceberg into its own sub-module
> - Dictionary encoding support for strings in Arrow.
> - Complex type support for Arrow.
> - File issues for the above and identify how to distribute work
>   between us.
>
> [1] https://github.com/apache/incubator-iceberg/tree/vectorized-read
> [2] https://github.com/apache/incubator-iceberg/pull/462
> [3] https://github.com/prodeezy/incubator-iceberg/commits/v1-vectorized-reader