Re: Iceberg using V1 Vectorized Reader over Parquet ..

2019-11-13 Thread Samarth Jain
I recently pushed support for vectorized reads of dictionary-encoded Parquet data and wanted to share some benchmark results for string and numeric data types:
Dictionary Encoded VARCHAR column
Benchmark Cnt Score Error Units
VectorizedDictionaryEncodedStringsBenchmark.readFileSourceNonVect

Re: Iceberg using V1 Vectorized Reader over Parquet ..

2019-09-09 Thread Samarth Jain
I wanted to share progress made so far with improving the performance of the Iceberg Arrow vectorized read path.
BIGINT column
Benchmark Cnt Score Error Units
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized 5 4.642 ± 1.629 s/op
IcebergSourceFlatParquetDataReadBenchmar

Re: Iceberg using V1 Vectorized Reader over Parquet ..

2019-09-05 Thread Ryan Blue
Nice work, Gautam! Looks like this could be a useful patch before the Arrow read path is ready to go. It's also good to see the performance between Spark's DataSource v2 and v1. We were wondering if the additional projection added in the v2 path was causing v2 to be slower than v1 due to an extra

Re: Iceberg using V1 Vectorized Reader over Parquet ..

2019-09-05 Thread Gautam
I've added unit tests and created a PR for the v1 vectorization work: https://github.com/apache/incubator-iceberg/pull/452 I'm sure there's scope for further improvement so lemme know your feedback on the PR so I can sharpen it further. Cheers, -Gautam. On Wed, Sep 4, 2019 at 10:33 PM Mouli M

Re: Iceberg using V1 Vectorized Reader over Parquet ..

2019-09-04 Thread Mouli Mukherjee
Hi Gautam, this is very exciting to see. It would be great if this was available behind a flag if possible. Best, Mouli On Wed, Sep 4, 2019, 7:01 AM Gautam wrote: > Hello Devs, > As some of you know there's been ongoing work as part of [1] to build Arrow-based vectorization

Iceberg using V1 Vectorized Reader over Parquet ..

2019-09-04 Thread Gautam
Hello Devs, As some of you know there's been ongoing work as part of [1] to build Arrow-based vectorization into Iceberg. There's a separate thread on this dev list where that is being discussed, and progress is being tracked in a separate branch [2]. The overall approach there is