I recently pushed support for vectorized reads of dictionary-encoded
Parquet data and wanted to share some benchmark results for string and
numeric data types:

Dictionary Encoded VARCHAR column

Benchmark                                                          Cnt  Score  Error  Units
VectorizedDictionaryEncodedStringsBenchmark.readFileSourceNonVect
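For context on why dictionary-encoded columns are a good fit for vectorized reads, here is a minimal, hypothetical sketch (not Iceberg's actual implementation; the class and method names are made up for illustration). Parquet stores a small dictionary of distinct values plus an array of integer indices, so a vectorized reader can materialize a whole batch with one tight loop instead of per-row virtual calls:

```java
import java.util.Arrays;

public class DictionaryDecode {
    // Hypothetical sketch: materialize one batch of a dictionary-encoded
    // string column. "dictionary" holds the distinct values; "indices"
    // holds one small integer per row pointing into the dictionary.
    static String[] decodeBatch(String[] dictionary, int[] indices) {
        String[] out = new String[indices.length];
        for (int i = 0; i < indices.length; i++) {
            out[i] = dictionary[indices[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        String[] dict = {"red", "green", "blue"};
        int[] ids = {0, 2, 2, 1, 0};
        System.out.println(Arrays.toString(decodeBatch(dict, ids)));
    }
}
```

Because the loop touches only primitive indices and a tiny dictionary, it stays cache-friendly, which is roughly where the speedup over row-at-a-time decoding comes from.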
I wanted to share progress made so far with improving the performance of
the Iceberg Arrow vectorized read path.
BIGINT column

Benchmark                                                           Cnt  Score    Error  Units
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized    5  4.642  ± 1.629   s/op
IcebergSourceFlatParquetDataReadBenchmar
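The vectorized and non-vectorized paths being compared above differ mainly in how values reach the aggregate. A minimal, hypothetical sketch of the two shapes (assumed method names, not the benchmark's actual code):

```java
import java.util.Iterator;

public class BatchVsRowRead {
    // Row-at-a-time path: one virtual call and one boxed Long per value.
    static long sumRowAtATime(Iterator<Long> rows) {
        long sum = 0;
        while (rows.hasNext()) {
            sum += rows.next();
        }
        return sum;
    }

    // Vectorized path: the reader decodes a whole batch into a primitive
    // long[] once, then downstream operators scan it in a tight loop.
    static long sumVectorized(long[] batch) {
        long sum = 0;
        for (long v : batch) {
            sum += v;
        }
        return sum;
    }
}
```

The JMH scores reported in the thread (s/op over 5 iterations, with the ± error column) are measuring end-to-end variants of exactly this kind of difference.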
Nice work, Gautam! Looks like this could be a useful patch before the Arrow
read path is ready to go.
It's also good to see the performance comparison between Spark's DataSource v2 and v1.
We were wondering if the additional projection added in the v2 path was
causing v2 to be slower than v1 due to an extra
I've added unit tests and created a PR for the v1 vectorization work:
https://github.com/apache/incubator-iceberg/pull/452
I'm sure there's scope for further improvement, so let me know your
feedback on the PR so I can sharpen it further.
Cheers,
-Gautam.
On Wed, Sep 4, 2019 at 10:33 PM Mouli M
Hi Gautam, this is very exciting to see. It would be great if this was
available behind a flag if possible.
Best,
Mouli
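A flag like the one Mouli suggests could surface as a Spark read option. The sketch below is purely illustrative; the option name and default would be whatever the PR settles on:

```java
// Hypothetical: "vectorization-enabled" is an illustrative option name,
// not a confirmed one from this thread.
Dataset<Row> df = spark.read()
    .format("iceberg")
    .option("vectorization-enabled", "true")  // false would fall back to row reads
    .load("db.table");
```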
On Wed, Sep 4, 2019, 7:01 AM Gautam wrote:
Hello Devs,
As some of you know there's been ongoing work as part of
[1] to build Arrow based vectorization into Iceberg. There's a separate
thread on this dev list where that is being discussed and progress is being
tracked in a separate branch [2]. The overall approach there is