Hi Team,

We are currently working on enabling vectorized reads of Iceberg tables 
through Hive. This brings a significant performance benefit on its own, and we 
would like to contribute the code to the Iceberg codebase as well.

Adam Szita created a pull request for it: "Hive: Vectorized ORC reads for Hive 
#2613".
See: https://github.com/apache/iceberg/pull/2613

He wrote a good summary of the changes there.

I could review and merge the code myself, but we would really value input from 
the community on the changes.

What we have found so far:
- Any conversion between data formats is costly and seriously hurts 
performance.
- The Flink / Spark vectorized reads use a middle layer between the readers 
and the engines. When we tried that approach, performance suffered because of 
the conversion.
- The storage-api currently ships the Hive classes shaded to 
org.apache.orc.storage, so Hive cannot use them directly. Even though the 
classes are identical, we had to copy the data manually, which caused 
performance degradation again.
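To illustrate the last point, here is a toy sketch (using hypothetical stand-in classes, not the real storage-api types): after shading, the two column-vector classes are byte-for-byte identical but live in different packages, so no cast is possible and every batch must be copied value by value.

```java
// Toy stand-ins for the shaded and unshaded column vectors. In reality these
// would be org.apache.orc.storage.ql.exec.vector.LongColumnVector and
// org.apache.hadoop.hive.ql.exec.vector.LongColumnVector: same source,
// different packages after shading, hence no common supertype to cast through.
class ShadedLongColumnVector {
    long[] vector;
    ShadedLongColumnVector(int size) { vector = new long[size]; }
}

class HiveLongColumnVector {
    long[] vector;
    HiveLongColumnVector(int size) { vector = new long[size]; }
}

public class ShadingCopyDemo {
    // Without re-shading, the only option is to copy the data per batch,
    // which is exactly the overhead we measured.
    static HiveLongColumnVector copy(ShadedLongColumnVector src) {
        HiveLongColumnVector dst = new HiveLongColumnVector(src.vector.length);
        System.arraycopy(src.vector, 0, dst.vector, 0, src.vector.length);
        return dst;
    }

    public static void main(String[] args) {
        ShadedLongColumnVector src = new ShadedLongColumnVector(1024);
        for (int i = 0; i < src.vector.length; i++) {
            src.vector[i] = i;
        }
        HiveLongColumnVector dst = copy(src);
        System.out.println(dst.vector[1023]); // prints 1023
    }
}
```

Re-shading removes this copy entirely, because Hive can then consume the reader's batches directly.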

Because of the problems above, we:
- re-shaded the storage-api back to the original Hive packages to prevent 
object conversion, and
- use org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch as the HIVE 
IN_MEMORY_DATA_MODEL.

I would like to know what the Iceberg community thinks about this solution, 
especially the contributors to and reviewers of the other vectorized read 
implementations.

Thanks,
Peter
