Hello devs,
       As a follow-up to
https://github.com/apache/incubator-iceberg/issues/9 I've been reading
through how Spark does vectorized reading in its current implementation,
which is in the DataSource V1 path, to see how we can achieve the same
impact in Iceberg's read path. To start with, I want to form a
high-level understanding of the approach one would need to take to
achieve this. Pardon my ignorance, as I'm as new to the Spark codebase
as I am to Iceberg's. Please correct me if my understanding is wrong.

So here's what vectorization seems to be doing for Parquet reading (a
rough sketch follows the list):
- The DataSource scan execution uses ParquetFileFormat to build a
RecordReaderIterator [1], which underneath uses the
VectorizedParquetRecordReader.
- This record reader iterates over entire batches of columns
(ColumnarBatch): each iterator.next() call returns a batch, not a single
row. The interfaces are written so that a ColumnarBatch can be passed
around as a generic Object, as noted in [2].
- On the scan execution side, whole-stage code generation compiles code
that consumes entire batches at a time, so that physical operators take
advantage of the vectorization. In other words, the generated scan code
knows it is reading columnar batches out of the iterator.

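To make the second point concrete, here is a minimal Java sketch of the
type-erasure trick I believe RecordReaderIterator [2] is playing. The
class and method names below are mine, not Spark's; only InternalRow and
ColumnarBatch are real Spark types:

import java.util.Iterator;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Hypothetical mirror of RecordReaderIterator's trick: the element type
// is generic, so the same iterator class can yield one row at a time or
// a whole ColumnarBatch per next() call.
final class ErasedReaderIterator<T> implements Iterator<T> {
  private final Iterator<T> underlying;

  ErasedReaderIterator(Iterator<T> underlying) {
    this.underlying = underlying;
  }

  @Override
  public boolean hasNext() {
    return underlying.hasNext();
  }

  @Override
  public T next() {
    // With T = ColumnarBatch, this hands back thousands of rows per call.
    return underlying.next();
  }

  // The unchecked cast ParquetFileFormat performs in Scala, expressed in
  // Java: a batch iterator pretending to be a row iterator.
  @SuppressWarnings("unchecked")
  static Iterator<InternalRow> asRowIterator(Iterator<ColumnarBatch> batches) {
    return (Iterator<InternalRow>) (Iterator<?>) batches;
  }
}

If I'm reading ParquetFileFormat correctly, that unchecked cast is only
safe because the physical operator (when it supports batches) knows to
treat each element as a ColumnarBatch rather than a row.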

I'm wondering how one should approach this to achieve vectorization in
the Iceberg reader (DataSourceV2) path. For instance, if the Iceberg
reader were to wrap Arrow or ColumnarBatch behind an
Iterator[InternalRow] interface, it would still not work, right? Because
it seems to me a lot more would need to change upstream in the operator
execution path. It would be great if folks who are better versed in the
Spark codebase could shed some light on this. In general, what is the
contract needed between a V2 DataSourceReader (like Iceberg's) and the
operator execution?

thank you,
-Gautam.


[1]
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L412
[2]
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/RecordReaderIterator.scala#L29
