Re: Iceberg reading Parquet files to Arrow format

2022-02-05 Thread Russell Spitzer
I’m not sure what you are saying; for our implementation, vectorization *is* the Arrow format. That’s how we pass batches to Spark in vectorized mode. They cannot be separated in the Iceberg code, although I guess you could implement another columnar in-memory format by extending Spark’s ColumnarBatch …
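The coupling described here — a reader fills whole column vectors (the Arrow side) and hands them to the engine as a batch — can be pictured with a toy sketch. This is plain Python with hypothetical names, not Iceberg’s actual Java code, where Arrow vectors are wrapped in Spark’s ColumnarBatch:

```python
class ColumnVector:
    """One decoded column held contiguously, like an Arrow vector."""
    def __init__(self, values):
        self.values = list(values)

    def get(self, row_id):
        return self.values[row_id]


class ColumnarBatch:
    """A set of column vectors sharing one row count, handed to the engine whole."""
    def __init__(self, columns):
        self.columns = columns
        self.num_rows = len(columns[0].values) if columns else 0

    def row(self, row_id):
        # A "row" is just a view over the column vectors -- nothing is copied.
        return tuple(col.get(row_id) for col in self.columns)


batch = ColumnarBatch([ColumnVector([1, 2, 3]), ColumnVector(["a", "b", "c"])])
print(batch.num_rows)  # 3
print(batch.row(1))    # (2, 'b')
```

The point of the sketch is that the batch, not the row, is the unit of exchange: row access exists only as a cheap view over the column buffers.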

Re: Iceberg reading Parquet files to Arrow format

2022-02-05 Thread Mike Zhang
Thanks Russell! I wonder if the performance gain is mainly from vectorization rather than from using the Arrow format? My understanding is that the benefit of using Arrow is avoiding serialization/deserialization. I’m just having a hard time understanding how Iceberg uses Arrow to get that benefit. …

Re: Iceberg reading Parquet files to Arrow format

2022-02-05 Thread Russell Spitzer
One thing to note is that we never really go to "RDD" records, since we are always working within the DataFrame API. Spark builds RDDs but expects us to deliver data in one of two ways: row-based or columnar batches …
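The two delivery modes a source can use — one record at a time, or one batch of whole columns at a time — can be sketched as follows. The names are hypothetical and the shapes simplified; these are not Spark’s actual interfaces:

```python
def row_based_scan(columns):
    """Deliver data one row tuple at a time (engine works record by record)."""
    num_rows = len(next(iter(columns.values())))
    for i in range(num_rows):
        yield tuple(col[i] for col in columns.values())


def columnar_scan(columns, batch_size=2):
    """Deliver data as batches of whole column slices (engine works per batch)."""
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        yield {name: col[start:start + batch_size]
               for name, col in columns.items()}


data = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
rows = list(row_based_scan(data))     # [(1, 'a'), (2, 'b'), (3, 'c')]
batches = list(columnar_scan(data))   # rows 0-1 in one batch, row 2 in the next
```

In the columnar mode, per-record overhead (iterator calls, object construction) is paid once per batch rather than once per row, which is where the vectorized path earns its speedup.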

Iceberg reading Parquet files to Arrow format

2022-02-04 Thread Mike Zhang
I am reading the Iceberg code regarding the Parquet reading path and see that Parquet files are read into Arrow format first. I wonder how much performance gain we get by doing that. Let’s take the example of a Spark application with Iceberg. If the Parquet file is read directly into Spark RDD records …
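A toy sketch of the distinction the question is circling (illustrative only, not Iceberg code): a row-per-record path must materialize a new object for every record, while a columnar batch can simply reference the decoded column buffers, which is the zero-copy sharing Arrow is designed for:

```python
source = {"id": [1, 2, 3], "name": ["a", "b", "c"]}

# Row-materializing path: every record becomes a freshly built object,
# so each value is copied out of its column into the record.
rdd_like_records = [{name: col[i] for name, col in source.items()}
                    for i in range(3)]

# Columnar path: the batch merely references the existing column buffers,
# so handing it downstream is effectively zero-copy.
arrow_like_batch = {name: col for name, col in source.items()}

assert arrow_like_batch["id"] is source["id"]       # same buffer, no copy
assert rdd_like_records[0] == {"id": 1, "name": "a"}  # one object per record
```

Under this framing, the gain is not from Arrow as a wire format per se but from decoding each column once into a reusable buffer and avoiding per-record object churn.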