Alright, so we are talking about reading Parquet data into ArrowRecordBatches 
and then exposing them as ColumnarBatches in Spark, where Spark ColumnVectors 
actually wrap Arrow FieldVectors, correct?
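
Something like this minimal sketch, I assume (using Spark's public 
ArrowColumnVector wrapper and the Spark 2.4 ColumnarBatch constructor; 
toColumnarBatch is just an illustrative name):

    import scala.collection.JavaConverters._
    import org.apache.arrow.vector.VectorSchemaRoot
    import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

    // Wrap each Arrow FieldVector in Spark's ArrowColumnVector and
    // assemble the wrappers into a single ColumnarBatch.
    def toColumnarBatch(root: VectorSchemaRoot): ColumnarBatch = {
      val columns: Array[ColumnVector] =
        root.getFieldVectors.asScala
          .map(v => new ArrowColumnVector(v): ColumnVector)
          .toArray
      val batch = new ColumnarBatch(columns)
      batch.setNumRows(root.getRowCount)
      batch
    }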

- Anton

> On 28 May 2019, at 21:24, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> From a performance viewpoint, this isn’t a great solution. The row-by-row 
> approach will substantially hurt performance compared to the vectorized 
> reader. I’ve seen a 30% or more speedup when removing row-by-row access. So 
> putting a row-by-row adapter in the middle of two vectorized representations 
> is pretty costly.
> 
> Iceberg doesn’t impose this requirement; it is how Spark consumes the rows 
> itself, one at a time: 
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L138
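> 
> Roughly, that row-at-a-time consumption amounts to this sketch (rowIterator 
> is the real ColumnarBatch method; consume is just an illustrative name):
> 
>     import org.apache.spark.sql.catalyst.InternalRow
>     import org.apache.spark.sql.vectorized.ColumnarBatch
> 
>     // Even when the source is vectorized, the execution path pulls
>     // rows one at a time from the batch's row iterator.
>     def consume(batch: ColumnarBatch): Unit = {
>       val rows = batch.rowIterator() // java.util.Iterator[InternalRow]
>       while (rows.hasNext) {
>         val row: InternalRow = rows.next() // row-at-a-time access
>         // ... per-row operator processing ...
>       }
>     }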
> 
> By exposing Arrow data as Spark’s ColumnarBatch, we should pick up any 
> benefits from improved execution when Spark is updated.
> 
> 
> On Tue, May 28, 2019 at 12:33 PM Owen O'Malley <owen.omal...@gmail.com> wrote:
> 
> 
> On Fri, May 24, 2019 at 8:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> if Iceberg Reader were to wrap Arrow or ColumnarBatch behind an 
> Iterator[InternalRow] interface, it would still not work, right? Because it 
> seems to me there is a lot more going on upstream in the operator execution 
> path that would need to be done here.
> 
> There’s already a wrapper to adapt Arrow to ColumnarBatch, as well as an 
> iterator to read a ColumnarBatch as a sequence of InternalRow. That’s what we 
> want to take advantage of. You’re right that the first thing that Spark does 
> is get each row as InternalRow. But we still get a benefit from 
> vectorizing the data materialization to Arrow itself. Spark execution is not 
> vectorized, but that can be updated in Spark later (I think there’s a 
> proposal).
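> 
> To make that concrete, the adapter is essentially this (a rough sketch; 
> asRowIterator is just an illustrative name, rowIterator is the actual 
> ColumnarBatch method):
> 
>     import scala.collection.JavaConverters._
>     import org.apache.spark.sql.catalyst.InternalRow
>     import org.apache.spark.sql.vectorized.ColumnarBatch
> 
>     // Expose vectorized batches behind the Iterator[InternalRow]
>     // interface that Spark's current execution path consumes.
>     def asRowIterator(batches: Iterator[ColumnarBatch]): Iterator[InternalRow] =
>       batches.flatMap(_.rowIterator().asScala)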
> 
> From a performance viewpoint, this isn't a great solution. The row-by-row 
> approach will substantially hurt performance compared to the vectorized 
> reader. I've seen a 30% or more speedup when removing row-by-row access. So 
> putting a row-by-row adapter in the middle of two vectorized representations 
> is pretty costly.
> 
> .. Owen
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
