Re: Approaching Vectorized Reading in Iceberg ..

2019-05-24 Thread Gautam
> There’s already a wrapper to adapt Arrow to ColumnarBatch, as well as an iterator to read a ColumnarBatch as a sequence of InternalRow. That’s what we want to take advantage of. You’re right that the first thing that Spark does is to get each row as InternalRow. But we still get a benefit from ve

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-24 Thread Ryan Blue
If Iceberg Reader were to wrap Arrow or ColumnarBatch behind an Iterator[InternalRow] interface, it would still not work, right? Coz it seems to me there is a lot more going on upstream in the operator execution path that would need to be done here. There’s already a wrapper to adapt Arrow to C

Approaching Vectorized Reading in Iceberg ..

2019-05-24 Thread Gautam
Hello devs, As a follow-up to https://github.com/apache/incubator-iceberg/issues/9 I've been reading through how Spark does vectorized reading in its current implementation, which is in the DataSource V1 path. Trying to see how we can achieve the same impact in Iceberg's reading. To start with I

Re: Updates/Deletes/Upserts in Iceberg

2019-05-24 Thread Ryan Blue
Yes, I agree. I'll talk a little about a couple of the constraints of this as well. On Fri, May 24, 2019 at 5:52 AM Anton Okolnychyi wrote: > The agenda looks good to me. I think it would also make sense to clarify > the responsibilities of query engines and Iceberg. Not only in terms of > uniqu

Re: Updates/Deletes/Upserts in Iceberg

2019-05-24 Thread Anton Okolnychyi
The agenda looks good to me. I think it would also make sense to clarify the responsibilities of query engines and Iceberg. Not only in terms of uniqueness, but also in terms of applying diffs on read, for example. > On 23 May 2019, at 01:59, Ryan Blue wrote: > > Here’s a rough agenda: > > U