Yes, will do so early next week if not sooner. thanks, Anjali.
On Thu, Aug 8, 2019 at 4:45 PM Gautam Kowshik <gautamkows...@gmail.com> wrote:

> Thanks Anjali and Samarth,
> These look good! Great progress. Can you push your changes to the
> vectorized-read branch, please?
>
> Sent from my iPhone
>
> On Aug 8, 2019, at 11:56 AM, Anjali Norwood <anorw...@netflix.com> wrote:
>
> Good suggestion, Ryan. Added dev@iceberg now.
>
> Dev: Please see the early vectorized Iceberg performance results a couple
> of emails down. This is WIP.
>
> thanks,
> Anjali.
>
> On Thu, Aug 8, 2019 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> Hi everyone,
>>
>> Is it possible to copy the Iceberg dev list when sending these emails?
>> There are other people in the community who are interested, like Palantir.
>> If there isn't anything sensitive, then let's try to be more inclusive.
>> Thanks!
>>
>> rb
>>
>> On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood <anorw...@netflix.com>
>> wrote:
>>
>>> Hi Gautam, Padma,
>>> We wanted to update you before Gautam takes off for vacation.
>>>
>>> Samarth and I profiled the code and found the following. Profiling
>>> IcebergSourceFlatParquetDataReadBenchmark (10 files, 10M rows, a single
>>> long column) with VisualVM shows two places where CPU time can be
>>> optimized:
>>> 1) The iterator abstractions (triple iterators, page iterators, etc.)
>>> take up quite a bit of time. Dropping these iterators, or making them
>>> batched iterators and moving the reading of the data closer to the file,
>>> should help ameliorate this problem.
>>> 2) The current code goes back and forth between definition-level reads
>>> and value reads through the layers of iterators, and quite a bit of CPU
>>> time is spent there. Reading a batch of primitive values at once after
>>> consulting the definition level should improve performance.
>>>
>>> So we prototyped code that walks over the definition levels and reads
>>> the corresponding values in batches (read values until we hit a null,
>>> then read nulls until we hit values, and so on) and made the iterators
>>> batched iterators. Here are the results:
>>>
>>> Benchmark                                                               Mode  Cnt   Score   Error  Units
>>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized    ss    5  10.247 ± 0.202   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized       ss    5   3.747 ± 0.206   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIceberg                    ss    5  11.286 ± 0.457   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k      ss    5   6.088 ± 0.324   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k       ss    5   5.875 ± 0.378   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k        ss    5   6.029 ± 0.387   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k        ss    5   6.106 ± 0.497   s/op
>>>
>>> Moreover, as I mentioned to Gautam in chat, we prototyped reading the
>>> string column as a byte array without decoding it to UTF-8 (the changes
>>> above were not in place at the time), and we saw a significant
>>> performance improvement there as well (21.18 s before vs. 13.031 s with
>>> the change). Used along with the batched iterators, these numbers should
>>> get even better.
>>>
>>> Note that we haven't tightened/profiled the new code yet (we will start
>>> on that next); we just wanted to share some early positive results.
>>>
>>> regards,
>>> Anjali.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
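To make the batched definition-level walk described in the thread concrete, here is a minimal Java sketch. It assumes the decoded definition levels and plain values are already sitting in arrays (the real prototype works against Parquet's page readers); `BatchedColumnReader` and every name inside it are illustrative stand-ins, not the prototype's actual classes.

```java
import java.util.Arrays;

// Illustrative sketch only: batches value reads by runs of equal
// definition levels instead of consulting the level once per value.
public class BatchedColumnReader {

    // Reads up to batchSize values of an optional column (max def level 1):
    // read values until a null run starts, then skip nulls until values
    // resume, and so on, bulk-copying each run in one step.
    static int readBatch(int[] defLevels, int offset, long[] plainValues,
                         int valueOffset, long[] out, boolean[] isNull,
                         int batchSize) {
        int read = 0;
        int v = valueOffset;
        while (read < batchSize && offset + read < defLevels.length) {
            int level = defLevels[offset + read];
            // Scan ahead for the end of the current run of equal levels.
            int runEnd = offset + read;
            while (runEnd < defLevels.length && defLevels[runEnd] == level
                   && (runEnd - offset) < batchSize) {
                runEnd++;
            }
            int runLen = runEnd - (offset + read);
            if (level == 1) {
                // Non-null run: copy the whole run of values at once.
                System.arraycopy(plainValues, v, out, read, runLen);
                Arrays.fill(isNull, read, read + runLen, false);
                v += runLen;
            } else {
                // Null run: no values to consume, just mark the nulls.
                Arrays.fill(isNull, read, read + runLen, true);
            }
            read += runLen;
        }
        return read;
    }

    public static void main(String[] args) {
        int[] defLevels = {1, 1, 0, 0, 1, 0, 1, 1};  // 1 = value, 0 = null
        long[] values = {10, 20, 30, 40, 50};        // non-null values only
        long[] out = new long[8];
        boolean[] isNull = new boolean[8];
        int n = readBatch(defLevels, 0, values, 0, out, isNull, 8);
        for (int i = 0; i < n; i++) {
            System.out.print(isNull[i] ? "null " : out[i] + " ");
        }
        // prints: 10 20 null null 30 null 40 50
    }
}
```

The point of the run-based loop is that the per-value branch on the definition level (the back-and-forth the profile flagged) collapses into one branch per run plus a bulk copy.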
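The UTF-8 observation can also be seen in isolation: decoding each Parquet binary into a java.lang.String pays a charset decode and a copy per value, whereas handing the raw bytes straight to the engine (for example via Spark's UTF8String.fromBytes, which wraps bytes without decoding) skips that work entirely. The toy comparison below is illustrative only; it is not the prototype code and not a rigorous JMH benchmark (the JIT may partly optimize the pass-through loop away).

```java
import java.nio.charset.StandardCharsets;

// Toy contrast of per-value UTF-8 decoding vs. passing raw bytes through.
public class Utf8DecodeSketch {
    public static void main(String[] args) {
        byte[] raw = "a typical utf8 column value".getBytes(StandardCharsets.UTF_8);
        int n = 10_000_000;

        // Decoding path: one charset decode + String allocation per value.
        long t0 = System.nanoTime();
        long decodedChars = 0;
        for (int i = 0; i < n; i++) {
            decodedChars += new String(raw, StandardCharsets.UTF_8).length();
        }
        long t1 = System.nanoTime();

        // Pass-through path: keep the raw bytes; a consumer such as Spark's
        // UTF8String.fromBytes(byte[]) can wrap them without decoding.
        long rawBytes = 0;
        for (int i = 0; i < n; i++) {
            rawBytes += raw.length;
        }
        long t2 = System.nanoTime();

        System.out.printf("decode: %d ms, pass-through: %d ms (%d, %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000,
                decodedChars, rawBytes);
    }
}
```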