It is merged to the 'vectorized-read' branch now. Thanks Ryan. -Anjali.
On Mon, Aug 12, 2019 at 6:12 PM Anjali Norwood <anorw...@netflix.com> wrote:

Hi Padma, Gautam, All,

Our (Samarth's and mine) WIP vectorized code is here:
https://github.com/anjalinorwood/incubator-iceberg/pull/1. Dan, can you
please merge it to the 'vectorized-read' branch when you get a chance?
Thanks!

regards,
Anjali.


On Mon, Aug 12, 2019 at 10:49 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

Li,

You're right that the 10k and similar numbers indicate the batch size.

Scores can be interpreted using the "Units" column at the end. In this
case, that is seconds per operation, so lower is better.

Error is the measurement error. It indicates confidence that the actual
rate of execution is, for example, within 0.378 of the average 5.875
seconds per operation, so between around 5.50 and 6.25 seconds per op.


On Sun, Aug 11, 2019 at 7:11 PM timmycheng(程力) <timmych...@tencent.com> wrote:

Thanks for broadcasting! I just have a few questions to better understand
the awesome work.

Could you give a little more detail on the Score and Error columns? Does
Error mean every time the query hits a null?

Shall I assume 5k/10k means the number of rows? What do we learn from
comparing to IcebergSourceFlatParquetDataReadBenchmark.readIceberg? Or
rather, what numbers are we comparing against?

-Li


From: Anjali Norwood <anorw...@netflix.com>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Date: Saturday, August 10, 2019, 4:47 AM
To: Ryan Blue <rb...@netflix.com>, "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Cc: Gautam <gautamkows...@gmail.com>, "ppa...@apache.org" <ppa...@apache.org>, Samarth Jain <sj...@netflix.com>, Daniel Weeks <dwe...@netflix.com>
Subject: Re: Encouraging performance results for Vectorized Iceberg code (Internet mail)

Good suggestion Ryan. Added dev@iceberg now.
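Ryan's reading of the Score and Error columns amounts to simple interval
arithmetic; a minimal illustration (not part of the benchmark code) using the
readIcebergVectorized10k numbers from the results below:

```java
// Quick check of the score/error interpretation: a score of 5.875 s/op with
// an error of ±0.378 puts the expected per-operation time roughly between
// 5.497 and 6.253 seconds per op.
public class JmhErrorInterval {
    public static void main(String[] args) {
        double score = 5.875;   // mean seconds per operation reported by JMH
        double error = 0.378;   // measurement error reported by JMH
        System.out.printf("%.3f .. %.3f s/op%n", score - error, score + error);
    }
}
```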
Dev: Please see the early vectorized Iceberg performance results a couple of
emails down. This is WIP.

thanks,
Anjali.


On Thu, Aug 8, 2019 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:

Hi everyone,

Is it possible to copy the Iceberg dev list when sending these emails?
There are other people in the community that are interested, like Palantir.
If there isn't anything sensitive, then let's try to be more inclusive.
Thanks!

rb


On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood <anorw...@netflix.com> wrote:

Hi Gautam, Padma,

We wanted to update you before Gautam takes off for vacation.

Samarth and I profiled the code and found the following. Profiling
IcebergSourceFlatParquetDataReadBenchmark (10 files, 10M rows, a single
long column) with VisualVM shows two places where CPU time can be optimized:

1) Iterator abstractions (triple iterators, page iterators, etc.) take up
quite a bit of time. Not using these iterators, or making them 'batched'
iterators and moving the reading of the data closer to the file, should
help ameliorate this problem.

2) The current code goes back and forth between definition-level and value
reads through the layers of iterators, and quite a bit of CPU time is spent
there. Reading a batch of primitive values at once after consulting the
definition level should improve performance.

So, we prototyped code that walks over the definition levels and reads the
corresponding values in batches (read values until we hit a null, then read
nulls until we hit values, and so on), and made the iterators batched
iterators.
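The batching scheme described above (read values until a null, then nulls
until a value) can be sketched roughly as follows. This is a simplified,
standalone illustration, not the actual Iceberg/Parquet reader code; the
class name, the int[] inputs, and maxDefLevel are hypothetical.

```java
import java.util.Arrays;

// Walk the definition levels and turn them into runs, so values can be
// copied in bulk instead of one triple-iterator step at a time.
public class DefLevelBatcher {
    // Decode one column chunk: defLevels[i] == maxDefLevel means row i has a
    // value; anything lower means null. Returns the decoded column, with
    // nulls represented as null Integer entries.
    static Integer[] decode(int[] defLevels, int[] values, int maxDefLevel) {
        Integer[] out = new Integer[defLevels.length];
        int v = 0;  // index into the dense (non-null) values array
        int i = 0;
        while (i < defLevels.length) {
            int start = i;
            boolean present = defLevels[i] == maxDefLevel;
            // Extend the run while the null/non-null state is unchanged.
            while (i < defLevels.length && (defLevels[i] == maxDefLevel) == present) {
                i++;
            }
            if (present) {
                // Bulk-copy the whole run of values at once (the "batched" read).
                for (int j = start; j < i; j++) {
                    out[j] = values[v++];
                }
            }
            // For a null run there is nothing to read; out[start..i) stays null.
        }
        return out;
    }

    public static void main(String[] args) {
        int[] defLevels = {1, 1, 0, 0, 1};  // two values, two nulls, one value
        int[] values = {10, 20, 30};        // only non-null values are stored
        System.out.println(Arrays.toString(decode(defLevels, values, 1)));
        // prints [10, 20, null, null, 30]
    }
}
```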
Here are the results:

Benchmark                                                              Mode  Cnt   Score   Error  Units
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized    ss    5  10.247 ± 0.202   s/op
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized       ss    5   3.747 ± 0.206   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIceberg                    ss    5  11.286 ± 0.457   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k      ss    5   6.088 ± 0.324   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k       ss    5   5.875 ± 0.378   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k        ss    5   6.029 ± 0.387   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k        ss    5   6.106 ± 0.497   s/op

Moreover, as I mentioned to Gautam on chat, we prototyped reading the
string column as a byte array without decoding it into UTF-8 (the above
changes were not made at the time), and we saw a significant performance
improvement there (21.18 seconds before vs. 13.031 seconds with the
change). When used along with batched iterators, these numbers should get
better.

Note that we haven't tightened/profiled the new code yet (we will start on
that next); we just wanted to share some early positive results.

regards,
Anjali.

--
Ryan Blue
Software Engineer
Netflix
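The string optimization mentioned above (skip eager UTF-8 decoding of binary
column values) can be illustrated with a small sketch. This is not the
prototype code; the LazyUtf8String wrapper below is hypothetical and only
shows the idea of deferring the decode until a String is actually needed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hold the raw UTF-8 bytes as stored in the file and decode lazily, at most
// once, so scans that never materialize a String skip the decode cost.
public class LazyUtf8String {
    private final byte[] utf8;  // raw bytes from the file
    private String decoded;     // filled in on first toString() call

    LazyUtf8String(byte[] utf8) {
        this.utf8 = utf8;
    }

    byte[] rawBytes() {
        return utf8;            // cheap path: no decoding at all
    }

    @Override
    public String toString() {
        if (decoded == null) {
            decoded = new String(utf8, StandardCharsets.UTF_8);  // pay the cost only here
        }
        return decoded;
    }

    public static void main(String[] args) {
        LazyUtf8String s = new LazyUtf8String("iceberg".getBytes(StandardCharsets.UTF_8));
        // Equality checks and filters can work on raw bytes without decoding:
        System.out.println(Arrays.equals(s.rawBytes(), "iceberg".getBytes(StandardCharsets.UTF_8)));
        System.out.println(s);  // decodes here, once
    }
}
```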