Re: Encouraging performance results for Vectorized Iceberg code(Internet mail)

Anjali Norwood Mon, 12 Aug 2019 18:13:07 -0700

Hi Padma, Gautam, All,

Our (Samarth's and mine) WIP vectorized code is here:
https://github.com/anjalinorwood/incubator-iceberg/pull/1.
Dan, can you please merge it to 'vectorized-read' branch when you get a
chance? Thanks!


regards,
Anjali.




On Mon, Aug 12, 2019 at 10:49 AM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Li,
>
> You're right that the 10k and similar numbers indicate the batch size.
>
> Scores can be interpreted using the "units" column at the end. In this
> case, seconds per operation, so lower is better.
>
> Error is the measurement error. This indicates confidence that the actual
> rate of execution is, for example, within 0.378 of the average 5.875
> seconds per operation, so between around 5.50 and 6.25 seconds per op.
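
As a quick check of the arithmetic above, here is a minimal Python sketch that turns a score and its error into the interval being described. The function name `score_interval` is made up for illustration; the two numbers are the readIcebergVectorized10k score and error from the results further down the thread.

```python
def score_interval(score, error):
    """Return the (low, high) bounds implied by a 'score ± error' reading."""
    return score - error, score + error


# readIcebergVectorized10k: 5.875 ± 0.378 s/op
low, high = score_interval(5.875, 0.378)
print(f"{low:.3f} to {high:.3f} s/op")
```

The bounds come out at roughly 5.50 and 6.25 seconds per operation, matching the range quoted above.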
>
> On Sun, Aug 11, 2019 at 7:11 PM timmycheng(程力) <timmych...@tencent.com>
> wrote:
>
>> Thanks for broadcasting! Just have a few questions to better understand
>> the awesome work.
>>
>>
>>
>> Could you give a little more detail on the score and error columns? Does
>> error mean every time the query hits a null?
>>
>> Shall I assume 5k/10k means the number of rows? What do we learn from
>> comparing to IcebergSourceFlatParquetDataReadBenchmark.readIceberg? Or
>> rather, what numbers are we comparing to?
>>
>>
>>
>> -Li
>>
>>
>>
>> From: Anjali Norwood <anorw...@netflix.com>
>> Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
>> Date: Saturday, August 10, 2019, 4:47 AM
>> To: Ryan Blue <rb...@netflix.com>, "dev@iceberg.apache.org" <
>> dev@iceberg.apache.org>
>> Cc: Gautam <gautamkows...@gmail.com>, "ppa...@apache.org" <
>> ppa...@apache.org>, Samarth Jain <sj...@netflix.com>, Daniel Weeks <
>> dwe...@netflix.com>
>> Subject: Re: Encouraging performance results for Vectorized Iceberg
>> code(Internet mail)
>>
>>
>>
>> Good suggestion Ryan. Added dev@iceberg now.
>>
>>
>>
>> Dev: Please see early vectorized Iceberg performance results a couple of
>> emails down. This is WIP.
>>
>>
>>
>> thanks,
>>
>> Anjali.
>>
>>
>>
>> On Thu, Aug 8, 2019 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>> Hi everyone,
>>
>>
>>
>> Is it possible to copy the Iceberg dev list when sending these emails?
>> There are other people in the community that are interested, like Palantir.
>> If there isn't anything sensitive then let's try to be more inclusive.
>> Thanks!
>>
>>
>>
>> rb
>>
>>
>>
>> On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood <anorw...@netflix.com>
>> wrote:
>>
>> Hi Gautam, Padma,
>> We wanted to update you before Gautam takes off for vacation.
>>
>> Samarth and I profiled the code and found the following:
>> Profiling the IcebergSourceFlatParquetDataReadBenchmark (10 files, 10M
>> rows, a single long column) using VisualVM shows two places where CPU time
>> can be optimized:
>> 1) Iterator abstractions (triple iterators, page iterators, etc.) seem to
>> take up quite a bit of time. Not using these iterators or making them
>> 'batched' iterators and moving the reading of the data close to the file
>> should help ameliorate this problem.
>> 2) Current code goes back and forth between definition levels and value
>> reads through the levels of iterators. Quite a bit of CPU time is spent
>> here. Reading a batch of primitive values at once after consulting the
>> definition level should help improve performance.
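
To make point 2 concrete, here is a minimal Python sketch of run-based reading driven by definition levels. It is an illustration only, not the prototype's code: the function `read_batched`, its parameters, and the plain lists standing in for decoded page data are all assumptions. It also assumes a flat optional column, where a definition level equal to the maximum (1 here) means a value is present and anything lower means null.

```python
from itertools import groupby


def read_batched(def_levels, values, max_def_level=1):
    """Rebuild a nullable column by reading runs instead of single entries.

    def_levels: one definition level per row.
    values: only the non-null values, in row order.
    """
    out = []
    pos = 0  # next unread position in `values`
    # Group consecutive rows that are all present or all null.
    for present, run in groupby(def_levels, key=lambda d: d == max_def_level):
        run_len = sum(1 for _ in run)
        if present:
            # One bulk copy for the whole run of non-null rows.
            out.extend(values[pos:pos + run_len])
            pos += run_len
        else:
            # One bulk fill for the whole run of nulls.
            out.extend([None] * run_len)
    return out


column = read_batched([1, 1, 0, 0, 1, 0], [10, 20, 30])
print(column)
```

The point of the sketch is that the work per row shrinks to a comparison on the definition level, while values and nulls are moved in bulk once per run rather than once per row through several iterator layers.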
>>
>> So, we prototyped the code to walk over the definition levels and read
>> corresponding values in batches (read values till we hit a null, then read
>> nulls till we hit values and so on) and made the iterators batched
>> iterators. Here are the results:
>>
>>  Benchmark                                                              Mode  Cnt   Score   Error  Units
>>  IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized    ss    5  10.247 ± 0.202   s/op
>> *IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized       ss    5   3.747 ± 0.206   s/op*
>>
>> *IcebergSourceFlatParquetDataReadBenchmark.readIceberg                    ss    5  11.286 ± 0.457   s/op*
>>  IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k      ss    5   6.088 ± 0.324   s/op
>> *IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k       ss    5   5.875 ± 0.378   s/op*
>>  IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k        ss    5   6.029 ± 0.387   s/op
>>  IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k        ss    5   6.106 ± 0.497   s/op
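
For readers who want the relative gains rather than raw times, the following Python sketch computes speedup ratios from the scores in the table above. The dictionary and the `speedup` helper are illustrative names only; the numbers are copied from the table, and the error margins are ignored here, so the ratios are approximate.

```python
# Scores in seconds per operation, copied from the table above (lower is better).
scores = {
    "readFileSourceNonVectorized": 10.247,
    "readFileSourceVectorized": 3.747,
    "readIceberg": 11.286,
    "readIcebergVectorized100k": 6.088,
    "readIcebergVectorized10k": 5.875,
    "readIcebergVectorized1k": 6.029,
    "readIcebergVectorized5k": 6.106,
}


def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return scores[baseline] / scores[candidate]


iceberg = speedup("readIceberg", "readIcebergVectorized10k")
file_source = speedup("readFileSourceNonVectorized", "readFileSourceVectorized")
print(f"Iceberg, 10k batch: {iceberg:.2f}x; file source: {file_source:.2f}x")
```

On these numbers the vectorized Iceberg read with 10k batches is a little under 2x faster than the non-vectorized Iceberg read, and the four batch sizes differ from each other by less than their reported errors.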
>>
>>
>>
>> Moreover, as I mentioned to Gautam on chat, we prototyped reading the
>> string column as a byte array without decoding it into UTF8 (the above
>> changes had not been made at the time) and we saw significant performance
>> improvements there (21.18 secs before vs. 13.031 secs with the change).
>> When used along with batched iterators, these numbers should get better.
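
The idea of keeping string data as raw bytes can be sketched in Python as follows. This is a hypothetical illustration of deferred decoding in general, not the prototype's implementation: the `LazyString` class and its fields are invented for this example.

```python
class LazyString:
    """Holds raw UTF-8 bytes and decodes them only on first access."""

    def __init__(self, raw: bytes):
        self.raw = raw
        self._decoded = None  # filled in the first time str() is called

    def __str__(self):
        if self._decoded is None:
            self._decoded = self.raw.decode("utf-8")
        return self._decoded


# A "column" of strings that arrives as bytes and stays that way.
column = [LazyString(b"iceberg"), LazyString("caf\u00e9".encode("utf-8"))]

# Filtering on a byte prefix needs no decoding at all.
matches = [s for s in column if s.raw.startswith(b"ice")]
print(str(matches[0]))
```

Rows that are filtered out or never displayed skip the decoding cost entirely, which is the kind of saving the byte-array read is aiming for.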
>>
>>
>>
>> Note that we haven't tightened/profiled the new code yet (we will start
>> on that next). Just wanted to share some early positive results.
>>
>>
>>
>> regards,
>>
>> Anjali.
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
