It is merged to the 'vectorized-read' branch now. Thanks Ryan. -Anjali.
On Mon, Aug 12, 2019 at 6:12 PM Anjali Norwood <anorw...@netflix.com> wrote:

Hi Padma, Gautam, All,

Our (Samarth's and mine) WIP vectorized code is here:
https://github.com/anjalinorwood/incubator-iceberg/pull/1. Dan, can you
please merge it to the 'vectorized-read' branch when you get a chance?
Thanks!

regards,
Anjali.


On Mon, Aug 12, 2019 at 10:49 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

Li,

You're right that the 10k and similar numbers indicate the batch size.

Scores can be interpreted using the "Units" column at the end. In this
case, that is seconds per operation, so lower is better.

Error is the measurement error. It indicates confidence that the actual
rate of execution is, for example, within 0.378 of the average 5.875
seconds per operation, so between around 5.50 and 6.25 seconds per op.


On Sun, Aug 11, 2019 at 7:11 PM timmycheng(程力) <timmych...@tencent.com> wrote:

Thanks for broadcasting! I just have a few questions to better understand
the awesome work.

Could you give a little more detail on the Score and Error columns? Does
Error mean every time the query hits a null?

Shall I assume 5k/10k means the number of rows? What do we learn from
comparing to IcebergSourceFlatParquetDataReadBenchmark.readIceberg? Or
rather, what numbers are we comparing against?

-Li


From: Anjali Norwood <anorw...@netflix.com>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Date: Saturday, August 10, 2019, 4:47 AM
To: Ryan Blue <rb...@netflix.com>, "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Cc: Gautam <gautamkows...@gmail.com>, "ppa...@apache.org" <ppa...@apache.org>, Samarth Jain <sj...@netflix.com>, Daniel Weeks <dwe...@netflix.com>
Subject: Re: Encouraging performance results for Vectorized Iceberg code (Internet mail)

Good suggestion Ryan. Added dev@iceberg now.
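Ryan's reading of the Score and Error columns amounts to simple interval
arithmetic; a minimal illustration (not part of the benchmark code) using the
readIcebergVectorized10k numbers from the results below:

```java
// Quick check of the score/error interpretation: a score of 5.875 s/op with
// an error of ±0.378 puts the expected per-operation time roughly between
// 5.497 and 6.253 seconds per op.
public class JmhErrorInterval {
    public static void main(String[] args) {
        double score = 5.875;   // mean seconds per operation reported by JMH
        double error = 0.378;   // measurement error reported by JMH
        System.out.printf("%.3f .. %.3f s/op%n", score - error, score + error);
    }
}
```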
Dev: Please see the early vectorized Iceberg performance results a couple of
emails down. This is WIP.

thanks,
Anjali.


On Thu, Aug 8, 2019 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:

Hi everyone,

Is it possible to copy the Iceberg dev list when sending these emails?
There are other people in the community that are interested, like Palantir.
If there isn't anything sensitive, then let's try to be more inclusive.
Thanks!

rb


On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood <anorw...@netflix.com> wrote:

Hi Gautam, Padma,

We wanted to update you before Gautam takes off for vacation.

Samarth and I profiled the code and found the following. Profiling
IcebergSourceFlatParquetDataReadBenchmark (10 files, 10M rows, a single
long column) with VisualVM shows two places where CPU time can be optimized:

1) Iterator abstractions (triple iterators, page iterators, etc.) take up
quite a bit of time. Not using these iterators, or making them 'batched'
iterators and moving the reading of the data closer to the file, should
help ameliorate this problem.

2) The current code goes back and forth between definition-level and value
reads through the layers of iterators, and quite a bit of CPU time is spent
there. Reading a batch of primitive values at once after consulting the
definition level should improve performance.

So, we prototyped code that walks over the definition levels and reads the
corresponding values in batches (read values until we hit a null, then read
nulls until we hit values, and so on), and made the iterators batched
iterators.
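The batching scheme described above (read values until a null, then nulls
until a value) can be sketched roughly as follows. This is a simplified,
standalone illustration, not the actual Iceberg/Parquet reader code; the
class name, the int[] inputs, and maxDefLevel are hypothetical.

```java
import java.util.Arrays;

// Walk the definition levels and turn them into runs, so values can be
// copied in bulk instead of one triple-iterator step at a time.
public class DefLevelBatcher {
    // Decode one column chunk: defLevels[i] == maxDefLevel means row i has a
    // value; anything lower means null. Returns the decoded column, with
    // nulls represented as null Integer entries.
    static Integer[] decode(int[] defLevels, int[] values, int maxDefLevel) {
        Integer[] out = new Integer[defLevels.length];
        int v = 0;  // index into the dense (non-null) values array
        int i = 0;
        while (i < defLevels.length) {
            int start = i;
            boolean present = defLevels[i] == maxDefLevel;
            // Extend the run while the null/non-null state is unchanged.
            while (i < defLevels.length && (defLevels[i] == maxDefLevel) == present) {
                i++;
            }
            if (present) {
                // Bulk-copy the whole run of values at once (the "batched" read).
                for (int j = start; j < i; j++) {
                    out[j] = values[v++];
                }
            }
            // For a null run there is nothing to read; out[start..i) stays null.
        }
        return out;
    }

    public static void main(String[] args) {
        int[] defLevels = {1, 1, 0, 0, 1};  // two values, two nulls, one value
        int[] values = {10, 20, 30};        // only non-null values are stored
        System.out.println(Arrays.toString(decode(defLevels, values, 1)));
        // prints [10, 20, null, null, 30]
    }
}
```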
Here are the results:

Benchmark                                                              Mode  Cnt   Score   Error  Units
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized    ss    5  10.247 ± 0.202   s/op
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized       ss    5   3.747 ± 0.206   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIceberg                    ss    5  11.286 ± 0.457   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k      ss    5   6.088 ± 0.324   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k       ss    5   5.875 ± 0.378   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k        ss    5   6.029 ± 0.387   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k        ss    5   6.106 ± 0.497   s/op

Moreover, as I mentioned to Gautam on chat, we prototyped reading the
string column as a byte array without decoding it into UTF-8 (the above
changes were not made at the time), and we saw a significant performance
improvement there (21.18 seconds before vs. 13.031 seconds with the
change). When used along with batched iterators, these numbers should get
better.

Note that we haven't tightened/profiled the new code yet (we will start on
that next); we just wanted to share some early positive results.

regards,
Anjali.

--
Ryan Blue
Software Engineer
Netflix
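The string optimization mentioned above (skip eager UTF-8 decoding of binary
column values) can be illustrated with a small sketch. This is not the
prototype code; the LazyUtf8String wrapper below is hypothetical and only
shows the idea of deferring the decode until a String is actually needed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hold the raw UTF-8 bytes as stored in the file and decode lazily, at most
// once, so scans that never materialize a String skip the decode cost.
public class LazyUtf8String {
    private final byte[] utf8;  // raw bytes from the file
    private String decoded;     // filled in on first toString() call

    LazyUtf8String(byte[] utf8) {
        this.utf8 = utf8;
    }

    byte[] rawBytes() {
        return utf8;            // cheap path: no decoding at all
    }

    @Override
    public String toString() {
        if (decoded == null) {
            decoded = new String(utf8, StandardCharsets.UTF_8);  // pay the cost only here
        }
        return decoded;
    }

    public static void main(String[] args) {
        LazyUtf8String s = new LazyUtf8String("iceberg".getBytes(StandardCharsets.UTF_8));
        // Equality checks and filters can work on raw bytes without decoding:
        System.out.println(Arrays.equals(s.rawBytes(), "iceberg".getBytes(StandardCharsets.UTF_8)));
        System.out.println(s);  // decodes here, once
    }
}
```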