Yes, will do so early next week if not sooner. thanks, Anjali.
On Thu, Aug 8, 2019 at 4:45 PM Gautam Kowshik <gautamkows...@gmail.com> wrote:

> Thanks Anjali and Samarth,
> These look good! Great progress. Can you push your changes to the
> vectorized-read branch, please?
>
> Sent from my iPhone
>
> On Aug 8, 2019, at 11:56 AM, Anjali Norwood <anorw...@netflix.com> wrote:
>
> Good suggestion, Ryan. Added dev@iceberg now.
>
> Dev: Please see the early vectorized Iceberg performance results a couple
> of emails down. This is WIP.
>
> thanks,
> Anjali.
>
> On Thu, Aug 8, 2019 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> Hi everyone,
>>
>> Is it possible to copy the Iceberg dev list when sending these emails?
>> There are other people in the community who are interested, like Palantir.
>> If there isn't anything sensitive, then let's try to be more inclusive.
>> Thanks!
>>
>> rb
>>
>> On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood <anorw...@netflix.com>
>> wrote:
>>
>>> Hi Gautam, Padma,
>>> We wanted to update you before Gautam takes off for vacation.
>>>
>>> Samarth and I profiled the code and found the following. Profiling
>>> IcebergSourceFlatParquetDataReadBenchmark (10 files, 10M rows, a single
>>> long column) with VisualVM shows two places where CPU time can be
>>> optimized:
>>> 1) The iterator abstractions (triple iterators, page iterators, etc.)
>>> take up quite a bit of time. Dropping these iterators, or making them
>>> batched iterators and moving the reading of the data closer to the file,
>>> should help ameliorate this problem.
>>> 2) The current code goes back and forth between definition-level reads
>>> and value reads through the layers of iterators, and quite a bit of CPU
>>> time is spent there. Reading a batch of primitive values at once after
>>> consulting the definition level should improve performance.
>>>
>>> So we prototyped code that walks over the definition levels and reads
>>> the corresponding values in batches (read values until we hit a null,
>>> then read nulls until we hit values, and so on) and made the iterators
>>> batched iterators. Here are the results:
>>>
>>> Benchmark                                                               Mode  Cnt   Score   Error  Units
>>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized    ss    5  10.247 ± 0.202   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized       ss    5   3.747 ± 0.206   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIceberg                    ss    5  11.286 ± 0.457   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k      ss    5   6.088 ± 0.324   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k       ss    5   5.875 ± 0.378   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k        ss    5   6.029 ± 0.387   s/op
>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k        ss    5   6.106 ± 0.497   s/op
>>>
>>> Moreover, as I mentioned to Gautam in chat, we prototyped reading the
>>> string column as a byte array without decoding it to UTF-8 (the changes
>>> above were not in place at the time), and we saw a significant
>>> performance improvement there as well (21.18 s before vs. 13.031 s with
>>> the change). Used along with the batched iterators, these numbers should
>>> get even better.
>>>
>>> Note that we haven't tightened/profiled the new code yet (we will start
>>> on that next); we just wanted to share some early positive results.
>>>
>>> regards,
>>> Anjali.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
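To make the batched definition-level walk described in the thread concrete, here is a minimal Java sketch. It assumes the decoded definition levels and plain values are already sitting in arrays (the real prototype works against Parquet's page readers); `BatchedColumnReader` and every name inside it are illustrative stand-ins, not the prototype's actual classes.

```java
import java.util.Arrays;

// Illustrative sketch only: batches value reads by runs of equal
// definition levels instead of consulting the level once per value.
public class BatchedColumnReader {

    // Reads up to batchSize values of an optional column (max def level 1):
    // read values until a null run starts, then skip nulls until values
    // resume, and so on, bulk-copying each run in one step.
    static int readBatch(int[] defLevels, int offset, long[] plainValues,
                         int valueOffset, long[] out, boolean[] isNull,
                         int batchSize) {
        int read = 0;
        int v = valueOffset;
        while (read < batchSize && offset + read < defLevels.length) {
            int level = defLevels[offset + read];
            // Scan ahead for the end of the current run of equal levels.
            int runEnd = offset + read;
            while (runEnd < defLevels.length && defLevels[runEnd] == level
                   && (runEnd - offset) < batchSize) {
                runEnd++;
            }
            int runLen = runEnd - (offset + read);
            if (level == 1) {
                // Non-null run: copy the whole run of values at once.
                System.arraycopy(plainValues, v, out, read, runLen);
                Arrays.fill(isNull, read, read + runLen, false);
                v += runLen;
            } else {
                // Null run: no values to consume, just mark the nulls.
                Arrays.fill(isNull, read, read + runLen, true);
            }
            read += runLen;
        }
        return read;
    }

    public static void main(String[] args) {
        int[] defLevels = {1, 1, 0, 0, 1, 0, 1, 1};  // 1 = value, 0 = null
        long[] values = {10, 20, 30, 40, 50};        // non-null values only
        long[] out = new long[8];
        boolean[] isNull = new boolean[8];
        int n = readBatch(defLevels, 0, values, 0, out, isNull, 8);
        for (int i = 0; i < n; i++) {
            System.out.print(isNull[i] ? "null " : out[i] + " ");
        }
        // prints: 10 20 null null 30 null 40 50
    }
}
```

The point of the run-based loop is that the per-value branch on the definition level (the back-and-forth the profile flagged) collapses into one branch per run plus a bulk copy.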
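The UTF-8 observation can also be seen in isolation: decoding each Parquet binary into a java.lang.String pays a charset decode and a copy per value, whereas handing the raw bytes straight to the engine (for example via Spark's UTF8String.fromBytes, which wraps bytes without decoding) skips that work entirely. The toy comparison below is illustrative only; it is not the prototype code and not a rigorous JMH benchmark (the JIT may partly optimize the pass-through loop away).

```java
import java.nio.charset.StandardCharsets;

// Toy contrast of per-value UTF-8 decoding vs. passing raw bytes through.
public class Utf8DecodeSketch {
    public static void main(String[] args) {
        byte[] raw = "a typical utf8 column value".getBytes(StandardCharsets.UTF_8);
        int n = 10_000_000;

        // Decoding path: one charset decode + String allocation per value.
        long t0 = System.nanoTime();
        long decodedChars = 0;
        for (int i = 0; i < n; i++) {
            decodedChars += new String(raw, StandardCharsets.UTF_8).length();
        }
        long t1 = System.nanoTime();

        // Pass-through path: keep the raw bytes; a consumer such as Spark's
        // UTF8String.fromBytes(byte[]) can wrap them without decoding.
        long rawBytes = 0;
        for (int i = 0; i < n; i++) {
            rawBytes += raw.length;
        }
        long t2 = System.nanoTime();

        System.out.printf("decode: %d ms, pass-through: %d ms (%d, %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000,
                decodedChars, rawBytes);
    }
}
```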