[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koifman updated HIVE-14233:
----------------------------------
    Comment: was deleted

(was: [~saketj], I left 1 last comment on RB but it's a nit
+1 pending tests)

> Improve vectorization for ACID by eliminating row-by-row stitching
> ------------------------------------------------------------------
>
>                 Key: HIVE-14233
>                 URL: https://issues.apache.org/jira/browse/HIVE-14233
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions, Vectorization
>            Reporter: Saket Saurabh
>            Assignee: Saket Saurabh
>         Attachments: HIVE-14233.01.patch, HIVE-14233.02.patch,
> HIVE-14233.03.patch, HIVE-14233.04.patch, HIVE-14233.05.patch,
> HIVE-14233.06.patch, HIVE-14233.07.patch, HIVE-14233.08.patch,
> HIVE-14233.09.patch, HIVE-14233.10.patch, HIVE-14233.11.patch
>
>
> This JIRA proposes to improve vectorization for ACID by eliminating
> row-by-row stitching when reading back ACID files. In the current
> implementation, a vectorized row batch is created by populating the batch one
> row at a time, before the batch is passed up the operator pipeline.
> Row-by-row stitching was needed because the ACID insert/update/delete events
> from the various delta files had to be merged before the current version of
> a given row could be determined.
> HIVE-14035 removed that limitation by splitting each ACID update event into
> a delete event plus an insert event. It also made it possible to create
> splits on delta files.
> Building on HIVE-14035, this JIRA proposes to remove this bottleneck in the
> vectorized code path for ACID by reading row batches directly from the
> underlying ORC files and avoiding stitching altogether.
> Once a row batch is read from the split (which may be on a base or delta
> file), deleted rows are identified by cross-referencing the batch against a
> data structure that tracks only delete events (found in the delete_delta
> files).
> This will lead to a large performance gain when reading ACID files in a
> vectorized fashion, while enabling further optimizations to be built on top
> of it in the future.
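
The cross-referencing step described above can be illustrated with a minimal Java sketch. It is not the code from the attached patches. RowKey, Batch, applyDeletes and visiblePayloads are hypothetical names, the (original transaction, bucket, row id) key shape is an assumption, and a plain HashSet stands in for whatever structure the patch actually uses to track delete events. The sketch only shows the idea of filtering a whole batch through a selection vector instead of stitching rows one at a time.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

final class DeleteFilterSketch {

    /** Hypothetical row identifier (assumed shape: original transaction, bucket, row id). */
    static final class RowKey {
        final long originalTransaction;
        final int bucket;
        final long rowId;

        RowKey(long originalTransaction, int bucket, long rowId) {
            this.originalTransaction = originalTransaction;
            this.bucket = bucket;
            this.rowId = rowId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof RowKey)) {
                return false;
            }
            RowKey k = (RowKey) o;
            return originalTransaction == k.originalTransaction
                && bucket == k.bucket
                && rowId == k.rowId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(originalTransaction, bucket, rowId);
        }
    }

    /** Hypothetical stand-in for a vectorized row batch: one array per column. */
    static final class Batch {
        final long[] originalTransaction;
        final int[] bucket;
        final long[] rowId;
        final String[] payload;
        int[] selected; // indices of the rows that are still visible
        int size;       // number of valid entries in selected

        Batch(long[] originalTransaction, int[] bucket, long[] rowId, String[] payload) {
            this.originalTransaction = originalTransaction;
            this.bucket = bucket;
            this.rowId = rowId;
            this.payload = payload;
            this.size = payload.length;
            this.selected = new int[size];
            for (int i = 0; i < size; i++) {
                selected[i] = i; // initially every row is selected
            }
        }
    }

    /** Drops deleted rows by compacting the selection vector; column data is not copied. */
    static void applyDeletes(Batch batch, Set<RowKey> deleted) {
        int kept = 0;
        for (int i = 0; i < batch.size; i++) {
            int row = batch.selected[i];
            RowKey key = new RowKey(
                batch.originalTransaction[row], batch.bucket[row], batch.rowId[row]);
            if (!deleted.contains(key)) {
                batch.selected[kept++] = row; // kept <= i, so this never overwrites unread entries
            }
        }
        batch.size = kept;
    }

    /** Collects the payload values of the rows that survived the delete filter. */
    static List<String> visiblePayloads(Batch batch) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < batch.size; i++) {
            out.add(batch.payload[batch.selected[i]]);
        }
        return out;
    }

    public static void main(String[] args) {
        Batch batch = new Batch(
            new long[] {1, 1, 2}, new int[] {0, 0, 0}, new long[] {0, 1, 0},
            new String[] {"a", "b", "c"});
        Set<RowKey> deleted = new HashSet<>();
        deleted.add(new RowKey(1, 0, 1)); // delete event targeting the second row
        applyDeletes(batch, deleted);
        System.out.println(visiblePayloads(batch)); // prints [a, c]
    }
}
```

Marking survivors in a selection vector keeps the column arrays untouched, so the batch can move up the operator pipeline without per-row copying. A hash set is used here only for brevity; a real reader would likely need a structure that scales to large numbers of delete events.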