[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628015#comment-16628015 ]
Eugene Koifman commented on HIVE-16812:
---------------------------------------

Patch 5 should be ready for review.

VectorizedOrcAcidRowBatchReader examines the split's worth of insert events and, based on that, generates 2 sets of bounds used to filter delete events before loading them into the in-memory structure. The 1st set is the min/max ROW__ID. The 2nd is a SARG to push down to the delete_delta files. This is used by ColumnizedDeleteEventRegistry but not SortMergedDeleteEventRegistry.

A limitation is that it currently doesn't handle {{OrcSplit.isOriginal()}} files. This should be done in a followup after HIVE-17917.

[~gopalv] could you review please?

> VectorizedOrcAcidRowBatchReader doesn't filter delete events
> ------------------------------------------------------------
>
>                 Key: HIVE-16812
>                 URL: https://issues.apache.org/jira/browse/HIVE-16812
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>    Affects Versions: 2.3.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-16812.02.patch, HIVE-16812.04.patch,
> HIVE-16812.05.patch
>
>
> the c'tor of VectorizedOrcAcidRowBatchReader has
> {noformat}
>     // Clone readerOptions for deleteEvents.
>     Reader.Options deleteEventReaderOptions = readerOptions.clone();
>     // Set the range on the deleteEventReaderOptions to 0 to INTEGER_MAX because
>     // we always want to read all the delete delta files.
>     deleteEventReaderOptions.range(0, Long.MAX_VALUE);
> {noformat}
> This is suboptimal since base and deltas are sorted by ROW__ID. So for each
> split of base we can find the min/max ROW__ID and only load events from delta that
> are in the [min,max] range. This will reduce the number of delete events we load
> in memory (to no more than there are in the split).
> When we support sorting on PK, the same should apply but we'd need to make
> sure to store PKs in the ORC index.
> See {{OrcRawRecordMerger.discoverKeyBounds()}}
> {{hive.acid.key.index}} in the ORC footer has an index of ROW__IDs so we should
> know min/max easily for any file written by {{OrcRecordUpdater}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
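
For illustration only, the min/max ROW__ID filtering described in this issue could be sketched roughly as below. This is not the code from the attached patches. The `RowId` class, the `filterDeleteEvents` helper, and the lexicographic (writeId, bucket, rowId) ordering are all assumptions made for the example, and the SARG pushdown to the delete_delta files is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DeleteEventFilterSketch {

    /** Hypothetical stand-in for a ROW__ID; assumed order is (writeId, bucket, rowId). */
    public static final class RowId implements Comparable<RowId> {
        public final long writeId;
        public final int bucket;
        public final long rowId;

        public RowId(long writeId, int bucket, long rowId) {
            this.writeId = writeId;
            this.bucket = bucket;
            this.rowId = rowId;
        }

        @Override
        public int compareTo(RowId other) {
            int c = Long.compare(writeId, other.writeId);
            if (c != 0) {
                return c;
            }
            c = Integer.compare(bucket, other.bucket);
            if (c != 0) {
                return c;
            }
            return Long.compare(rowId, other.rowId);
        }
    }

    /**
     * Keeps only the delete events whose key lies in the inclusive range
     * [min, max] computed from the insert events of the current split.
     * Hypothetical helper, not a Hive API.
     */
    public static List<RowId> filterDeleteEvents(List<RowId> deleteEvents, RowId min, RowId max) {
        List<RowId> kept = new ArrayList<>();
        for (RowId event : deleteEvents) {
            if (event.compareTo(min) >= 0 && event.compareTo(max) <= 0) {
                kept.add(event);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Delete events as they might appear across delete_delta files.
        List<RowId> deleteEvents = Arrays.asList(
            new RowId(1, 0, 5),
            new RowId(2, 0, 0),
            new RowId(2, 0, 7),
            new RowId(3, 0, 1));

        // Bounds of the split: only rows written by writeId 2 in bucket 0.
        RowId min = new RowId(2, 0, 0);
        RowId max = new RowId(2, 0, 9);

        List<RowId> kept = filterDeleteEvents(deleteEvents, min, max);
        System.out.println("delete events kept: " + kept.size());
    }
}
```

With bounds like these, only delete events that can match a row in the split are held in memory, so the number loaded is bounded by the size of the split instead of by the total size of the delete deltas.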