[ https://issues.apache.org/jira/browse/HIVE-26147?focusedWorklogId=758222&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-758222 ]
ASF GitHub Bot logged work on HIVE-26147:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Apr/22 02:39
            Start Date: 19/Apr/22 02:39
    Worklog Time Spent: 10m
      Work Description: amansinha100 commented on code in PR #3219:
URL: https://github.com/apache/hive/pull/3219#discussion_r852545241


##########
ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java:
##########
@@ -763,15 +763,20 @@ private KeyInterval discoverOriginalKeyBounds(Reader reader, int bucket,
   }

   /**
-   * Find the key range for the split (of the base). These are used to filter delta files since
-   * both are sorted by key.
+   * Find the key range for the split (of the base) based on the 'hive.acid.key.index' metadata.
+   * These keys are used to filter delta files since both are sorted by key.
+   * If 'hive.acid.key.index' is missing from the ORC file, return null keys (which forces a full read).
    * @param reader the reader
    * @param options the options for reading with
-   * @throws IOException
    */
-  private KeyInterval discoverKeyBounds(Reader reader,
-      Reader.Options options) throws IOException {
-    RecordIdentifier[] keyIndex = OrcRecordUpdater.parseKeyIndex(reader);
+  private KeyInterval discoverKeyBounds(Reader reader, Reader.Options options) {
+    final RecordIdentifier[] keyIndex = OrcRecordUpdater.parseKeyIndex(reader);
+    if (keyIndex == null) {
+      LOG.warn("Missing '{}' metadata in ORC acid file, can't compute min/max keys",

Review Comment:
   nit: Instead of "ORC acid file", suggest just using "ORC file", since the table has the acid property, not the file.

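For readers following the thread, the guard under review can be sketched in isolation. This is a simplified stand-in, not the code in the PR: `KeyBoundsSketch`, `Interval`, and the use of `Long` keys in place of `RecordIdentifier` are assumptions made for illustration, and the stripe arithmetic is copied from the snippet quoted in the issue description further down.

```java
/**
 * Simplified, hypothetical model of the min/max key selection.
 * Long stands in for RecordIdentifier; none of these names exist in Hive.
 */
final class KeyBoundsSketch {

    /** Hypothetical holder: a null bound means "unbounded on that side". */
    static final class Interval {
        final Long minKey;
        final Long maxKey;

        Interval(Long minKey, Long maxKey) {
            this.minKey = minKey;
            this.maxKey = maxKey;
        }
    }

    /**
     * @param keyIndex    last key of each stripe, or null if the metadata is missing
     * @param firstStripe index of the first stripe covered by the split
     * @param stripeCount number of stripes covered by the split
     * @param isTail      true if the split includes the last stripe of the file
     */
    static Interval discoverKeyBounds(Long[] keyIndex, int firstStripe,
                                      int stripeCount, boolean isTail) {
        if (keyIndex == null) {
            // Metadata missing: leave both bounds null so nothing is filtered out.
            return new Interval(null, null);
        }
        Long minKey = null;
        Long maxKey = null;
        if (firstStripe != 0) {
            // Lower bound is the last key of the stripe preceding the split.
            minKey = keyIndex[firstStripe - 1];
        }
        if (!isTail) {
            // Upper bound is the last key of the last stripe in the split.
            maxKey = keyIndex[firstStripe + stripeCount - 1];
        }
        return new Interval(minKey, maxKey);
    }
}
```

Without the null check, either array access would throw a NullPointerException for a split that does not cover the whole file. A split starting at stripe 0 with `isTail` set never touches `keyIndex`, which fits the report that single-stripe files did not fail.
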
Issue Time Tracking
-------------------

    Worklog Id:     (was: 758222)
    Time Spent: 20m  (was: 10m)

> OrcRawRecordMerger throws NPE when hive.acid.key.index is missing for an acid file
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-26147
>                 URL: https://issues.apache.org/jira/browse/HIVE-26147
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC, Transactions
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When _hive.acid.key.index_ is missing for an acid ORC file, _OrcRawRecordMerger_ throws as follows:
> {noformat}
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.discoverKeyBounds(OrcRawRecordMerger.java:795) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:1053) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:2096) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1991) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:769) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:335) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:529) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.Driver.getFetchingTableResults(Driver.java:719) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:671) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:233) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:489) ~[hive-service-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
>     ... 24 more
> {noformat}
> For the NPE to occur, the ORC file must have more than one stripe, and the
> split being read must either start beyond the first stripe or end before the
> last one, as the code shows:
> {code:java}
> if (firstStripe != 0) {
>   minKey = keyIndex[firstStripe - 1];
> }
> if (!isTail) {
>   maxKey = keyIndex[firstStripe + stripeCount - 1];
> }
> {code}
> However, when the issue was originally detected, the NPE was triggered even
> by a simple "select *" over a table with ORC files missing the
> _hive.acid.key.index_ metadata, although it never failed for ORC files with a
> single stripe. The file was generated after a major compaction of acid and
> non-acid data.
> If "select *" does not trigger the NPE, either filter on the values of the
> row returned by "select * from $table limit 1", or try different values until
> the read hits the situation described above, using a filter like this:
> {code:sql}
> select * from $table where c = $value
> {code}
> _OrcRawRecordMerger_ should simply leave the min and max keys as "null" when
> the _hive.acid.key.index_ metadata is missing.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)