[ https://issues.apache.org/jira/browse/HIVE-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated HIVE-23758: -------------------------------- Description: There was a customer case, where a bucket file was somehow placed into a partition directory, which contained another bucket file with valid acid schema (refer [^orc_dump.log] for details), and query failed while split generation with below error at [this line|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L497] {code} Caused by: java.lang.RuntimeException: ORC split generation failed with exception: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1871) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1959) at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532) at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789) at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) ... 4 more Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1865) ... 17 more Caused by: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSargColumnNames(OrcInputFormat.java:482) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:539) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:534) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.access$2900(OrcInputFormat.java:158) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1556) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2700(OrcInputFormat.java:1337) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1522) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1519) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1519) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1337) at java.util.concurrent.FutureTask.run(FutureTask.java:266) {code} haven't figured out the origin of the second, empty file, but a sanity check of length helped to skip this issue by ignoring that file while split generation, which I'm about to try out in the first version of the pull request: in tez app logs after the patch: {code} 2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: possible schema mismatch, asked for column with index:6. column but there is only 6 types defined (isOriginal: false, originalColumnNames.length: 1), cannot get sarg col names... 2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: Skipping split elimination for hdfs://ns1/warehouse/tablespace/managed/hive/bdaa28846/cda_date=20200601/cda_job_name=core_base/base_0000001/bucket_00001 as column names is null {code} was: There was a customer case, where a bucket file was somehow placed into a partition directory, which contained another bucket file with valid acid schema (refer [^orc_dump.log] for details), and query failed while split generation with below error at this line {code} Caused by: java.lang.RuntimeException: ORC split generation failed with exception: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1871) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1959) at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532) at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789) at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) ... 4 more Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1865) ... 17 more Caused by: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSargColumnNames(OrcInputFormat.java:482) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:539) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:534) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.access$2900(OrcInputFormat.java:158) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1556) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2700(OrcInputFormat.java:1337) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1522) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1519) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1519) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1337) at java.util.concurrent.FutureTask.run(FutureTask.java:266) {code} > OrcInputFormat.getSargColumnNames might be more failsafe in case of schema > mismatch > ----------------------------------------------------------------------------------- > > Key: HIVE-23758 > URL: https://issues.apache.org/jira/browse/HIVE-23758 > Project: Hive > Issue Type: Bug > Reporter: László Bodor > Assignee: László Bodor > Priority: Major > Attachments: orc_dump.log > > > There was a customer case, where a bucket file was somehow placed into a > partition directory, which contained another bucket file with valid acid > schema (refer [^orc_dump.log] for details), and query failed while split > generation with below error at [this > line|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L497] > {code} > Caused by: java.lang.RuntimeException: ORC split generation failed with > exception: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1871) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1959) > at > org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532) > at > org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789) > at > org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243) > at > org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278) > at > org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at > org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269) > at > org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253) > at > com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) > at > com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) > ... 4 more > Caused by: java.util.concurrent.ExecutionException: > java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1865) > ... 17 more > Caused by: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSargColumnNames(OrcInputFormat.java:482) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:534) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.access$2900(OrcInputFormat.java:158) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1556) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2700(OrcInputFormat.java:1337) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1522) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1519) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1519) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1337) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > {code} > haven't figured out the origin of the second, empty file, but a sanity check > of length helped to skip this issue by ignoring that file while split > generation, which I'm about to try out in the first version of the pull > request: > in tez app logs after the patch: > {code} > 2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: > possible schema mismatch, asked for column with index:6. column but there is > only 6 types defined (isOriginal: false, originalColumnNames.length: 1), > cannot get sarg col names... > 2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: > Skipping split elimination for > hdfs://ns1/warehouse/tablespace/managed/hive/bdaa28846/cda_date=20200601/cda_job_name=core_base/base_0000001/bucket_00001 > as column names is null > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)