[ https://issues.apache.org/jira/browse/HIVE-26584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612407#comment-17612407 ]
John Sherman edited comment on HIVE-26584 at 10/3/22 6:19 PM:
--------------------------------------------------------------

After digging in deeper: you are correct, it is not a concurrency issue. Running the tests concurrently just happened to be the easiest way to reproduce it, and I mistakenly thought that was the root cause (before we had the containerized ptest framework, test conflicts were fairly common, if I recall correctly).

Here is what I think is happening:

1. During PR testing, the TestMiniLlapLocalCliDriver tests get split into 32 different splits:
[https://github.com/apache/hive/blob/master/itests/bin/generate-cli-splits.sh]
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestMiniLlapLocalCliDriver.java#L39]
(It code-generates 32 new TestMiniLlapLocalCliDriver classes, each with split0 - split32 in the package name.)

2. Test assignment for each split is handled via runtime introspection of the class name:
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestMiniLlapLocalCliDriver.java#L43]
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/itests/util/src/main/java/org/apache/hadoop/hive/cli/control/SplitSupport.java#L46]
In my PR's case, empty_skip_header_footer_aggr.q gets assigned to split-7:
{code:java}
<testcase name="testCliDriver[empty_skip_header_footer_aggr]" classname="org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver" time="2.534"/>
{code}
and compressed_skip_header_footer_aggr.q gets assigned to split-4:
{code:java}
<testcase name="testCliDriver[compressed_skip_header_footer_aggr]" classname="org.apache.hadoop.hive.cli.split4.TestMiniLlapLocalCliDriver" time="7.242">
{code}

3. All test splits are distributed across 20 executors (I'm not sure where this lives, maybe in the Jenkins scripts). split-7 and split-4 get assigned to the same "execution split", number 14:
{code:java}
split-14/itests/qtest/target/surefire-reports/TEST-org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver.xml
144: <testcase name="testCliDriver[empty_skip_header_footer_aggr]" classname="org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver" time="2.534"/>

split-14/itests/qtest/target/surefire-reports/TEST-org.apache.hadoop.hive.cli.split4.TestMiniLlapLocalCliDriver.xml
165: <testcase name="testCliDriver[compressed_skip_header_footer_aggr]" classname="org.apache.hadoop.hive.cli.split4.TestMiniLlapLocalCliDriver" time="7.242">
{code}

4. empty_skip_header_footer_aggr gets executed before compressed_skip_header_footer_aggr (this can be seen above: line 144 comes before line 165 in the test XML).

5. Both empty_skip_header_footer_aggr and compressed_skip_header_footer_aggr create external tables whose data is copied to the same location(s). For example, both tests use these locations:
${system:test.tmp.dir}/testcase1
${system:test.tmp.dir}/testcase2
Since each test invocation ends up using the same paths and the tmp directory is not cleaned between tests, this is where the conflict occurs.

6. empty_skip_header_footer_aggr includes rmr commands to clean up the testcase1 and testcase2 directories:
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/ql/src/test/queries/clientpositive/empty_skip_header_footer_aggr.q#L6]
compressed_skip_header_footer_aggr.q does not:
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/ql/src/test/queries/clientpositive/compressed_skip_header_footer_aggr.q#L1]

This also likely explains why it is not reproducible via:
{code:java}
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=compressed_skip_header_footer_aggr.q,empty_skip_header_footer_aggr.q
{code}
I think that when the tests are executed this way, the order is always compressed_skip_header_footer_aggr.q followed by empty_skip_header_footer_aggr.q.

My fix works because I give each test's external data files a unique location. I'll likely modify empty_skip_header_footer_aggr.q to remove the rmr's (the only thing they really do is hide this problem) and give all the files/directories unique names.

We could probably add a "unique external directory" variable that is generated per test case and cleaned up after each one (or some other solution), but I think that is out of the scope of this ticket.
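Steps 1 and 2 rely on each generated driver class working out its own split number from its package name. The following is a minimal Java sketch of that idea under stated assumptions. It is not Hive's actual SplitSupport code. The class name `SplitSketch`, the `.splitN.` regex, and the round-robin assignment over a sorted file list are all hypothetical.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT Hive's SplitSupport: derive a split index from the
// package of a generated class and pick this split's share of the q files.
class SplitSketch {
  // Matches a ".splitN." package segment, e.g. "...cli.split7.TestMiniLlapLocalCliDriver".
  private static final Pattern SPLIT_SEGMENT = Pattern.compile("\\.split(\\d+)\\.");

  // Returns N from the ".splitN." segment of a fully qualified class name.
  static int splitIndexOf(String className) {
    Matcher m = SPLIT_SEGMENT.matcher(className);
    if (!m.find()) {
      throw new IllegalArgumentException("no .splitN. segment in " + className);
    }
    return Integer.parseInt(m.group(1));
  }

  // Assumed round-robin assignment: the i-th q file in sorted order goes to split i % nSplits.
  static List<String> filesForSplit(List<String> qFiles, int splitIndex, int nSplits) {
    List<String> sorted = new ArrayList<>(qFiles);
    Collections.sort(sorted);
    List<String> assigned = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i++) {
      if (i % nSplits == splitIndex) {
        assigned.add(sorted.get(i));
      }
    }
    return assigned;
  }

  public static void main(String[] args) {
    // At runtime the name could come from the generated test class itself.
    int split = splitIndexOf("org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver");
    List<String> toyFiles = Arrays.asList("d.q", "a.q", "c.q", "b.q"); // toy file names
    System.out.println(split);                         // 7
    System.out.println(filesForSplit(toyFiles, 1, 2)); // [b.q, d.q]
  }
}
{code}

Under this hypothetical scheme, the split a q file lands in depends only on the sorted file list and the split count. Adding or removing an unrelated q file in a PR could therefore move two conflicting tests into splits that happen to share an executor.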
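The order dependence in steps 4 to 6, and the reason per-test locations fix it, can be illustrated with a small self-contained Java simulation. This is a simplified model, not the real q files. Each hypothetical "test" writes one data file into a `testcase1` directory under a temporary stand-in for test.tmp.dir. Only the "empty" test deletes the directory first, mirroring its rmr commands. The number of files the "compressed" test sees stands in for the rows its external table would read. The class `SharedDirSimulation` and the file names are hypothetical.

{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Simplified, hypothetical model: one shared vs. one per-test "testcase1" directory.
class SharedDirSimulation {

  // Deletes a directory tree, children before parents (standing in for rmr).
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> walk = Files.walk(dir)) {
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }

  // Hypothetical "empty" test: clean its directory first, then copy in its data file.
  static void runEmptyTest(Path dir) throws IOException {
    deleteRecursively(dir);
    Files.createDirectories(dir);
    Files.write(dir.resolve("empty_data.txt"), new byte[0]);
  }

  // Hypothetical "compressed" test: no cleanup; copy in its data file and
  // return how many files its external table location now contains.
  static long runCompressedTest(Path dir) throws IOException {
    Files.createDirectories(dir);
    Files.write(dir.resolve("compressed_data.txt"), "toy\n".getBytes(StandardCharsets.UTF_8));
    try (Stream<Path> files = Files.list(dir)) {
      return files.count();
    }
  }

  // Runs both tests in the given order against a fresh temp dir and reports
  // how many files the compressed test saw.
  static long filesSeenByCompressed(boolean emptyRunsFirst, boolean uniquePerTest) {
    try {
      Path tmp = Files.createTempDirectory("test-tmp-dir");
      try {
        Path shared = tmp.resolve("testcase1");
        Path emptyDir = uniquePerTest
            ? tmp.resolve("empty_skip_header_footer_aggr").resolve("testcase1") : shared;
        Path compressedDir = uniquePerTest
            ? tmp.resolve("compressed_skip_header_footer_aggr").resolve("testcase1") : shared;
        if (emptyRunsFirst) {
          runEmptyTest(emptyDir);
          return runCompressedTest(compressedDir);
        }
        long seen = runCompressedTest(compressedDir);
        runEmptyTest(emptyDir);
        return seen;
      } finally {
        deleteRecursively(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("shared,   empty first:      " + filesSeenByCompressed(true, false));  // 2
    System.out.println("shared,   compressed first: " + filesSeenByCompressed(false, false)); // 1
    System.out.println("per-test, empty first:      " + filesSeenByCompressed(true, true));   // 1
    System.out.println("per-test, compressed first: " + filesSeenByCompressed(false, true));  // 1
  }
}
{code}

With the shared layout, the compressed test sees an extra file only when the empty test ran first, which matches the failing execution order. With a per-test directory it sees exactly its own file in either order, which is the property the fix relies on.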
> compressed_skip_header_footer_aggr.q is flaky
> ---------------------------------------------
>
>                 Key: HIVE-26584
>                 URL: https://issues.apache.org/jira/browse/HIVE-26584
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: John Sherman
>            Assignee: John Sherman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In one of my PRs, compressed_skip_header_footer_aggr.q was failing with an
> unexpected diff, such as:
> {code:java}
> TestMiniLlapLocalCliDriver.testCliDriver:62 Client Execution succeeded but
> contained differences (error code = 1) after executing
> compressed_skip_header_footer_aggr.q
> 69,71c69,70
> < 1 2019-12-31
> < 2 2018-12-31
> < 3 2017-12-31
> ---
> > 2 2019-12-31
> > 3 2019-12-31
> 89d87
> < NULL NULL
> 91c89
> < 2 2018-12-31
> ---
> > 2 2019-12-31
> 100c98
> < 1
> ---
> > 2
> 109c107
> < 1 2019-12-31
> ---
> > 2 2019-12-31
> 127,128c125,126
> < 1 2019-12-31
> < 3 2017-12-31
> ---
> > 2 2019-12-31
> > 3 2019-12-31
> 146a145
> > 2 2019-12-31
> 155c154
> < 1
> ---
> > 2
> {code}
> While investigating, I found that it did not seem to fail when executed
> locally. Since I suspected test interference, I searched for the table
> names/directories used and discovered empty_skip_header_footer_aggr.q, which
> uses the same table names AND external directories.
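The investigation described in the issue, searching the q files for shared directories, can be partly automated. The sketch below is a hypothetical helper, not an existing Hive utility. It takes q file contents keyed by file name (in practice these could be read from ql/src/test/queries/clientpositive/*.q), extracts the first path segment after `${system:test.tmp.dir}/`, and reports any segment referenced by more than one file. It does not detect table-name collisions. The regex is an assumption about how paths appear in q files, and `SharedTmpDirFinder` and `findShared` are made-up names.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: find top-level test.tmp.dir directories shared by 2+ q files.
class SharedTmpDirFinder {
  // Captures the first path segment after "${system:test.tmp.dir}/", stopping at
  // whitespace, quotes, ';' or the next '/'. The pattern is an assumption.
  private static final Pattern TMP_DIR_REF =
      Pattern.compile("\\$\\{system:test\\.tmp\\.dir\\}/([^\\s'\";/]+)");

  // Maps directory name -> q files that reference it, keeping only shared names.
  static Map<String, Set<String>> findShared(Map<String, String> qFileContents) {
    Map<String, Set<String>> users = new TreeMap<>();
    for (Map.Entry<String, String> e : qFileContents.entrySet()) {
      Matcher m = TMP_DIR_REF.matcher(e.getValue());
      while (m.find()) {
        users.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(e.getKey());
      }
    }
    // Keep only directories referenced by more than one q file.
    users.values().removeIf(files -> files.size() < 2);
    return users;
  }

  public static void main(String[] args) {
    // Toy inputs, not the real q files.
    Map<String, String> toy = new HashMap<>();
    toy.put("a.q", "dfs -mkdir ${system:test.tmp.dir}/testcase1;");
    toy.put("b.q", "LOCATION '${system:test.tmp.dir}/testcase1/data'");
    toy.put("c.q", "LOCATION '${system:test.tmp.dir}/other'");
    System.out.println(findShared(toy)); // {testcase1=[a.q, b.q]}
  }
}
{code}

Only the first segment is compared because directory-level sharing, as with testcase1 and testcase2 in this ticket, is enough for one test's files to leak into another test's external table.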