[ https://issues.apache.org/jira/browse/HIVE-13369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15360131#comment-15360131 ]
Hive QA commented on HIVE-13369:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12815793/HIVE-13369.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 20 failed/errored test(s), 10287 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_12
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver_vector_complex_all
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver_vector_complex_join
org.apache.hadoop.hive.ql.TestTxnCommands2.testNonAcidToAcidConversion3
org.apache.hadoop.hive.ql.io.TestAcidUtils.testBaseDeltas
org.apache.hadoop.hive.ql.io.TestAcidUtils.testOriginalDeltas
org.apache.hadoop.hive.ql.io.TestAcidUtils.testOverlapingDelta
org.apache.hadoop.hive.ql.io.TestAcidUtils.testOverlapingDelta2
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner.cleanupAfterMajorPartitionCompaction
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner.cleanupAfterMajorPartitionCompactionNoBase
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner.cleanupAfterMajorTableCompaction
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner.cleanupAfterMinorPartitionCompaction
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner.cleanupAfterMinorTableCompaction
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner2.cleanupAfterMajorPartitionCompaction
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner2.cleanupAfterMajorPartitionCompactionNoBase
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner2.cleanupAfterMajorTableCompaction
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner2.cleanupAfterMinorPartitionCompaction
org.apache.hadoop.hive.ql.txn.compactor.TestCleaner2.cleanupAfterMinorTableCompaction
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/347/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/347/console
Test logs: http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-347/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 20 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12815793 - PreCommit-HIVE-MASTER-Build

> AcidUtils.getAcidState() is not paying attention to ValidTxnList when choosing
> the "best" base file
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-13369
>                 URL: https://issues.apache.org/jira/browse/HIVE-13369
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 1.0.0
>            Reporter: Eugene Koifman
>            Assignee: Wei Zheng
>            Priority: Blocker
>        Attachments: HIVE-13369.1.patch, HIVE-13369.2.patch
>
> The JavaDoc on getAcidState() reads, in part: "Note that because major
> compactions don't preserve the history, we can't use a base directory that
> includes a transaction id that we must exclude." This is correct, but there
> is nothing in the code that actually enforces it.
>
> Also, if we detect a situation where txn X must be excluded but there are
> deltas that contain X, we'll have to abort the txn. This can't (reasonably)
> happen in auto-commit mode, but with multi-statement txns it's possible.
> Suppose some long-running txn starts and locks in a snapshot at 17 (HWM). An
> hour later it tries to access some partition for which all txns < 20 (for
> example) have already been compacted (i.e. GC'd).
> ==========================================================
> Here is a more concrete example.
Let's say the files for table A are as
> follows, created in the order listed:
> delta_4_4
> delta_5_5
> delta_4_5
> base_5
> delta_16_16
> delta_17_17
> base_17 (for example, the user ran a major compaction)
>
> Let's say getAcidState() is called with ValidTxnList(20:16), i.e. with HWM=20
> and ExceptionList=<16>. Assume that all txns <= 20 commit.
>
> The reader can't use base_17 because it includes the result of txn 16, so it
> should choose base_5 as "TxnBase bestBase" in _getChildState()_. Then the
> rest of the logic in _getAcidState()_ should choose delta_16_16 and
> delta_17_17 for the _Directory_ object. This represents an acceptable
> snapshot for such a reader.
>
> The issue arises if the Cleaner process is running at the same time. It will
> see everything with txnid < 17 as obsolete, then check the lock manager state
> and decide to delete (as there may not be any locks in the LM for table A).
> The order in which the files are deleted is currently undefined: it may
> delete delta_16_16 and delta_17_17 first, and right at that moment a read
> request with ValidTxnList(20:16) arrives (such a snapshot may have been
> locked in by some multi-stmt txn that started some time ago, or (at least in
> theory) by an auto-commit one due to a long GC pause, for example). It
> acquires locks after the Cleaner checks the LM state and calls
> getAcidState(). This request will choose base_5 but won't see delta_16_16
> and delta_17_17, and thus will return a snapshot without the modifications
> made by those txns. This is a subtle race condition, but it is possible.
>
> 1. The safest thing to do to ensure correctness is to use the latest base_x
> as the "best" one, check x against the exceptions in ValidTxnList, and throw
> an exception if there is an excluded txn <= x.
> 2. A better option is to keep 2 exception lists, aborted and open, and only
> throw if there is an open txn <= x. Compaction throws away data from aborted
> txns, so there is no harm in using a base with aborted txns in its range.
> 3.
You could make each txn record the lowest open txn id at its start and
> prevent the Cleaner from cleaning any delta whose id range includes that
> open txn id, for any txn that is still running. The drawback is that this
> can delay GC of old files for arbitrarily long periods, so it should be a
> user config choice. The implementation is not trivial.
>
> I would go with 1 now and do 2/3 together with the multi-statement txn work.
>
> Side note: if 2 deltas have overlapping ID ranges, then one must be a subset
> of the other.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
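The check proposed in option 1 can be sketched in a few lines. This is a minimal illustration, not Hive's actual AcidUtils implementation: the class and method names are hypothetical, and the ValidTxnList wire format is simplified here to "HWM:excludedTxn:..." (matching the "20:16" shorthand used in the example above).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of option 1: pick the latest base_x, but refuse it if
// an excluded txn id falls at or below x, since major compaction folds all
// txns <= x into the base and the history needed to exclude them is gone.
public class BestBaseSketch {

    // Parse the excluded txn ids out of a simplified "HWM:exc1:exc2" string.
    static long[] exceptions(String validTxnList) {
        String[] parts = validTxnList.split(":");
        long[] exc = new long[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            exc[i - 1] = Long.parseLong(parts[i]);
        }
        return exc;
    }

    // base_x is usable only if no excluded txn id is <= x.
    static boolean isValidBase(long baseTxnId, String validTxnList) {
        for (long exc : exceptions(validTxnList)) {
            if (exc <= baseTxnId) {
                return false;
            }
        }
        return true;
    }

    // Option 1: take the latest base and fail fast if it covers an excluded txn.
    static long chooseBestBase(List<Long> baseTxnIds, String validTxnList) {
        long latest = Collections.max(baseTxnIds);
        if (!isValidBase(latest, validTxnList)) {
            throw new IllegalStateException(
                "base_" + latest + " covers an excluded txn from ValidTxnList " + validTxnList);
        }
        return latest;
    }

    public static void main(String[] args) {
        String validTxns = "20:16"; // HWM=20, ExceptionList=<16>, as in the example
        System.out.println(isValidBase(17, validTxns)); // false: base_17 contains txn 16
        System.out.println(isValidBase(5, validTxns));  // true: base_5 predates txn 16
        try {
            chooseBestBase(Arrays.asList(5L, 17L), validTxns);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that throwing (rather than falling back to base_5) is deliberately the conservative choice: as the race above shows, an older base is only safe if the deltas covering the gap are guaranteed to still exist, which the Cleaner does not currently guarantee.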