[ https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-25258: ---------------------------------- Labels: pull-request-available (was: ) > Incorrect row order after query-based MINOR compaction > ------------------------------------------------------ > > Key: HIVE-25258 > URL: https://issues.apache.org/jira/browse/HIVE-25258 > Project: Hive > Issue Type: Bug > Components: Transactions > Reporter: Marta Kuczora > Assignee: Marta Kuczora > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The query based MINOR compaction uses the following sorting order in its > inner query: `bucket`, `originalTransaction`, `rowId`, as it can be seen in > the > [code|https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactionQueryBuilder.java#L474-L476]. > But actually the rows should be ordered by originalTransactionId, > bucketProperty and rowId, otherwise the delete deltas cannot be applied > correctly. And this is the order what the MR MAJOR and MR MINOR compactions > write. > The sorting order used by the query-based MINOR compaction can lead to > duplicated rows when running the compaction after multiple merge statements. > This issue can be reproduced for example by running the following queries: > {noformat} > CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES > ('transactional'='true'); > INSERT INTO transactions VALUES > (1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5, > 'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08'); > CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC; > INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4, > 'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11, > 'value_11'),(12, 'value_12'); > MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID > WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET > value = S.value > WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value); > CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC; > INSERT INTO merge_source_2 VALUES > (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10, > 'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14'); > MERGE INTO transactions AS T > USING merge_source_2 AS S > ON T.ID = S.ID > WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET > value = S.value > WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value); > ALTER TABLE transactions COMPACT 'MINOR'; > CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC; > INSERT INTO merge_source_3 VALUES > (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9, > 'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15, > 'value_15'); > MERGE INTO transactions AS T > USING merge_source_3 AS S > ON T.ID = S.ID > WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET > value = S.value > WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value); > ALTER TABLE transactions COMPACT 'MINOR'; > {noformat} > Running a select after the second compaction finished will return duplicated > rows: > {noformat} > select * from transactions order by id; > +------------------+---------------------+ > | transactions.id | transactions.value | > +------------------+---------------------+ > | 1 | newvalue_1 | > | 1 | latestvalue_1 | > | 2 | newestvalue_2 | > | 2 | newvalue_2 | > | 3 | value_03 | > | 4 | latestvalue_4 | > | 4 | newvalue_4 | > | 5 | latestvalue_5 | > | 6 | newvalue_6 | > | 6 | newestvalue_6 | > | 7 | value_07 | > | 8 | value_08 | > | 9 | latestvalue_9 | > | 10 | newestvalue_10 | > | 11 | latestvalue_11 | > | 12 | value_12 | > | 13 | latestvalue_13 | > | 14 | value_14 | > | 15 | value_15 | > +------------------+---------------------+ > {noformat} > If the same queries are run with MR MINOR compaction, instead of the > query-based MINOR compaction, the select will return the correct result: > {noformat} > +------------------+---------------------+ > | transactions.id | transactions.value | > +------------------+---------------------+ > | 1 | latestvalue_1 | > | 2 | newestvalue_2 | > | 3 | value_03 | > | 4 | latestvalue_4 | > | 5 | latestvalue_5 | > | 6 | newestvalue_6 | > | 7 | value_07 | > | 8 | value_08 | > | 9 | latestvalue_9 | > | 10 | newestvalue_10 | > | 11 | latestvalue_11 | > | 12 | value_12 | > | 13 | latestvalue_13 | > | 14 | value_14 | > | 15 | value_15 | > +------------------+---------------------+ > {noformat} > The content of the bucket files in the delta and delete delta directories > after the query-based and MR compactions look like this. > Query-based > {noformat} > Processing data file > tmp/transactions/delta_0000001_0000004_v0000429/bucket_00000 [length: 947] > {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":1,"row":{"id":1,"value":"value_01"}} > {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":1,"row":{"id":2,"value":"value_02"}} > {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":2,"currenttransaction":1,"row":{"id":3,"value":"value_03"}} > {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":1,"row":{"id":4,"value":"value_04"}} > {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":1,"row":{"id":5,"value":"value_05"}} > {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":1,"row":{"id":6,"value":"value_06"}} > {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":6,"currenttransaction":1,"row":{"id":7,"value":"value_07"}} > {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":7,"currenttransaction":1,"row":{"id":8,"value":"value_08"}} > {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":{"id":9,"value":"value_9"}} > {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":{"id":10,"value":"value_10"}} > {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":2,"row":{"id":11,"value":"value_11"}} > {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":{"id":12,"value":"value_12"}} > {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":3,"row":{"id":13,"value":"value_13"}} > {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":{"id":14,"value":"value_14"}} > {"operation":0,"originaltransaction":4,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":{"id":15,"value":"value_15"}} > {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":2,"row":{"id":1,"value":"newvalue_1"}} > {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":2,"row":{"id":2,"value":"newvalue_2"}} > {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":2,"row":{"id":4,"value":"newvalue_4"}} > {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":2,"row":{"id":6,"value":"newvalue_6"}} > {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":0,"currenttransaction":3,"row":{"id":10,"value":"newestvalue_10"}} > {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":{"id":11,"value":"newestvalue_11"}} > {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":{"id":2,"value":"newestvalue_2"}} > {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":{"id":4,"value":"newestvalue_4"}} > {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":4,"currenttransaction":3,"row":{"id":6,"value":"newestvalue_6"}} > {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":{"id":5,"value":"latestvalue_5"}} > {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":{"id":9,"value":"latestvalue_9"}} > {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":2,"currenttransaction":4,"row":{"id":1,"value":"latestvalue_1"}} > {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":{"id":13,"value":"latestvalue_13"}} > {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":4,"currenttransaction":4,"row":{"id":11,"value":"latestvalue_11"}} > {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":5,"currenttransaction":4,"row":{"id":4,"value":"latestvalue_4"}} > Processing data file > tmp/transactions/delete_delta_0000001_0000004_v0000429/bucket_00000 [length: > 713] > {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":null} > {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":null} > {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":null} > {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":4,"row":null} > {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":2,"row":null} > {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null} > {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":null} > {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":3,"row":null} > {"operation":2,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null} > {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":null} > {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":null} > {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":null} > {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":null} > {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":null} > {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":null} > _____________________________________________________________________________________________________________________ > {noformat} > MR > {noformat} > Processing data file > tmp/transactions/delta_0000001_0000004_v0000479/bucket_00000 [length: 1002] > {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,"row":{"id":1,"value":"value_01"}} > {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":1,"row":{"id":2,"value":"value_02"}} > {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":2,"currentTransaction":1,"row":{"id":3,"value":"value_03"}} > {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":1,"row":{"id":4,"value":"value_04"}} > {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":1,"row":{"id":5,"value":"value_05"}} > {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":1,"row":{"id":6,"value":"value_06"}} > {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":6,"currentTransaction":1,"row":{"id":7,"value":"value_07"}} > {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":7,"currentTransaction":1,"row":{"id":8,"value":"value_08"}} > {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":{"id":9,"value":"value_9"}} > {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":{"id":10,"value":"value_10"}} > {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":2,"row":{"id":11,"value":"value_11"}} > {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":{"id":12,"value":"value_12"}} > {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":2,"row":{"id":1,"value":"newvalue_1"}} > {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":2,"row":{"id":2,"value":"newvalue_2"}} > {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":2,"row":{"id":4,"value":"newvalue_4"}} > {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":2,"row":{"id":6,"value":"newvalue_6"}} > {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"id":13,"value":"value_13"}} > {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":{"id":14,"value":"value_14"}} > {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":0,"currentTransaction":3,"row":{"id":10,"value":"newestvalue_10"}} > {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":{"id":11,"value":"newestvalue_11"}} > {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":{"id":2,"value":"newestvalue_2"}} > {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":{"id":4,"value":"newestvalue_4"}} > {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":4,"currentTransaction":3,"row":{"id":6,"value":"newestvalue_6"}} > {"operation":0,"originalTransaction":4,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":{"id":15,"value":"value_15"}} > {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":{"id":5,"value":"latestvalue_5"}} > {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":{"id":9,"value":"latestvalue_9"}} > {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":2,"currentTransaction":4,"row":{"id":1,"value":"latestvalue_1"}} > {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":{"id":13,"value":"latestvalue_13"}} > {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":4,"currentTransaction":4,"row":{"id":11,"value":"latestvalue_11"}} > {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":5,"currentTransaction":4,"row":{"id":4,"value":"latestvalue_4"}} > _____________________________________________________________________________________________________________________ > Processing data file > tmp/transactions/delete_delta_0000001_0000004_v0000479/bucket_00000 [length: > 632] > {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":null} > {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":null} > {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":null} > {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":4,"row":null} > {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":2,"row":null} > {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null} > {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null} > {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":3,"row":null} > {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":null} > {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":null} > {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":null} > {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":null} > {"operation":2,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null} > {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":null} > {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":null} > _____________________________________________________________________________________________________________________ > {noformat} > It can be seen that when the query-based MINOR compaction was used, the > sorting order of the rows is "bucket, originalTransactionId, rowId". But > when the MR MINOR compaction was used, the order is "originalTransactionId, > bucket, rowId". > The ordering in the query-based compaction has to be fixed to be aligned with > the MR compaction's ordering. -- This message was sent by Atlassian Jira (v8.3.4#803005)