[ https://issues.apache.org/jira/browse/HIVE-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koifman updated HIVE-16258: ---------------------------------- Issue Type: New Feature (was: Improvement) > Suggestion: simplify type 2 SCDs with this non-standard extension to MERGE > -------------------------------------------------------------------------- > > Key: HIVE-16258 > URL: https://issues.apache.org/jira/browse/HIVE-16258 > Project: Hive > Issue Type: New Feature > Components: Transactions > Affects Versions: 2.2.0 > Reporter: Carter Shanklin > Priority: Major > > Some common data maintenance strategies, especially the Type 2 SCD update, > would become substantially easier with a small extension to the SQL standard > for MERGE, specifically the ability to say "when matched then insert". Per > the standard, matched records can only be updated or deleted. > In the Type 2 SCD, when a new record comes in you update the old version of > the record and insert the new version of the same record. If this extension > were supported, sample Type 2 SCD code would look as follows: > {code} > merge into customer > using new_customer_stage stage > on stage.source_pk = customer.source_pk > when not matched then insert values /* Insert a net new record */ > (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, > true, null) > when matched then update set /* Update an old record to mark it as > out-of-date */ > is_current = false, end_date = current_date() > when matched then insert values /* Insert a new current record */ > (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, > true, null); > {code} > Without this support, the user needs to devise some sort of workaround. A > common approach is to first left join the staging table against the table to > be updated, then to join these results to a helper table that will spit out > two records for each match and one record for each miss. One of the matching > records needs to have a join key that can never occur in the source data so > this requires precise knowledge of the source dataset. > An example of this: > {code} > merge into customer > using ( > select > *, > coalesce(invalid_key, source_pk) as join_key > from ( > select > stage.source_pk, stage.name, stage.state, > case when customer.source_pk is null then 1 > when stage.name <> customer.name or stage.state <> customer.state then 2 > else 0 end as scd_row_type > from > new_customer_stage stage > left join > customer > on (stage.source_pk = customer.source_pk and customer.is_current = true) > ) updates > join scd_types on scd_types.type = scd_row_type > ) sub > on sub.join_key = customer.source_pk > when matched then update set > is_current = false, > end_date = current_date() > when not matched then insert values > (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true, > null); > select * from customer order by source_pk; > {code} > This code is very complicated and will fail if the "invalid" key ever shows > up in the source dataset. This simple extension provides a lot of value and > likely very little maintenance overhead. > /cc [~ekoifman] -- This message was sent by Atlassian JIRA (v7.6.3#76005)