Carter Shanklin created HIVE-16258:
--------------------------------------
Summary: Suggesting a non-standard extension to MERGE
Key: HIVE-16258
URL: https://issues.apache.org/jira/browse/HIVE-16258
Project: Hive
Issue Type: Improvement
Reporter: Carter Shanklin
Some common data maintenance strategies, especially the Type 2 SCD update,
would become substantially easier with a small extension to the SQL standard
for MERGE, specifically the ability to say "when matched then insert". Per the
standard, matched records can only be updated or deleted.
In the Type 2 SCD, when a new record comes in you update the old version of the
record and insert the new version of the same record. If this extension were
supported, sample Type 2 SCD code would look as follows:
{code}
merge into customer
using new_customer_stage stage
on stage.source_pk = customer.source_pk
when not matched then insert values /* Insert a net new record */
(stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state,
true, null)
when matched then update set /* Update an old record to mark it as
out-of-date */
is_current = false, end_date = current_date()
when matched then insert values /* Insert a new current record */
(stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state,
true, null);
{code}
Without this support, the user needs to devise some sort of workaround. A
common approach is to first left join the staging table against the table to be
updated, then to join these results to a helper table that will spit out two
records for each match and one record for each miss. One of the matching
records needs to have a join key that can never occur in the source data so
this requires precise knowledge of the source dataset.
An example of this:
{code}
merge into customer
using (
select
*,
coalesce(invalid_key, source_pk) as join_key
from (
select
stage.source_pk, stage.name, stage.state,
case when customer.source_pk is null then 1
when stage.name <> customer.name or stage.state <> customer.state then 2
else 0 end as scd_row_type
from
new_customer_stage stage
left join
customer
on (stage.source_pk = customer.source_pk and customer.is_current = true)
) updates
join scd_types on scd_types.type = scd_row_type
) sub
on sub.join_key = customer.source_pk
when matched then update set
is_current = false,
end_date = current_date()
when not matched then insert values
(sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true,
null);
select * from customer order by source_pk;
{code}
This code is very complicated and will fail if the "invalid" key ever shows up
in the source dataset. This simple extension provides a lot of value and likely
very little maintenance overhead.
/cc [~ekoifman]
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)