[jira] [Commented] (HIVE-16832) duplicate ROW__ID possible in multi insert into transactional table

Gopal V (JIRA) Tue, 11 Jul 2017 13:34:42 -0700

    [ https://issues.apache.org/jira/browse/HIVE-16832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082885#comment-16082885 ]


Gopal V commented on HIVE-16832:
--------------------------------

bq. Suppose you populate a partition via 100 inserts and 1M rows. So you have 
100 OTIDs.

Yeah, this was an optimization for the possibility that you're doing an "update 
every row" merge which would otherwise cause a massive memory jump in deletes 
(& overflow the 2G limit on arrays).

bq. Perhaps simply relying on the "push down" to delete deltas is enough and we 
are better off just keeping 3 arrays

Yes, it might be better - I've yet to really look into the delete distribution 
for a regular CDC workload. The push-down into deletes is a big win anyway.

Not too worried about the extra size here.

> duplicate ROW__ID possible in multi insert into transactional table
> -------------------------------------------------------------------
>
>                 Key: HIVE-16832
>                 URL: https://issues.apache.org/jira/browse/HIVE-16832
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 2.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-16832.01.patch, HIVE-16832.03.patch, 
> HIVE-16832.04.patch, HIVE-16832.05.patch, HIVE-16832.06.patch, 
> HIVE-16832.08.patch, HIVE-16832.09.patch, HIVE-16832.10.patch, 
> HIVE-16832.11.patch, HIVE-16832.14.patch, HIVE-16832.15.patch, 
> HIVE-16832.16.patch, HIVE-16832.17.patch, HIVE-16832.18.patch, 
> HIVE-16832.19.patch, HIVE-16832.20.patch, HIVE-16832.20.patch
>
>
> {noformat}
>  create table AcidTablePart(a int, b int) partitioned by (p string) clustered 
> by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
>  create temporary table if not exists data1 (x int);
>  insert into data1 values (1);
>  from data1
>    insert into AcidTablePart partition(p) select 0, 0, 'p' || x
>    insert into AcidTablePart partition(p='p1') select 0, 1
> {noformat}
> Each branch of this multi-insert creates a row in partition p1/bucket0 with 
> ROW__ID=(1,0,0).
> The same can happen when running SQL Merge (HIVE-10924) statement that has 
> both Insert and Update clauses when target table has 
> _'transactional'='true','transactional_properties'='default'_  (see 
> HIVE-14035).  This is so because Merge is internally run as a multi-insert 
> statement.
> The solution relies on the statement ID introduced in HIVE-11030.  Each Insert 
> clause of a multi-insert gets a unique ID.
> The ROW__ID.bucketId now becomes a bit-packed triplet (format version, 
> bucketId, statementId).
> (Since ORC stores field names in the data file we can't rename 
> ROW__ID.bucketId).
> This ensures that there are no collisions and retains desired sort properties 
> of ROW__ID.
> In particular _SortedDynPartitionOptimizer_ works w/o any changes even in 
> cases where there are fewer reducers than buckets.  
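A minimal sketch of how a (format version, bucketId, statementId) triplet could be packed into the single int stored as ROW__ID.bucketId. The class name, field widths and bit positions below are assumptions made for illustration only; the actual layout is whatever the attached patches define. The sketch shows the two properties the description relies on: two insert branches writing to the same bucket get different packed values, and for a fixed version the packed values still sort by bucket first.

```java
// Hypothetical illustration: the class name, field widths and bit positions
// are assumptions, not necessarily the encoding used by the HIVE-16832 patch.
final class PackedBucketSketch {
    private static final int VERSION_SHIFT = 29;   // assumed: top 3 bits hold the format version
    private static final int BUCKET_SHIFT = 16;    // assumed: 12 bits hold the bucket id
    private static final int FIELD_MASK = 0xFFF;   // 12-bit mask for bucket id and statement id
    private static final int VERSION_MASK = 0x7;   // 3-bit mask for the format version

    private PackedBucketSketch() {}

    // Packs the triplet into one int; rejects values that do not fit their field.
    static int encode(int version, int bucketId, int statementId) {
        if (version < 0 || version > VERSION_MASK) {
            throw new IllegalArgumentException("version out of range: " + version);
        }
        if (bucketId < 0 || bucketId > FIELD_MASK) {
            throw new IllegalArgumentException("bucketId out of range: " + bucketId);
        }
        if (statementId < 0 || statementId > FIELD_MASK) {
            throw new IllegalArgumentException("statementId out of range: " + statementId);
        }
        return (version << VERSION_SHIFT) | (bucketId << BUCKET_SHIFT) | statementId;
    }

    static int decodeVersion(int packed) {
        return packed >>> VERSION_SHIFT;
    }

    static int decodeBucketId(int packed) {
        return (packed >>> BUCKET_SHIFT) & FIELD_MASK;
    }

    static int decodeStatementId(int packed) {
        return packed & FIELD_MASK;
    }

    public static void main(String[] args) {
        // Two branches of one multi-insert writing to bucket 0 of the same partition.
        int firstBranch = encode(1, 0, 0);
        int secondBranch = encode(1, 0, 1);
        System.out.println(firstBranch != secondBranch);
        // Bucket id stays the more significant sort key within one version.
        System.out.println(encode(1, 1, 0) > encode(1, 0, 4095));
    }
}
```

Because the statement id occupies the low-order bits in this sketch, rows from different insert clauses no longer collide on the same ROW__ID, while comparing packed values within one format version still orders rows by bucket before statement.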



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
