[jira] [Commented] (HIVE-16832) duplicate ROW__ID possible in multi insert into transactional table

Gopal V (JIRA) Tue, 11 Jul 2017 01:34:00 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081773#comment-16081773
 ]


Gopal V commented on HIVE-16832:
--------------------------------

LGTM - +1.

Minor comments - the bit fields used for the bucket and statement ids are too 
big and misaligned (i.e. 3:14:15), which gets very hard to debug when looking 
at hex output (instead of raw binary). 

With 4k buckets & 4k statements, a layout of 3:1(reserved):12:4(reserved):12 
makes the hex output much easier to read, with 3 hex digits for each id - 
also, those 5 reserved bits may be of some use later. 
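
A minimal sketch of that 3:1:12:4:12 layout, to show why it reads well in 
hex. The field order (version, bucket id, statement id, high bits to low) 
follows the triplet in the issue description; the function and constant names 
are hypothetical and not taken from the patch.

```python
# Hypothetical widths for the proposed layout, high bits to low (32 bits total):
# version(3) : reserved(1) : bucket id(12) : reserved(4) : statement id(12)
VERSION_BITS = 3
BUCKET_BITS = 12
STATEMENT_BITS = 12

BUCKET_SHIFT = STATEMENT_BITS + 4            # skip the 4 reserved bits -> 16
VERSION_SHIFT = BUCKET_SHIFT + BUCKET_BITS + 1  # skip the 1 reserved bit -> 29


def encode(version, bucket_id, statement_id):
    """Pack the triplet into one 32-bit value; reserved bits stay zero."""
    for name, value, bits in (("version", version, VERSION_BITS),
                              ("bucket_id", bucket_id, BUCKET_BITS),
                              ("statement_id", statement_id, STATEMENT_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return (version << VERSION_SHIFT) | (bucket_id << BUCKET_SHIFT) | statement_id


def decode(packed):
    """Unpack a 32-bit value into (version, bucket_id, statement_id)."""
    version = (packed >> VERSION_SHIFT) & ((1 << VERSION_BITS) - 1)
    bucket_id = (packed >> BUCKET_SHIFT) & ((1 << BUCKET_BITS) - 1)
    statement_id = packed & ((1 << STATEMENT_BITS) - 1)
    return version, bucket_id, statement_id


packed = encode(1, 0xABC, 0x123)
print(format(packed, "08x"))  # bucket id and statement id each occupy 3 hex digits
```

Because each 12-bit id starts on a nibble boundary, the bucket id shows up as 
hex digits 2-4 and the statement id as the last 3 digits of the packed value.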

This patch is good, and we can make the inner loops faster in a later 
iteration, as the bucketproperty min-max is actually computed across the whole 
stripe/file (i.e. if min==max, then no more checks are needed).
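
A rough sketch of that min==max shortcut, assuming the per-stripe/file 
minimum and maximum of the bucket property are already available. The 
function and parameter names are hypothetical; this is not the loop from the 
patch, only the shape of the fast path.

```python
def rows_matching_bucket(bucket_props, wanted, stats_min, stats_max):
    """Return the indices of rows whose bucket property equals `wanted`.

    `stats_min` / `stats_max` are assumed to be the min and max of
    `bucket_props` over the whole stripe/file.
    """
    if stats_min == stats_max:
        # Every row carries the same value, so a single comparison decides
        # for the whole stripe/file and the per-row check is skipped.
        if stats_min == wanted:
            return list(range(len(bucket_props)))
        return []
    # Mixed values: fall back to checking each row.
    return [i for i, value in enumerate(bucket_props) if value == wanted]
```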

Compressed OTID got a bit bigger with this; perhaps it is better to build 
lists per statement id instead of storing it - that extra int will eat up 1 
long's worth of space, but the txn push-down from the main split -> delete 
deltas should ensure we never read too much data into that structure.

> duplicate ROW__ID possible in multi insert into transactional table
> -------------------------------------------------------------------
>
>                 Key: HIVE-16832
>                 URL: https://issues.apache.org/jira/browse/HIVE-16832
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 2.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-16832.01.patch, HIVE-16832.03.patch, 
> HIVE-16832.04.patch, HIVE-16832.05.patch, HIVE-16832.06.patch, 
> HIVE-16832.08.patch, HIVE-16832.09.patch, HIVE-16832.10.patch, 
> HIVE-16832.11.patch, HIVE-16832.14.patch, HIVE-16832.15.patch, 
> HIVE-16832.16.patch, HIVE-16832.17.patch, HIVE-16832.18.patch, 
> HIVE-16832.19.patch, HIVE-16832.20.patch, HIVE-16832.20.patch
>
>
> {noformat}
>  create table AcidTablePart(a int, b int) partitioned by (p string) clustered 
> by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
>  create temporary table if not exists data1 (x int);
>  insert into data1 values (1);
>  from data1
>    insert into AcidTablePart partition(p) select 0, 0, 'p' || x
>    insert into AcidTablePart partition(p='p1') select 0, 1
> {noformat}
> Each branch of this multi-insert creates a row in partition p1/bucket0 with 
> ROW__ID=(1,0,0).
> The same can happen when running SQL Merge (HIVE-10924) statement that has 
> both Insert and Update clauses when target table has 
> _'transactional'='true','transactional_properties'='default'_  (see 
> HIVE-14035).  This is so because Merge is internally run as a multi-insert 
> statement.
> The solution relies on statement ID introduced in HIVE-11030.  Each Insert 
> clause of a multi-insert gets a unique ID.
> The ROW__ID.bucketId now becomes a bit packed triplet (format version, 
> bucketId, statementId).
> (Since ORC stores field names in the data file we can't rename 
> ROW__ID.bucketId).
> This ensures that there are no collisions and retains desired sort properties 
> of ROW__ID.
> In particular _SortedDynPartitionOptimizer_ works w/o any changes even in 
> cases where there are fewer reducers than buckets.  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
