[ https://issues.apache.org/jira/browse/HIVE-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052903#comment-16052903 ]

Eugene Koifman edited comment on HIVE-16177 at 6/19/17 1:49 AM:
----------------------------------------------------------------

The file list is sorted to make sure there is consistent ordering for both the 
read path and compaction.
Compaction needs to process the whole list of files (for a bucket) and assign 
ROW_IDs consistently.
For read, OrcRawRecordReader just has a split from some file, so I need to 
order the files the same way so that the "offset" for the current file is 
computed the same way as it is during compaction.
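As a rough illustration (hypothetical helper, not the actual 
OrcRawRecordMerger code): once the original files for a bucket are sorted, 
each file's starting ROW_ID "offset" is just the cumulative row count of the 
files that sort before it, so a reader holding a single split and the 
compactor walking the whole list agree.
{noformat}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowIdOffsets {
  /**
   * Sketch only: sortedPaths is the bucket's original files in sorted order,
   * rowCounts maps each file to its row count.  The returned map gives the
   * first ROW_ID to assign inside each file.
   */
  public static Map<String, Long> startingRowIds(List<String> sortedPaths,
                                                 Map<String, Long> rowCounts) {
    Map<String, Long> offsets = new LinkedHashMap<>();
    long offset = 0;
    for (String path : sortedPaths) {
      offsets.put(path, offset);       // first ROW_ID in this file
      offset += rowCounts.get(path);   // next file continues the numbering
    }
    return offsets;
  }
}
{noformat}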

Since Hive doesn't restrict the layout of files in a table very well, sorting 
is the most general way to do this.
For example, say we realize that some "feature" places bucket files in 
subdirectories: sorting the whole list of "original" files makes this work 
with any directory layout.

The same goes for when we allow non-bucketed tables: files can be anywhere, 
and they need to be "numbered" consistently.  Sorting seems like the simplest 
way to do this.

Putting a Comparator in AcidUtils makes sense.
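Something like the sketch below (the class name and its placement are 
assumptions, not the actual AcidUtils API): compare the original files by 
their full path so every caller enumerates them in the same order, regardless 
of subdirectory layout.
{noformat}
import java.util.Comparator;
import org.apache.hadoop.fs.Path;

// Sketch only: order original files by their full path string so that the
// read side and the compactor see the same ordering, even when bucket files
// live in subdirectories.
public class OriginalFileComparator implements Comparator<Path> {
  @Override
  public int compare(Path p1, Path p2) {
    return p1.toUri().getPath().compareTo(p2.toUri().getPath());
  }
}
{noformat}
Sorting the collected "original" file list with such a comparator on both the 
read side and the compaction side is what keeps the ROW_ID numbering in sync.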

"totalSize" is probably because I run the tests on Mac.  Stats often differ on 
Mac.



was (Author: ekoifman):
The file list is sorted to make sure there is consistent ordering for both the 
read path and compaction.
Compaction needs to process the whole list of files (for a bucket) and assign 
ROW_IDs consistently.
For read, OrcRawRecordReader just has a split from some file, so I need to 
order the files the same way so that the "offset" for the current file is 
computed the same way as it is during compaction.

Since Hive doesn't restrict the layout of files in a table very well, sorting 
is the most general way to do this.
For example, say we realize that some "feature" places bucket files in 
subdirectories: sorting the whole list of "original" files makes this work 
with any directory layout.

Putting a Comparator in AcidUtils makes sense.

"totalSize" is probably because I run the tests on Mac.  Stats often differ on 
Mac.


> non Acid to acid conversion doesn't handle _copy_N files
> --------------------------------------------------------
>
>                 Key: HIVE-16177
>                 URL: https://issues.apache.org/jira/browse/HIVE-16177
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 0.14.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Blocker
>         Attachments: HIVE-16177.01.patch, HIVE-16177.02.patch, 
> HIVE-16177.04.patch, HIVE-16177.07.patch, HIVE-16177.08.patch, 
> HIVE-16177.09.patch, HIVE-16177.10.patch, HIVE-16177.11.patch, 
> HIVE-16177.14.patch, HIVE-16177.15.patch
>
>
> {noformat}
> create table T(a int, b int) clustered by (a)  into 2 buckets stored as orc 
> TBLPROPERTIES('transactional'='false')
> insert into T(a,b) values(1,2)
> insert into T(a,b) values(1,3)
> alter table T SET TBLPROPERTIES ('transactional'='true')
> {noformat}
> We should now have bucket files 000001_0 and 000001_0_copy_1, but 
> OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can be 
> copy_N files and numbers rows in each bucket from 0, thus generating 
> duplicate ROW_IDs.
> {noformat}
> select ROW__ID, INPUT__FILE__NAME, a, b from T
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2
> {"transactionid\":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
> {noformat}
> [~owen.omalley], do you have any thoughts on a good way to handle this?
> The attached patch has a few changes to make Acid even recognize copy_N 
> files, but this is just a prerequisite.  The new UT demonstrates the issue.
> Furthermore,
> {noformat}
> alter table T compact 'major'
> select ROW__ID, INPUT__FILE__NAME, a, b from T order by b
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0}    
> file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands....warehouse/nonacidorctbl/base_-9223372036854775808/bucket_00001
>     1       2
> {noformat}
> HIVE-16177.04.patch has TestTxnCommands.testNonAcidToAcidConversion0() 
> demonstrating this.
> This is because the compactor doesn't handle copy_N files either (it skips 
> them).


