[ https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795678#comment-16795678 ]
Abhishek Somani edited comment on HIVE-13479 at 3/19/19 5:01 AM:
-----------------------------------------------------------------

[~ekoifman] [~vgumashta] [~gopalv] Do we have any plans to work on this?

As I understand it, this might not be a restriction on Insert Only tables, since there are no row_ids there, but it would need work in the ORC readers, which assume records are sorted on row_ids so that delete events can be applied with a sort merge. Is that correct?


was (Author: asomani):
[~ekoifman] [~vgumashta] [~gopalv] Do we have any plans to work on this?

> Relax sorting requirement in ACID tables
> ----------------------------------------
>
>                 Key: HIVE-13479
>                 URL: https://issues.apache.org/jira/browse/HIVE-13479
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>    Affects Versions: 1.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Major
>   Original Estimate: 160h
>  Remaining Estimate: 160h
>
> Currently ACID tables require data to be sorted according to the internal
> primary key. This is so that base + delta files can be efficiently
> sort/merged to produce the snapshot for the current transaction.
> This prevents the user from sorting the table on any other criteria, which
> can be useful. One example is dynamic partition insert (which also occurs
> for update/delete SQL). This may create lots of writers
> (buckets*partitions) and tax cluster resources.
> The usual solution is hive.optimize.sort.dynamic.partition=true, which won't
> be honored for ACID tables.
> We could rely on a hash table based algorithm to merge delta files and then
> not require any particular sort on ACID tables. One way to do that is to
> treat each update event as an insert (new internal PK) + delete (old PK).
> Delete events are very small since they just need to contain PKs. So the
> hash table would just need to contain delete events and be reasonably
> memory efficient.
> This is a significant amount of work but worth doing.
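
The hash-based merge proposed in the description can be illustrated with a minimal sketch. This is not Hive code: the names Row, apply_update and snapshot are hypothetical, the internal PK is assumed to be a (write id, bucket, row id) tuple, and transaction visibility checks are left out. It only shows that once an update is split into insert + delete, the reader needs a hash set of deleted PKs and can scan base and delta rows in any order.

```python
from itertools import chain
from typing import Iterable, Iterator, List, NamedTuple, Set, Tuple

# Assumed shape of the internal primary key: (write id, bucket, row id).
PK = Tuple[int, int, int]


class Row(NamedTuple):
    pk: PK
    value: str


def apply_update(old_pk: PK, new_row: Row,
                 inserts: List[Row], deletes: Set[PK]) -> None:
    """Record an update as delete(old PK) + insert(row with a new PK)."""
    deletes.add(old_pk)
    inserts.append(new_row)


def snapshot(base: Iterable[Row], delta_inserts: Iterable[Row],
             delete_events: Iterable[PK]) -> Iterator[Row]:
    """Yield the live rows; neither input needs to be sorted by PK."""
    deleted = set(delete_events)  # only PKs are held in memory
    for row in chain(base, delta_inserts):
        if row.pk not in deleted:
            yield row


# Base file rows in arbitrary (non-PK) order.
base = [Row((1, 0, 2), "c"), Row((1, 0, 0), "a"), Row((1, 0, 1), "b")]
inserts: List[Row] = []
deletes: Set[PK] = set()

apply_update((1, 0, 1), Row((2, 0, 0), "b2"), inserts, deletes)  # UPDATE
deletes.add((1, 0, 0))                                           # DELETE

result = list(snapshot(base, inserts, deletes))
```

Memory use in this sketch grows with the number of delete events rather than with the table size, which is the property the description relies on.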