[ https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797769#comment-16797769 ]
Eugene Koifman commented on HIVE-13479: --------------------------------------- There is no sorting restriction on insert-only ACID tables. Delete event filtering (HIVE-20738) for full-crud tables relies on the fact that data is ordered by ROW__ID. I don't think there is anything that precludes INSERT INTO T .... SORT BY ... for full-crud table That should be enough to make min/max in ORC useful for predicate push-down in a lot of cases. IOW is supported and I think could be used to re-sort the table by any column (and will generate new row_id) but it's currently an operation with X lock. With some work, IOW could run with less strict lock, that allows reads but not any other writes. Compaction that does overwrite would have the same issue which is likely too restrictive. IOW (directly from user or compactor) is also problematic since it will invalidate all result set caches and materialized views. Incidentally, {{hive.optimize.sort.dynamic.partition=true}} was fixed on ACID tables long ago. > Relax sorting requirement in ACID tables > ---------------------------------------- > > Key: HIVE-13479 > URL: https://issues.apache.org/jira/browse/HIVE-13479 > Project: Hive > Issue Type: New Feature > Components: Transactions > Affects Versions: 1.2.0 > Reporter: Eugene Koifman > Assignee: Eugene Koifman > Priority: Major > Original Estimate: 160h > Remaining Estimate: 160h > > Currently ACID tables require data to be sorted according to internal primary > key. This is that base + delta files can be efficiently sort/merged to > produce the snapshot for current transaction. > This prevents the user to make the table sorted based on any other criteria > which can be useful. One example is using dynamic partition insert (which > also occurs for update/delete SQL). This may create lots of writers > (buckets*partitions) and tax cluster resources. > The usual solution is hive.optimize.sort.dynamic.partition=true which won't > be honored for ACID tables. > We could rely on hash table based algorithm to merge delta files and then not > require any particular sort on Acid tables. One way to do that is to treat > each update event as an Insert (new internal PK) + delete (old PK). Delete > events are very small since they just need to contain PKs. So the hash table > would just need to contain Delete events and be reasonably memory efficient. > This is a significant amount of work but worth doing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)