[jira] [Updated] (HIVE-18098) Add support for Export/Import for Acid tables

Eugene Koifman (JIRA) Thu, 30 Nov 2017 13:11:17 -0800

     [ https://issues.apache.org/jira/browse/HIVE-18098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Eugene Koifman updated HIVE-18098:
----------------------------------
    Description: 
How should this work?
For regular tables, export just copies the files under the table root to a 
specified directory.
This doesn't make sense for Acid tables:
* Some data may belong to aborted transactions.
* Transaction IDs are embedded into data/file names.  You'd have to export 
delta/ and base/, each of which may have files with the same names, e.g. 
bucket_00000.
* On import these IDs won't make sense in a different cluster or even a 
different table (which may have delta_x_x, for example, for the same x but 
with different data, of course).
* Export creates a _metadata file with column types, storage format, etc.  
Perhaps it can include info about aborted IDs (if the whole file can't be 
skipped).
* Even importing into the same table on the same cluster may be a problem.  For 
example, delta_5_5/ existed at the time of export and was included in the 
export, but 2 days later it may not exist because it was compacted and cleaned.
* If importing back into the same table on the same cluster, the data could be 
imported into a different transaction (assuming per-table writeIDs) w/o having 
to remap the IDs in the rows themselves.
* Support Import Overwrite?
* Support Import as a new txn with remapping of ROW__IDs?  The new writeID can 
be stored in a delta_x_x/_meta_data and ROW__IDs can be remapped at read time 
(like isOriginal) and made permanent by compaction.
* It doesn't seem reasonable to import Acid data into a non-Acid table.
* Perhaps import can work similarly to Load Data: look at the file imported and, 
if it has Acid columns, leave a note in the delta_x_x/_meta_data to indicate 
that these columns should be skipped and new ROW__IDs assigned at read time.
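
To make the points above about directory names and aborted IDs concrete, here is a rough Python sketch (not Hive code) of how an import might classify exported directories before writing everything under a single new writeID. The names parse_acid_dir and plan_import, the shape of the returned plan, and the idea that the set of aborted IDs is available to the importer (for example via the export's _metadata) are all assumptions made for illustration. It also assumes base_N holds only committed data.

```python
import re
from typing import Iterable, NamedTuple, Optional, Set

# Directory name patterns discussed above, e.g. delta_5_5 or base_7.
# An optional trailing statement-id suffix (delta_5_5_0) is tolerated.
_DELTA = re.compile(r"^(delta|delete_delta)_(\d+)_(\d+)(?:_\d+)?$")
_BASE = re.compile(r"^base_(\d+)$")


class AcidDir(NamedTuple):
    kind: str    # "base", "delta" or "delete_delta"
    min_id: int  # lowest ID covered by the directory
    max_id: int  # highest ID covered by the directory


def parse_acid_dir(name: str) -> Optional[AcidDir]:
    """Return the IDs embedded in a directory name, or None if not Acid."""
    m = _DELTA.match(name)
    if m:
        return AcidDir(m.group(1), int(m.group(2)), int(m.group(3)))
    m = _BASE.match(name)
    if m:
        n = int(m.group(1))
        return AcidDir("base", n, n)
    return None


def plan_import(exported: Iterable[str], aborted: Set[int],
                new_write_id: int) -> dict:
    """Hypothetical planner: decide what to do with each exported directory.

    copy    -> no aborted IDs in range, data can go to the target as-is
    filter  -> some IDs in range are aborted, rows must be checked
    skip    -> every ID in range is aborted, the directory can be dropped
    unknown -> not an Acid directory name (e.g. the _metadata file)
    """
    plan = {
        "target": f"delta_{new_write_id}_{new_write_id}",
        "copy": [], "filter": [], "skip": [], "unknown": [],
    }
    for name in exported:
        d = parse_acid_dir(name)
        if d is None:
            plan["unknown"].append(name)
        elif d.kind == "base":
            # Assumption: a base holds only committed data.
            plan["copy"].append(name)
        else:
            hits = sum(1 for a in aborted if d.min_id <= a <= d.max_id)
            span = d.max_id - d.min_id + 1
            if hits == 0:
                plan["copy"].append(name)
            elif hits == span:
                plan["skip"].append(name)
            else:
                plan["filter"].append(name)
    return plan


plan = plan_import(
    ["base_3", "delta_4_4", "delta_5_5", "delta_6_8", "_metadata"],
    aborted={5, 7},
    new_write_id=12,
)
```

The sketch deliberately leaves out the file-name collision problem (several source directories each containing bucket_00000) and the ROW__ID remapping itself. It only shows that the IDs in the exported names are useful for deciding what to keep, not for naming anything on the target side.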


  was:
How should this work?
For regular tables, export just copies the files under the table root to a 
specified directory.
This doesn't make sense for Acid tables:
* Some data may belong to aborted transactions.
* Transaction IDs are embedded into data/file names.  You'd have to export 
delta/ and base/, each of which may have files with the same names, e.g. 
bucket_00000.
* On import these IDs won't make sense in a different cluster or even a 
different table (which may have delta_x_x, for example, for the same x but 
with different data, of course).
* Export creates a _metadata file with column types, storage format, etc.  
Perhaps it can include info about aborted IDs (if the whole file can't be 
skipped).
* Even importing into the same table on the same cluster may be a problem.  For 
example, delta_5_5/ existed at the time of export and was included in the 
export, but 2 days later it may not exist because it was compacted and cleaned.
* If importing back into the same table on the same cluster, the data could be 
imported into a different transaction (assuming per-table writeIDs) w/o having 
to remap the IDs in the rows themselves.
* Support Import Overwrite?
* Support Import as a new txn with remapping of ROW__IDs?  The new writeID can 
be stored in a delta_x_x/_meta_data and ROW__IDs can be remapped at read time 
(like isOriginal) and made permanent by compaction.
* It doesn't seem reasonable to import Acid data into a non-Acid table.




> Add support for Export/Import for Acid tables
> ---------------------------------------------
>
>                 Key: HIVE-18098
>                 URL: https://issues.apache.org/jira/browse/HIVE-18098
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
