[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

Rajat Venkatesh (JIRA) Thu, 16 Oct 2014 01:41:13 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173534#comment-14173534
 ]


Rajat Venkatesh commented on HIVE-8467:
---------------------------------------

No, they don't have to. The databases I know provide both options: sync on user 
input or automatically. I am not confident we can support automatic sync on 
external tables. Since that feels like a big feature gap, I chose a different 
name.

Yes, we also have diffs we would like to contribute to other projects so they 
can use Table Copy. Since the optimization is at the storage level, it's very 
simple: replace partitions with the ones from the table copy when possible, or 
directories when it comes to Pig or M/R. If materialized views are chosen 
instead, then the optimizers have to mature in more or less lock step. 
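
To illustrate the storage-level substitution described above, here is a minimal 
sketch. It assumes each table exposes a mapping from partition name to 
directory; the function name, the dictionaries and the paths in the usage 
example are hypothetical and not taken from the attached design.

```python
from typing import Dict, List


def resolve_partition_paths(
    requested: List[str],
    source_paths: Dict[str, str],
    copy_paths: Dict[str, str],
) -> Dict[str, str]:
    """Pick the directory to read for each requested partition.

    A partition is read from the table copy when the copy holds it,
    and from the original Hive table otherwise.
    """
    resolved = {}
    for name in requested:
        if name in copy_paths:
            resolved[name] = copy_paths[name]      # optimized copy
        else:
            resolved[name] = source_paths[name]    # original table
    return resolved


# Hypothetical layout: only the newest partition has been copied.
source = {
    "dt=2014-10-14": "/warehouse/logs_csv/dt=2014-10-14",
    "dt=2014-10-15": "/warehouse/logs_csv/dt=2014-10-15",
}
copy = {"dt=2014-10-15": "/warehouse/logs_orc/dt=2014-10-15"}
paths = resolve_partition_paths(list(source), source, copy)
```

A Pig or M/R job would take the resolved directories as its input paths, which 
is why no optimizer changes are needed in this scheme.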

WRT retention policy, the common case is to keep only the newest n 
partitions, limited by the size of the copy. We didn't choose a date range 
because sometimes the date partition is not the top-level one. This is a moving 
window. If older partitions are accessed, the query falls back to reading 
those partitions from the Hive table. 
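
The moving window could be sketched roughly as follows. This assumes partitions 
are given oldest to newest with their sizes in bytes, and that the window stops 
growing as soon as either limit would be exceeded; the function and parameter 
names are hypothetical.

```python
from typing import List, Tuple


def select_retained(
    partitions: List[Tuple[str, int]],
    max_partitions: int,
    max_bytes: int,
) -> List[str]:
    """Return the names of the partitions kept in the copy, newest first.

    `partitions` is a list of (name, size_in_bytes), ordered oldest to newest.
    """
    kept: List[str] = []
    total = 0
    for name, size in reversed(partitions):   # walk from the newest partition
        if len(kept) == max_partitions or total + size > max_bytes:
            break                             # window is full
        kept.append(name)
        total += size
    return kept


# Hypothetical sizes: keep at most 3 partitions and at most 80 bytes.
window = select_retained(
    [("a", 40), ("b", 30), ("c", 50), ("d", 20)],
    max_partitions=3,
    max_bytes=80,
)
```

Anything outside the returned window would be dropped from the copy and served 
from the Hive table instead.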

> Table Copy - Background, incremental data load
> ----------------------------------------------
>
>                 Key: HIVE-8467
>                 URL: https://issues.apache.org/jira/browse/HIVE-8467
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Rajat Venkatesh
>         Attachments: Table Copies.pdf
>
>
> Traditionally, Hive and other tools in the Hadoop ecosystem haven't required 
> a load stage. However, with recent developments, Hive is much more performant 
> when data is stored in specific formats like ORC, Parquet, Avro etc. 
> Technologies like Presto also work much better with certain data formats. At 
> the same time, data is generated or obtained from 3rd parties in non-optimal 
> formats such as CSV, tab-delimited or JSON. Many times, it's not an option to 
> change the data format at the source. We've found that users either use 
> sub-optimal formats or spend a large amount of effort creating and 
> maintaining copies. We want to propose a new construct, Table Copy, to help 
> “load” data into an optimal storage format.
> I am going to attach a PDF document with a lot more details, especially 
> addressing how this is different from bulk loads in relational DBs or 
> materialized views.
> Looking forward to hearing if others see a similar need to formalize 
> conversion of data to different storage formats. If yes, are the details in 
> the PDF document a good start?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
