[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

Rajat Venkatesh (JIRA) Wed, 15 Oct 2014 22:00:07 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173373#comment-14173373
 ]


Rajat Venkatesh commented on HIVE-8467:
---------------------------------------

"guaranteed to be the same" is the real bugbear. WRT to managed tables or 
databases, this is a tractable problem. Typically one can augment DML plans to 
keep the materialized views in sync. A mechanism to invalidate views and 
refresh them in the background will also be required. 

When it comes to external tables, the situation is a lot more haphazard. Users 
add files, remove files or rewrite files and expect them to be available when 
they query the table. Also, data can change in partitions a few days old. For 
example, some third-party data providers send corrections after 3 days. In such a 
situation, the only way I can think of to guarantee that a view is synced is by 
scanning the directories. It would be great to hear if others have a better 
plan. So I've avoided the term materialized views to put the onus on the user 
to keep copies of external tables in sync. In that sense, table copy is 
complementary to materialized views. Use materialized views on managed tables 
and table copies on external tables.
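
As a rough sketch of what "scanning the directories" could look like: record 
the size and modification time of every file under the table location, compare 
against the previous scan, and mark the affected partition directories as 
stale. This is only an illustration. The function names are made up, it 
assumes a local filesystem walked with os.walk (a real version would use the 
HDFS or object store listing APIs), and it assumes a rewritten file changes 
its size or modification time.

```python
import os


def snapshot(root):
    """Map each file path (relative to root) to (size, mtime_ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            state[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return state


def diff_snapshots(old, new):
    """Return (added, removed, rewritten) file paths between two scans."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    rewritten = sorted(p for p in new if p in old and new[p] != old[p])
    return added, removed, rewritten


def stale_partitions(old, new):
    """Partition directories whose files changed and need to be re-copied."""
    added, removed, rewritten = diff_snapshots(old, new)
    return sorted({os.path.dirname(p) for p in added + removed + rewritten})
```

The cost is a full listing of the table location on every check, which is 
the reason a better plan would be welcome.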

Another factor is that we want to make these copies available to other 
execution engines and languages. In our case those are Presto, Pig and M/R. The 
idea is to use Hive to manage these copies and read them from the other engines 
as well. This also means that we have to cater to the lowest common denominator. 

From your description of CBO, I think it should be relatively straightforward 
to bring in Table Copies. Can Calcite make decisions at the partition level 
too? We would like to handle situations where some partitions are not 
available in the copy.
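
To make the partition-level question concrete, here is a minimal sketch of the 
decision we have in mind. It is plain Python, not Calcite or Hive code. The 
function plan_partition_reads and the partition names are hypothetical and 
only illustrate splitting a query between the copy and the source table.

```python
def plan_partition_reads(requested, copy_partitions):
    """Split requested partitions by where they can be read from.

    requested: partitions the query needs, in query order.
    copy_partitions: set of partitions currently present in the table copy.
    """
    from_copy = [p for p in requested if p in copy_partitions]
    from_source = [p for p in requested if p not in copy_partitions]
    return {"copy": from_copy, "source": from_source}


# Hypothetical example: the copy lags the source by one day.
plan = plan_partition_reads(
    ["dt=2014-10-13", "dt=2014-10-14", "dt=2014-10-15"],
    {"dt=2014-10-13", "dt=2014-10-14"},
)
```

Partitions missing from the copy fall back to the source table in its 
original format, so the query stays correct and only loses the speedup for 
those partitions.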

> Table Copy - Background, incremental data load
> ----------------------------------------------
>
>                 Key: HIVE-8467
>                 URL: https://issues.apache.org/jira/browse/HIVE-8467
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Rajat Venkatesh
>         Attachments: Table Copies.pdf
>
>
> Traditionally, Hive and other tools in the Hadoop ecosystem haven't required 
> a load stage. However, with recent developments, Hive is much more performant 
> when data is stored in specific formats like ORC, Parquet, Avro etc. 
> Technologies like Presto also work much better with certain data formats. At 
> the same time, data is generated or obtained from 3rd parties in non-optimal 
> formats such as CSV, tab-delimited or JSON. Often, it's not an option to 
> change the data format at the source. We've found that users either use 
> sub-optimal formats or spend a large amount of effort creating and 
> maintaining copies. We want to propose a new construct - Table Copy - to help 
> “load” data into an optimal storage format.
> I am going to attach a PDF document with a lot more details, especially 
> addressing how this is different from bulk loads in relational DBs or 
> materialized views.
> Looking forward to hearing if others see a similar need to formalize conversion 
> of data to different storage formats. If yes, are the details in the PDF 
> document a good start?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
