Hi Bhargav,

We solve a similar problem for incremental maintenance
<https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/views/HiveAugmentMaterializationRule.java>
for materialized views.

row__id.writeid can indeed be used for that scenario. You just need to know
the current snapshot of the system at read time (<high_watermark, list
of open transactions>). Then you add a filter operator on top of that
table, making explicit the data contained in it. The filter roughly takes
the form ROW_ID.writeid <= high_watermark AND ROW_ID.writeid NOT IN
(open/invalid_ids). Information about how the "writeid" is generated can
be found in https://jira.apache.org/jira/browse/HIVE-18192 .
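
As a rough, untested sketch of that filter, assuming a table named t and
placeholder literals standing in for the snapshot information you capture
at read time, the query could look like this:

    -- Sketch only: 1000 stands in for the high_watermark and (998, 999)
    -- for the open/invalid write ids of the snapshot; substitute the
    -- actual values you recorded when the snapshot was taken.
    SELECT *
    FROM t
    WHERE row__id.writeid <= 1000
      AND row__id.writeid NOT IN (998, 999);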

Note that when source tables are not append-only and update/delete
operations may have been executed over them, the problem becomes trickier,
since there is currently no way to retrieve the update/delete records from
the delta files (contributions would be welcome).

Cheers,
Jesús


On Mon, May 6, 2019 at 6:23 AM Bhargav Bipinchandra Naik (Seller
Platform-BLR) <bhargav.n...@flipkart.com> wrote:

> We have a scenario where we want to consume only delta updates from Hive
> tables.
> - Multiple producers are updating data in Hive table
> - Multiple consumer reading data from the Hive table
>
> Consumption pattern:
> - Get all data that has been updated since last time I read.
>
> Is there any mechanism in Hive 3.0 which can enable above consumption
> pattern?
>
> I see there is a construct of row__id(writeid, bucketid, rowid) in ACID
> tables.
> - Can row__id be used in this scenario?
> - How is the "writeid" generated?
> - Is there some meta information which captures the time when the rows
> were actually visible for read?
>
