Hi Bhargav,

We solve a similar problem for incremental maintenance of materialized views:
<https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/views/HiveAugmentMaterializationRule.java>
Indeed, row__id.writeid can be used for that scenario. You just need to know the current snapshot of the system at reading time (<high_watermark, list of open transactions>). Then you add a filter operator on top of that table, making explicit which data it contains. The filter will roughly take the form ROW__ID.writeid <= high_watermark AND ROW__ID.writeid NOT IN (open/invalid_ids); a rough HiveQL sketch is included after the quoted message below. Information about how "writeid" is generated can be found in https://jira.apache.org/jira/browse/HIVE-18192 .

Note that when the source tables are not append-only and update/delete operations may have been executed over them, the problem becomes trickier, since there is currently no way to retrieve update/delete records from the delta files (contributions would be welcome).

Cheers,
Jesús

On Mon, May 6, 2019 at 6:23 AM Bhargav Bipinchandra Naik (Seller Platform-BLR) <bhargav.n...@flipkart.com> wrote:

> We have a scenario where we want to consume only delta updates from Hive tables.
> - Multiple producers are updating data in the Hive table
> - Multiple consumers are reading data from the Hive table
>
> Consumption pattern:
> - Get all data that has been updated since the last time I read.
>
> Is there any mechanism in Hive 3.0 which can enable the above consumption pattern?
>
> I see there is a construct of row__id(writeid, bucketid, rowid) in ACID tables.
> - Can row__id be used in this scenario?
> - How is the "writeid" generated?
> - Is there some meta information which captures the time when the rows were actually visible for read?
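
For reference, here is the rough HiveQL sketch mentioned above. The table name "orders" and the concrete writeid values are placeholders; in practice the current high watermark and the list of open/invalid writeids come from the snapshot you capture at read time, and the lower bound is whatever watermark you stored after your previous read.

  -- Illustrative sketch only; table name and writeid values are placeholders.
  SELECT t.*
  FROM orders t
  WHERE t.ROW__ID.writeid <= 1500         -- high watermark of the snapshot taken at read time
    AND t.ROW__ID.writeid NOT IN (1498)   -- writeids still open/invalid in that snapshot
    AND t.ROW__ID.writeid > 1200;         -- high watermark recorded after the previous read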