Is the following scenario supported?

*timestamp:* t1 < t2 < t3 < t4 < t5 < t6

*w1 -* transaction which updates a subset of rows in table T {start_time:
t1, end_time: t5}
*w2 -* transaction which updates a subset of rows in table T {start_time:
t2, end_time: t3}
*r1 -* job which reads rows from table T {start_time: t4}
*r2 -* job which reads rows from table T {start_time: t6}

- Is the write_id a strictly increasing number across rows?
- Is the write_id a version number per row rather than a global construct?
- Will the subset of rows updated by w1 have write_ids greater than the
write_ids of the rows updated by w2?
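
For reference, here is how I am inspecting per-row write ids. This is a
minimal sketch, assuming T is a full ACID (transactional) table in Hive 3,
using the row__id virtual column from my original question below:

    -- Inspect per-row write ids on a full ACID table.
    -- row__id is Hive's virtual struct (writeid, bucketid, rowid).
    SELECT row__id.writeid, row__id.bucketid, row__id.rowid
    FROM T;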

Say the data consumed by job r1 at t4 had a maximum write_id of 100.
Will the rows updated by transaction w1 (end_time: t5) always have
write_id > 100?

Basically, I need some kind of checkpoint from which the next run of the
read job can read only the data updated since the previous run.
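
Something like the following sketch is what I have in mind. It is
hypothetical: it only works if write_ids increase with commit order, which
is exactly what I am asking about, and ${hiveconf:last_checkpoint} is a
placeholder for the maximum write_id persisted by the previous run:

    -- Hypothetical incremental read: fetch only rows changed since the
    -- last checkpoint. Correct only if write_ids grow with commit order.
    SELECT *
    FROM T
    WHERE row__id.writeid > ${hiveconf:last_checkpoint};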

Thanks,
-Bhargav

On Mon, May 6, 2019 at 11:39 PM Alan Gates <alanfga...@gmail.com> wrote:

> The other issue is that an external system has no ability to control when
> the compactor is run (it rewrites deltas into the base files and thus
> erases intermediate states that would interest you).  The mapping of
> writeids (table specific) to transaction ids (system wide) is also cleaned
> intermittently, again erasing history.  And there's no way to get the
> mapping from writeids to transaction ids from outside of Hive.
>
> Alan.
>
> On Mon, May 6, 2019 at 6:23 AM Bhargav Bipinchandra Naik (Seller
> Platform-BLR) <bhargav.n...@flipkart.com> wrote:
>
>> We have a scenario where we want to consume only delta updates from Hive
>> tables.
>> - Multiple producers are updating data in the Hive table
>> - Multiple consumers are reading data from the Hive table
>>
>> Consumption pattern:
>> - Get all data that has been updated since the last time I read.
>>
>> Is there any mechanism in Hive 3.0 which can enable the above
>> consumption pattern?
>>
>> I see there is a construct of row__id (writeid, bucketid, rowid) in ACID
>> tables.
>> - Can row__id be used in this scenario?
>> - How is the "writeid" generated?
>> - Is there some meta information which captures the time when the rows
>> were actually visible for read?
