Re: Consuming delta from Hive tables

Bhargav Bipinchandra Naik (Seller Platform-BLR) Tue, 07 May 2019 05:32:53 -0700

Hi Jesus and Alan,

Thanks for the prompt reply.
I have a follow-up question.


*timestamp:* t1 < t2 < t3 < t4 < t5 < t6

*w1 -* transaction which updates a subset of rows in table T {start_time: t1,
end_time: t5}
*w2 -* transaction which updates a subset of rows in table T {start_time: t2,
end_time: t3}
*r1 -* job which reads rows from table T {start_time: t4}
*r2 -* job which reads rows from table T {start_time: t6}

- Is the write_id a strictly increasing number across rows?
- Is the write_id a version number per row and not a global construct?
- Will the subset of rows updated by w1 have write_ids greater than the
write_ids of the rows updated by w2?

Say job r1, which consumed the data at t4, saw a maximum write_id of 100.
Will the rows updated by w1 (end_time: t5) always have write_id > 100?
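
To make the concern concrete, here is a minimal sketch of the timeline above.
It is not Hive code. It assumes (this is the very point being asked) that a
write_id is handed out when a transaction starts writing, while its rows only
become visible at commit. The ids 99 and 100 and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    name: str
    write_id: int  # assumed: allocated when the transaction starts writing
    start: int
    end: int       # commit time; rows are visible to readers from here on


# Timeline from the question, with t1..t6 modelled as the integers 1..6.
W1 = Write("w1", write_id=99, start=1, end=5)
W2 = Write("w2", write_id=100, start=2, end=3)
WRITES = [W1, W2]


def visible_at(t):
    """Writes that have committed at or before read time t."""
    return [w for w in WRITES if w.end <= t]


def naive_delta(t, checkpoint):
    """Visible writes whose write_id is strictly above the stored checkpoint."""
    return [w.name for w in visible_at(t) if w.write_id > checkpoint]


# r1 reads at t4: only w2 has committed, so the maximum write_id seen is 100.
r1_checkpoint = max(w.write_id for w in visible_at(4))

# r2 reads at t6 with r1's checkpoint: w1 committed at t5 but carries id 99.
r2_delta = naive_delta(6, r1_checkpoint)

# Writes that became visible after t4 yet are absent from r2's delta.
missed = [w.name for w in visible_at(6) if w.end > 4 and w.name not in r2_delta]
```

Under that assumption `missed` contains w1, so a checkpoint made of only the
maximum write_id would silently skip w1's rows.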

Basically, I need some kind of checkpoint from which the next run of the
read job can read only the data updated since that checkpoint.
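
One shape such a checkpoint could take is sketched below: the highest
write_id allocated at read time, plus the ids at or below it that had not
yet committed. This is only an illustration with hypothetical names. The
inputs (allocated and committed write_ids) would have to come from Hive's own
transaction metadata, which per Alan's reply is not available outside Hive,
and compaction may already have merged the deltas away.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    high_watermark: int  # highest write_id allocated when the read happened
    open_ids: frozenset  # ids at or below it that had not committed yet


def take_checkpoint(allocated, committed):
    """Record what a reader could and could not see at read time."""
    hwm = max(allocated, default=0)
    return Checkpoint(hwm, frozenset(allocated) - frozenset(committed))


def delta_since(prev, committed_now):
    """write_ids whose rows became visible after `prev` was taken."""
    newer = {w for w in committed_now if w > prev.high_watermark}
    late = {w for w in committed_now if w in prev.open_ids}
    return sorted(newer | late)


# r1 at t4: ids 99 (w1) and 100 (w2) are allocated, only 100 has committed.
cp = take_checkpoint(allocated={99, 100}, committed={100})

# r2 at t6: both have committed, so the delta is w1's late commit.
delta = delta_since(cp, committed_now={99, 100})
```

Here `delta` holds only id 99, i.e. the rows written by w1, which a plain
"write_id > 100" filter would not return.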

Thanks,
Bhargav

On Mon, May 6, 2019 at 11:39 PM Alan Gates <alanfga...@gmail.com> wrote:

> The other issue is an external system has no ability to control when the
> compactor is run (it rewrites deltas into the base files and thus erases
> intermediate states that would interest you).  The mapping of writeids
> (table specific) to transaction ids (system wide) is also cleaned
> intermittently, again erasing history.  And there's no way to get the
> mapping from writeids to transaction ids from outside of Hive.
>
> Alan.
>
> On Mon, May 6, 2019 at 6:23 AM Bhargav Bipinchandra Naik (Seller
> Platform-BLR) <bhargav.n...@flipkart.com> wrote:
>
>> We have a scenario where we want to consume only delta updates from Hive
>> tables.
>> - Multiple producers are updating data in a Hive table
>> - Multiple consumers are reading data from the Hive table
>>
>> Consumption pattern:
>> - Get all data that has been updated since last time I read.
>>
>> Is there any mechanism in Hive 3.0 which can enable the above consumption
>> pattern?
>>
>> I see there is a construct of row__id(writeid, bucketid, rowid) in ACID
>> tables.
>> - Can row__id be used in this scenario?
>> - How is the "writeid" generated?
>> - Is there some meta information which captures the time when the rows
>> were actually visible for read?
>>
>
