Re: Consuming delta from Hive tables

Bhargav Bipinchandra Naik (Seller Platform-BLR) Sun, 19 May 2019 23:21:50 -0700

Hi Alan,


Are write_ids monotonically increasing?
Are write_ids accessible in the hive query?
For example:
select * from table_name where write_id > N;

Basically, I am trying to understand whether I can use write_id to consume
only updated rows: store the maximum write_id (X) seen in the result, and
next time query for all rows with write_id greater than X.
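
To make the idea concrete, here is a rough sketch in plain Python rather than
HiveQL, since whether write_id can be referenced in a query is exactly the open
question. The Row type, the rows_since helper and the sample data are all
hypothetical. The filter compares write_id, because per the reply quoted below
row_ids restart at 0 within each write_id. It does not model transactions that
commit out of order.

```python
from typing import NamedTuple


class Row(NamedTuple):
    write_id: int  # shared by every row written by one transaction on the table
    row_id: int    # restarts at 0 within each write_id
    payload: str


def rows_since(rows, checkpoint):
    """Return rows with write_id above the checkpoint, plus the new checkpoint."""
    fresh = [r for r in rows if r.write_id > checkpoint]
    # Keep the old checkpoint when nothing new was seen.
    new_checkpoint = max((r.write_id for r in fresh), default=checkpoint)
    return fresh, new_checkpoint


# Hypothetical contents: w1 (write_id 1) and w2 (write_id 2) insert 3 rows each.
table = [Row(w, i, f"w{w}-row{i}") for w in (1, 2) for i in range(3)]

# First run: no checkpoint yet, so everything is consumed.
first_batch, checkpoint = rows_since(table, 0)

# A later write (write_id 3, assumed for illustration) adds 2 rows.
table += [Row(3, i, f"w3-row{i}") for i in range(2)]

# Second run: only rows above the stored checkpoint are consumed.
second_batch, checkpoint = rows_since(table, checkpoint)
```

In this toy model the second run picks up only the two rows written under
write_id 3, even though their row_ids (0 and 1) are lower than row_ids already
consumed.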

Thanks,
Bhargav

On Fri, May 17, 2019 at 10:37 PM Alan Gates <alanfga...@gmail.com> wrote:

> Sorry, looks like you sent this earlier and I missed it.
>
> A couple of things.  One, write_id is per transaction per table.  So for
> table T, all rows written in w1 will have the same write_id, though they
> will each have their own monotonically increasing row_ids.  Row_ids are
> scoped by a write_id, so if both w1 and w2 insert 100 rows, w1 would have
> write_id 1, and row_ids 0-99 while w2's rows would have write_id 2 and
> row_ids 0-99.
>
> Two, if w1 and w2 both attempted to update or delete (not insert) records
> from the same partition of table T, then w1 would fail at commit time
> because it would see that w2 had already committed and there's a possible
> conflict.  This avoids lost updates and deleted records magically
> reappearing.
>
> Alan.
>
> On Fri, May 17, 2019 at 4:44 AM Bhargav Bipinchandra Naik (Seller
> Platform-BLR) <bhargav.n...@flipkart.com> wrote:
>
>> Is the following scenario supported?
>>
>> *timestamp:* t1 < t2 < t3 < t4 < t5 < t6
>>
>> *w1 -* transaction which updates a subset of rows in table T {start_time:
>> t1, end_time: t5}
>> *w2 -* transaction which updates a subset of rows in table T {start_time:
>> t2, end_time: t3}
>> *r1 -* job which reads rows from table T {start_time: t4}
>> *r2 -* job which reads rows from table T {start_time: t6}
>>
>> - Is the write_id a strictly increasing number across rows?
>> - Is the write_id a version number per row and not a global construct?
>> - Will the subset of rows updated by w1 have write_ids greater than the
>> write_ids of rows updated by w2?
>>
>> Say job r1 consumed the data at t4 and saw a maximum write_id of 100.
>> Will rows updated by w1 (end_time: t5) always have write_id > 100?
>>
>> Basically I need some kind of checkpoint using which the next run of the
>> read job can read only the data updated since the checkpoint.
>>
>> Thanks,
>> -Bhargav
>>
>>
>>
>>
>>
