Sorry, looks like you sent this earlier and I missed it. A couple of things. One, write_id is per transaction per table. So for table T, all rows written in w1 will have the same write_id, though each will have its own monotonically increasing row_id. Row_ids are scoped by a write_id, so if w1 and w2 each insert 100 rows, w1's rows would have write_id 1 and row_ids 0-99, while w2's rows would have write_id 2 and row_ids 0-99.
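To make that concrete, here is a minimal sketch. On a full ACID (transactional) table, recent Hive versions expose a virtual row__id column, a struct of writeid, bucketid, and rowid, so you can observe the numbering directly. The table, values, and exact write_ids below are illustrative assumptions, not promised output:

    -- hypothetical transactional table, single bucket for simplicity
    CREATE TABLE t (id INT, val STRING)
      STORED AS ORC
      TBLPROPERTIES ('transactional'='true');

    -- w1: one transaction; every row it writes to t shares one write_id
    INSERT INTO t VALUES (1, 'a'), (2, 'b');   -- say this commits as write_id 1

    -- w2: the next transaction to write t gets the next write_id
    INSERT INTO t VALUES (3, 'c'), (4, 'd');   -- say this commits as write_id 2

    -- row__id is a struct<writeid:bigint, bucketid:int, rowid:bigint>
    SELECT row__id.writeid, row__id.rowid, id, val FROM t;
    -- expected shape: w1's rows show writeid=1 with rowid 0,1;
    -- w2's rows show writeid=2 with rowid 0,1 (row_ids restart per write_id)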
Two, if w1 and w2 both attempted to update or delete (not insert) records in the same partition of table T, then w1 would fail at commit time, because it would see that w2 had already committed and there is a possible conflict. This avoids lost updates and deleted records magically reappearing. (There's a small sketch of this scenario below the quoted mail.)

Alan

On Fri, May 17, 2019 at 4:44 AM Bhargav Bipinchandra Naik (Seller Platform-BLR) <bhargav.n...@flipkart.com> wrote:
> Is the following scenario supported?
>
> timestamps: t1 < t2 < t3 < t4 < t5 < t6
>
> w1 - transaction which updates a subset of rows in table T {start_time: t1, end_time: t5}
> w2 - transaction which updates a subset of rows in table T {start_time: t2, end_time: t3}
> r1 - job which reads rows from table T {start_time: t4}
> r2 - job which reads rows from table T {start_time: t6}
>
> - Is the write_id a strictly increasing number across rows?
> - Is the write_id a version number per row, and not a global construct?
> - Will the subset of rows updated by w1 have write_ids greater than the write_ids of rows updated by w2?
>
> Say the read job r1 consumed the data at t4 and saw a maximum write_id of 100.
> Will rows updated by w1 (end_time: t5) always have write_id > 100?
>
> Basically I need some kind of checkpoint with which the next run of the read job can read only the data updated since the checkpoint.
>
> Thanks,
> -Bhargav
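P.S. Here is the sketch of the point-two conflict case, using the same illustrative table as above and two concurrent sessions. Which transaction loses depends only on commit order, and the exact error text varies by Hive version:

    -- both sessions update rows in the same partition of t
    -- (for an unpartitioned table like t, the whole table is the conflict unit)

    -- session 1 (w1): starts first, runs long, commits last
    UPDATE t SET val = 'x' WHERE id = 1;

    -- session 2 (w2): starts after w1 but commits first
    UPDATE t SET val = 'y' WHERE id = 1;   -- succeeds

    -- when w1 reaches its commit, the write-set check finds w2's committed
    -- update to the same partition and aborts w1 with a write-conflict
    -- error, rather than silently overwriting w2's change (a lost update)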