Hi Jesus and Alan,

Thanks for the prompt reply; I had a follow-up question. Is the following
scenario supported?

*timestamp:* t1 < t2 < t3 < t4 < t5 < t6
*w1 -* transaction which updates a subset of rows in table T {start_time: t1, end_time: t5}
*w2 -* transaction which updates a subset of rows in table T {start_time: t2, end_time: t3}
*r1 -* job which reads rows from table T {start_time: t4}
*r2 -* job which reads rows from table T {start_time: t6}

- Is the write_id a strictly increasing number across rows?
- Is the write_id a version number per row, rather than a global construct?
- Will the subset of rows updated by w1 have write_ids greater than the
write_ids of the rows updated by w2?

Say job r1 consumed the data at t4 and the maximum write_id it saw was 100.
Will the rows updated by transaction w1 (end_time: t5) always have
write_id > 100?

Basically I need some kind of checkpoint with which the next run of the
read job can read only the data updated since the previous run.

Thanks,
Bhargav

On Mon, May 6, 2019 at 11:39 PM Alan Gates <alanfga...@gmail.com> wrote:

> The other issue is that an external system has no ability to control when
> the compactor runs (it rewrites deltas into the base files and thus erases
> the intermediate states that would interest you). The mapping of write ids
> (table specific) to transaction ids (system wide) is also cleaned up
> intermittently, again erasing history. And there is no way to get the
> mapping from write ids to transaction ids from outside of Hive.
>
> Alan.
>
> On Mon, May 6, 2019 at 6:23 AM Bhargav Bipinchandra Naik (Seller
> Platform-BLR) <bhargav.n...@flipkart.com> wrote:
>
>> We have a scenario where we want to consume only delta updates from Hive
>> tables:
>> - multiple producers are updating data in the Hive table
>> - multiple consumers are reading data from the Hive table
>>
>> Consumption pattern:
>> - Get all data that has been updated since the last time I read.
>>
>> Is there any mechanism in Hive 3.0 which can enable the above
>> consumption pattern?
>>
>> I see there is a row__id (writeid, bucketid, rowid) construct in ACID
>> tables.
>> - Can row__id be used in this scenario?
>> - How is the "writeid" generated?
>> - Is there some meta information which captures the time when the rows
>> actually became visible for read?
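For concreteness, here is one way the timeline above could play out if write
ids are allocated while a transaction is still open rather than at commit
time (which matches my reading of Hive's ACID design; the concrete write_id
values are made up):

    t1: w1 begins and is allocated write_id 100 for T
    t2: w2 begins and is allocated write_id 101 for T
    t3: w2 commits; its rows carry write_id 101, but w1 is still open
    t4: r1 reads; its snapshot excludes the still-open write_id 100, so the
        maximum write_id it sees is 101, and it checkpoints 101
    t5: w1 commits; its rows carry write_id 100, below the checkpoint
    t6: r2 reads "write_id > 101" and silently skips w1's rows

Under that assumption, a plain high-water-mark checkpoint on write_id is not
safe by itself; a reader would also have to remember the lowest write_id that
was still open at read time, and, as Alan notes above, compaction and the
intermittent metadata cleanup can erase the history an external reader
depends on.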
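If you still want to prototype the pattern, below is a minimal,
non-authoritative sketch in Python using PyHive. The host, database, table
name "t", checkpoint file, and process() handler are all hypothetical, and
the sketch inherits the long-running-transaction problem described above.
The only Hive-specific piece is row__id, the virtual (writeid, bucketid,
rowid) struct column mentioned earlier in the thread.

    # Incremental-read sketch against a Hive 3 full ACID table.
    # Assumes PyHive is installed and HiveServer2 listens on hive-host:10000;
    # all names below are placeholders.
    from pyhive import hive

    CHECKPOINT_FILE = "t.writeid.checkpoint"

    def load_checkpoint():
        try:
            with open(CHECKPOINT_FILE) as f:
                return int(f.read().strip())
        except FileNotFoundError:
            return 0  # first run: read the whole table

    def save_checkpoint(writeid):
        with open(CHECKPOINT_FILE, "w") as f:
            f.write(str(writeid))

    def process(row):
        print(row)  # stand-in for the real downstream handler

    def read_delta():
        conn = hive.Connection(host="hive-host", port=10000, database="default")
        cur = conn.cursor()
        last = load_checkpoint()
        # CAVEAT: a plain high-water mark can permanently skip rows written
        # by a long-running transaction whose write_id predates the
        # checkpoint, as in the w1/w2 timeline above.
        cur.execute(
            "SELECT row__id.writeid, t.* FROM t "
            "WHERE row__id.writeid > %d" % last
        )
        max_seen = last
        for row in cur:
            max_seen = max(max_seen, row[0])
            process(row)
        save_checkpoint(max_seen)

    if __name__ == "__main__":
        read_delta()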