GitHub user bamaer added a comment to the discussion: Pseudo-CDC - polled 
pipeline runs?

Similar use cases are perfectly doable and afaik widely implemented. 

Two possible scenarios: 

*) A Table Input transform to get the last updated date/id/whatever from the 
target table. This query should return a single row, which is fed into a second 
Table Input transform that fetches everything from the source table with a 
where clause like `where date/id > ?`. The `?` takes the last date/id from the 
first Table Input via the `Insert data from transform` option. This will pull 
only the rows added or changed since the last run.
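
The two-step idea can be sketched outside Hop as well. Here is a minimal 
Python/sqlite3 illustration of the same high-water-mark pattern; the table and 
column names (`source_t`, `target_t`, `id`, `payload`) are made up for the 
example:

```python
# Hypothetical sketch of the incremental load: step 1 reads the
# high-water mark from the target, step 2 fetches only newer source rows.
import sqlite3

def incremental_load(conn):
    """Copy rows from source_t to target_t whose id is above the
    target's current maximum (the high-water mark)."""
    cur = conn.cursor()
    # Step 1: "table input" on the target -> a single high-water-mark row.
    cur.execute("SELECT COALESCE(MAX(id), 0) FROM target_t")
    (last_id,) = cur.fetchone()
    # Step 2: second "table input" with the parameterised where clause;
    # the ? is filled from the first query's result.
    cur.execute("SELECT id, payload FROM source_t WHERE id > ?", (last_id,))
    new_rows = cur.fetchall()
    cur.executemany("INSERT INTO target_t (id, payload) VALUES (?, ?)",
                    new_rows)
    conn.commit()
    return len(new_rows)

# Demo with an in-memory database: target already holds row 1,
# so only rows 2 and 3 are copied.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_t (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE target_t (id INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO source_t VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO target_t VALUES (1, 'a');
""")
print(incremental_load(conn))  # prints 2
```

Running it a second time copies nothing, because the high-water mark has moved 
up.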

*) For smaller tables or files: copy the "old" version (last day, last hour) of 
the data to a separate table. With that old table/file in place, use a Merge 
Rows (Diff) transform to compare the old version of the data to the latest 
version on the date/id. This gives you a flag field marking each row as new, 
identical, changed or deleted. That flag field can then be processed with your 
own logic or handed to a "Synchronize after merge" transform. 
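
For intuition, here is a rough Python equivalent of what the diff step 
produces: two snapshots compared on a key, with one flag per row. The flag 
names mirror the transform's output; the row layout is illustrative:

```python
# Hypothetical re-implementation of the "merge rows diff" idea: compare
# an old and a new snapshot on a key field and flag every row as
# new / identical / changed / deleted.
def merge_rows_diff(old_rows, new_rows, key="id"):
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    flags = {}
    for k in old.keys() | new.keys():
        if k not in old:
            flags[k] = "new"          # key only in the latest snapshot
        elif k not in new:
            flags[k] = "deleted"      # key vanished from the source
        elif old[k] == new[k]:
            flags[k] = "identical"    # nothing to do
        else:
            flags[k] = "changed"      # same key, different values
    return flags

old = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
new = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(merge_rows_diff(old, new))  # 1 deleted, 2 changed, 3 new
```

Downstream, each flag maps to an action: insert for new, update for changed, 
delete (or soft-delete) for deleted, skip for identical.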

If you want to run this very frequently, or if there's a lot of data to 
process, you could add a watchdog pattern: write a status file or add a row to 
a database table. If that status file or row shows an `active` status, your 
workflow skips the run; if there's no active process, it starts the syncing 
pipeline.
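
The status-file variant of that watchdog can be sketched in a few lines of 
Python. The file name and the sync body are illustrative; `open(..., "x")` is 
used so that claiming the marker fails atomically if another run got there 
first:

```python
# Hypothetical status-file watchdog: skip the run if a previous sync is
# still marked active, otherwise claim the marker and clean it up after.
import os

STATUS_FILE = "sync.active"  # assumed marker file name

def run_sync_guarded(sync):
    if os.path.exists(STATUS_FILE):
        return "skipped: previous run still active"
    try:
        # "x" mode fails if the file appeared between the check above
        # and this open, closing the check-then-act race.
        with open(STATUS_FILE, "x") as f:
            f.write("active")
    except FileExistsError:
        return "skipped: previous run still active"
    try:
        sync()  # the actual pipeline run would go here
        return "synced"
    finally:
        os.remove(STATUS_FILE)

print(run_sync_guarded(lambda: None))  # prints "synced"
```

The database-row variant works the same way, with an `insert ... active` row 
playing the role of the file.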

GitHub link: 
https://github.com/apache/hop/discussions/5134#discussioncomment-12732670
