Re: [D] Pseudo-CDC - polled pipeline runs? (hop)

via GitHub Fri, 04 Apr 2025 18:54:48 -0700


GitHub user casesolved-co-uk edited a discussion: Pseudo-CDC - polled pipeline 
runs?


In some circumstances it may not be worth the complexity of installing 
Debezium and Kafka for proper CDC. For small data volumes, for example, it may 
be sufficient to do pseudo-CDC, i.e. polled pipeline runs, e.g. every minute.

Consider this:

- Many tables with a common `modified_at` datetime field (assuming it has 
sufficient resolution that values do not collide; it could also be an integer, 
a unique primary key, etc., as long as it is comparable)
- A Hop configuration parameter `synced_to` holding a datetime
- A fetch size
- A poll interval

Then, repeatedly:

    SELECT * FROM sometable
    WHERE modified_at > ${synced_to}
    ORDER BY modified_at ASC
    LIMIT ${fetch_size}

After each run the `synced_to` parameter is updated with the last `modified_at` 
result retrieved.

If len(result) == `fetch_size`, more rows may be waiting, so the pipeline is 
repeated immediately.
Otherwise the next run is scheduled after the `poll interval`.
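
To make the intended control flow concrete, here is a minimal sketch of the 
loop outside Hop, in Python with an SQLite connection. It is only an 
illustration under assumptions, not a Hop feature: the function names 
(`poll_once`, `drain`, `run_forever`), the `handle` callback, and the `id` 
column are hypothetical, and `synced_to` is kept in memory, whereas a real 
setup would need to persist it between runs.

```python
import sqlite3
import time


def poll_once(conn, table, synced_to, fetch_size):
    """Fetch up to fetch_size rows modified after synced_to, oldest first."""
    # The table name is interpolated, so it must come from trusted config.
    # The `id` column is an assumption made for this sketch.
    rows = conn.execute(
        f"SELECT id, modified_at FROM {table} "
        "WHERE modified_at > ? ORDER BY modified_at ASC LIMIT ?",
        (synced_to, fetch_size),
    ).fetchall()
    if rows:
        # Advance the watermark to the last modified_at retrieved.
        synced_to = rows[-1][1]
    return rows, synced_to


def drain(conn, table, synced_to, fetch_size, handle):
    """Repeat immediately while batches come back full; return the watermark."""
    while True:
        rows, synced_to = poll_once(conn, table, synced_to, fetch_size)
        for row in rows:
            handle(row)
        if len(rows) < fetch_size:
            # Short batch: caught up, so the caller can wait for the interval.
            return synced_to


def run_forever(conn, table, synced_to, fetch_size, poll_interval, handle):
    """Drain the backlog, sleep for poll_interval seconds, and repeat."""
    while True:
        synced_to = drain(conn, table, synced_to, fetch_size, handle)
        time.sleep(poll_interval)
```

Because the comparison is a strict `>`, rows that share the same `modified_at` 
value as the last row of a full batch would be skipped, which is why the 
field needs enough resolution (or a unique key) as noted above.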

Can Hop do that, maybe with a workflow?

GitHub link: https://github.com/apache/hop/discussions/5134

----
This is an automatically sent email for users@hop.apache.org.
To unsubscribe, please send an email to: users-unsubscr...@hop.apache.org
