In our Samza app, we need to read reference data from MySQL and join it
with a stream. So the requirements are:
* Read data into each Samza task before processing any message.
* The Samza task should be able to listen to updates happening in MySQL.
I did some research, scanning through relevant conversations and JIRAs in
the community, but have not found a solution yet, nor a recommended way to
do this.
If my data stream comes from a topic called *topicD*, the options in my
mind are:
- Use Kafka
   1. Use a CDC-based solution to replicate the data in MySQL to a Kafka
   topic (see https://github.com/wushujames/mysql-cdc-projects/wiki).
   Say the topic is called *topicR*.
   2. In my Samza app, read the reference table from *topicR* and persist
   it in a cache in each Samza task's local storage.
      - If the data in *topicR* is NOT partitioned the same way as
      *topicD*, can we configure each individual Samza task to read
      from all partitions of a topic?
      - If the answer to the above question is no, do I need to create
      *topicR* with the same number of partitions as *topicD* and
      replicate the data to all partitions?
      - On startup, how do I make each Samza task block on processing the
      first message from *topicD* until it has read all data from *topicR*?
   3. Any new updates/deletes in *topicR* will be consumed to update the
   local cache of each Samza task.
   4. On failures or restarts, each Samza task will read *topicR* from the
   beginning.
- Do not use Kafka
   - Each Samza task reads a snapshot of the database and builds its local
   cache, then periodically re-reads the database to refresh that cache. I
   have read a few blogs about this, and it doesn't sound like a solid
   approach in the long term.
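For the Kafka-based option, here is a rough config sketch of what I think the relevant Samza settings would look like, assuming a Samza version with bootstrap-stream and broadcast-stream support (0.10+); the topic names *topicD*/*topicR* and the store name `ref-store` are just the placeholders from above:

```properties
# Consume both the data stream and the replicated reference topic.
task.inputs=kafka.topicD,kafka.topicR

# Broadcast: deliver the listed topicR partitions to every task, so each
# task sees the full reference table even if topicR and topicD are
# partitioned differently (adjust the partition range to the topic).
task.broadcast.inputs=kafka.topicR#[0-3]

# Bootstrap: block processing of topicD until topicR is fully caught up
# at container start, and always re-read topicR from the beginning on
# restart.
systems.kafka.streams.topicR.samza.bootstrap=true
systems.kafka.streams.topicR.samza.reset.offset=true
systems.kafka.streams.topicR.samza.offset.default=oldest

# A local RocksDB store to hold the cached reference table (serdes would
# need to be defined elsewhere in the config).
stores.ref-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.ref-store.key.serde=string
stores.ref-store.msg.serde=string
```

If that reading of the bootstrap/broadcast features is right, it would cover the "read all partitions", "block on startup", and "re-read on restart" questions in one place, but I'd appreciate confirmation.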
Any thoughts?
Chen
--
Chen Song