Hello guys,
I've been struggling with this for a few days now, without success, so I
would greatly appreciate any enlightenment. The simplified scenario is the
following:
- I've got 2 topics in Kafka (it's already like that in production, I can't
change it):
   - transaction-created
   - transaction-processed
- Even though the schemas are not exactly the same, both share a
correlation_id, which is their "transaction_id"
So, long story short, I've got 2 consumers, one for each topic, and all I
want to do is sink them to Postgres in a chained order. I'm writing them
with Spark Structured Streaming, by the way.
So far so good; the caveats here are:
- I cannot write a given "processed" transaction unless there is already an
entry of that same transaction with the status "created".
- There is no guarantee that a transaction in the "transaction-processed"
topic has a match (same transaction_id) in "transaction-created" at the
moment the messages are fetched.
So the workflow so far is:
- Messages from "transaction-created" just get sunk to Postgres, no
questions asked
- As for "transaction-processed", it goes as follows (sketched below):
   - a) Messages are fetched from the Kafka topic
   - b) The transaction_ids of those messages are selected
   - c) All rows with a matching transaction_id AND the status "CREATED" are
   fetched from a Postgres table
   - d) Then I pretty much do an intersection between the two datasets, and
   sink only the "processed" ones that have a match from step c)
   - e) The resulting dataset is persisted
But the rows (from the "processed" topic) that were not part of the
intersection get lost afterwards...
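To make that concrete, here is a minimal sketch of the "processed" side with
foreachBatch; the broker address, JDBC URL, table names and payload schema
below are placeholders, not the real ones:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("transaction-processed-sink").getOrCreate()
import spark.implicits._

// Assumed payload shape -- the real schema has more fields
val payloadSchema = new StructType()
  .add("transaction_id", StringType)
  .add("status", StringType)

// a) fetch messages from the Kafka topic
val processed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")            // placeholder
  .option("subscribe", "transaction-processed")
  .load()
  .select(from_json($"value".cast("string"), payloadSchema).as("tx"))
  .select("tx.*")

processed.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // c) read the rows already persisted by the "created" consumer
    //    (the real job restricts this read to the batch's transaction_ids)
    val created = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db:5432/app")          // placeholder
      .option("dbtable", "transactions")                       // placeholder
      .load()
      .where(col("status") === "CREATED")

    // d) keep only the "processed" messages that have a CREATED counterpart
    val matched = batch.join(created, Seq("transaction_id"), "left_semi")

    // e) persist; the unmatched rows are silently dropped, and once this
    //    batch finishes the checkpoint advances past their offsets
    matched.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db:5432/app")          // placeholder
      .option("dbtable", "transactions")                       // placeholder
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/checkpoints/transaction-processed") // placeholder
  .start()
```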
So my questions are:
- Is there ANY way to reprocess/replay them at all WITHOUT restarting the
app?
- For this scenario, should I fall back to Spark Streaming (DStreams)
instead of Structured Streaming?
PS: I was playing around with legacy Spark Streaming (DStreams) and managed
to commit offsets only for the micro-batches that were fully successful,
roughly as sketched below (I still haven't found a way to "poll" for the
uncommitted ones without restarting, though).
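For reference, this is roughly that pattern, using the
spark-streaming-kafka-0-10 integration; the broker, group id and batch
interval are placeholders, and the actual sink logic is elided:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("transaction-processed-dstream")
val ssc = new StreamingContext(conf, Seconds(10))              // placeholder interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",                        // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "transaction-processed-sink",                  // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("transaction-processed"), kafkaParams)
)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... join against Postgres and sink the matched records here ...
  // Offsets are committed only if the whole micro-batch got this far;
  // otherwise they stay uncommitted, but nothing re-reads them until
  // the app restarts.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()
```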
Thank you very much in advance!