Hi,

I am trying to solve this problem: in my streaming flow, a few jobs fail
every day for a few batches due to mostly unavoidable reasons (say, Kafka
cluster maintenance), and then resume to success.
I want to reprocess those failed jobs programmatically (assume I have a way
of getting the start/end offsets of the Kafka topics for the failed jobs).
I was thinking of these options:
1) Somehow pause the streaming job when it detects failing jobs - this
seems not possible.
2) From the driver, run additional processing that checks every few minutes
via the driver REST API (/api/v1/applications...) which jobs have failed,
and submit batch jobs for those failed jobs.

1 - doesn't seem to be possible, and I don't want to kill the streaming
context just to stop the job for a few failing batches and resume it after
a few minutes.
2 - seems like a viable option, but a little complicated, since even the
batch job can fail for whatever reason, and then I am back to tracking that
separately, etc.
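To make option 2 concrete, here is a rough sketch of what I had in mind. The polling loop and the submit_batch hook are hypothetical placeholders (the actual resubmission would look up the saved Kafka offsets, which is application-specific); only the /api/v1/applications/<app-id>/jobs endpoint and its "jobId"/"status" fields come from the Spark monitoring REST API:

```python
import json

def failed_job_ids(jobs_payload):
    """Return the jobIds with status FAILED from a /jobs API response.

    jobs_payload may be the raw JSON string from the REST API or the
    already-parsed list of job objects.
    """
    jobs = (json.loads(jobs_payload)
            if isinstance(jobs_payload, str) else jobs_payload)
    return [job["jobId"] for job in jobs if job.get("status") == "FAILED"]

def poll_and_resubmit(fetch_jobs, submit_batch, seen=None):
    """One polling pass: resubmit any newly failed jobs.

    fetch_jobs   - callable returning the parsed /jobs list (in practice,
                   an HTTP GET against the driver REST API)
    submit_batch - callable taking a jobId; it would map the jobId back to
                   Kafka start/end offsets and launch a batch job
    seen         - set of jobIds already resubmitted, so a job is only
                   reprocessed once across polling passes
    """
    seen = set() if seen is None else seen
    for job_id in failed_job_ids(fetch_jobs()):
        if job_id not in seen:
            submit_batch(job_id)
            seen.add(job_id)
    return seen
```

For example, with a made-up payload of three jobs where jobs 2 and 3 failed, failed_job_ids returns [2, 3] and poll_and_resubmit hands exactly those two ids to submit_batch. The "seen" set is my (crude) answer to the double-submission problem, but as I said, it doesn't cover the case where the batch job itself fails.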

Has anyone faced this issue, or does anyone have any suggestions?

Thanks,
KP
