Hi, I am trying to solve this problem: in my streaming flow, a few jobs fail every day for a few batches (for mostly unavoidable reasons, e.g. Kafka cluster maintenance) and then resume successfully. I want to reprocess those failed batches programmatically (assume I have a way of getting the start and end Kafka offsets for the topics of the failed jobs). I was considering these options:

1) Somehow pause the streaming job when it detects failing batches - this does not seem to be possible.
2) From the driver, run an additional thread that checks every few minutes, via the driver REST API (/api/v1/applications...), which jobs have failed, and submit batch jobs for those failed offset ranges - a rough polling sketch is below.
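Something like this is what I had in mind for the polling piece (a minimal sketch, assuming the driver UI is reachable on the default port 4040; submit_reprocess_job is a hypothetical placeholder for however you map a failed job back to its Kafka offsets and kick off the batch job):

```python
import time
import requests

DRIVER_API = "http://localhost:4040/api/v1"  # assumption: default driver UI port

def submit_reprocess_job(job):
    # hypothetical placeholder: look up the Kafka offset range for this
    # failed job/batch and spark-submit a batch job for it (see P.S. below)
    print("would reprocess failed job", job["jobId"], job.get("name"))

def main():
    # Spark's monitoring REST API lists applications and their jobs;
    # ?status=failed filters the job list server-side
    app_id = requests.get(f"{DRIVER_API}/applications", timeout=10).json()[0]["id"]
    seen = set()
    while True:
        jobs = requests.get(f"{DRIVER_API}/applications/{app_id}/jobs",
                            params={"status": "failed"}, timeout=10).json()
        for job in jobs:
            if job["jobId"] not in seen:  # only act on newly failed jobs
                seen.add(job["jobId"])
                submit_reprocess_job(job)
        time.sleep(300)  # check every few minutes

if __name__ == "__main__":
    main()
```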
Option 1 doesn't seem to be possible, and I don't want to kill the streaming context just to stop the job for a few minutes because of a few failing batches and then resume. Option 2 seems viable, but a little complicated, since even the reprocessing batch job can fail for whatever reason, and then I am back to tracking that separately, etc. Has anyone faced this issue, or does anyone have suggestions? Thanks, KP
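P.S. For concreteness, the reprocessing batch job I would submit for a failed offset range could look roughly like this (a sketch, assuming the Structured Streaming Kafka source in batch mode; the topic name, brokers, offsets, and output path are all made up):

```python
import json
from pyspark.sql import SparkSession

# needs the spark-sql-kafka connector on the classpath
spark = SparkSession.builder.appName("kafka-reprocess").getOrCreate()

# start/end offsets per topic-partition recovered for the failed batch
# (values here are illustrative)
starting = json.dumps({"events": {"0": 42000}})
ending = json.dumps({"events": {"0": 43000}})

df = (spark.read.format("kafka")  # bounded batch read, not a stream
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("startingOffsets", starting)
      .option("endingOffsets", ending)
      .load())

# re-apply the same transformation the streaming job runs, then write out
(df.selectExpr("CAST(value AS STRING) AS value")
   .write.mode("append")
   .parquet("/data/reprocessed/events"))
```

(If the pipeline is on the old DStream API instead, KafkaUtils.createRDD with explicit OffsetRanges would be the equivalent bounded read.)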