Hi,

I'd like to present an idea about generalizing the legacy streaming API. The streaming API assumes an equi-frequent micro-batch model: streaming data are collected into a batch, and jobs are submitted, every fixed interval of time (the batchDuration). This model could be extended so that, instead of generating a batch every batchDuration, batch generation is event-based: Spark listens to event sources and generates batches upon events. The equi-frequent micro-batch model then becomes the special case of a timer event source that fires a timer event every batchDuration.
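As a rough illustration of that equivalence, here is a minimal, self-contained sketch in plain Scala (no Spark dependencies; the names EventSource, TimerEventSource and onEvent are just for the sake of the sketch, not the actual API — see the doc linked below for the real design):

    import scala.concurrent.duration._

    // Anything that can signal "generate a batch now".
    trait EventSource {
      def start(onEvent: Long => Unit): Unit // onEvent receives an event timestamp
      def stop(): Unit
    }

    // The legacy equi-frequent micro-batch model as a special case:
    // a timer source that fires one event every batchDuration.
    class TimerEventSource(batchDuration: FiniteDuration) extends EventSource {
      @volatile private var running = false

      override def start(onEvent: Long => Unit): Unit = {
        running = true
        val timer = new Thread(new Runnable {
          override def run(): Unit =
            while (running) {
              Thread.sleep(batchDuration.toMillis)
              onEvent(System.currentTimeMillis()) // one batch per interval
            }
        })
        timer.setDaemon(true)
        timer.start()
      }

      override def stop(): Unit = running = false
    }

    // Usage: new TimerEventSource(2.seconds).start(t => println(s"batch at $t"))

Any other event source (a file arrival, a message on a queue, a REST call, ...) could plug into the same trait and trigger batches the same way.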
This allows fine-grained scheduling of Spark jobs, and the same code could run as either a streaming or a batch job. Under this model, a batch job could easily be configured to run periodically, with no need to deploy or configure an external scheduler (like Apache Oozie or Linux cron). The model also naturally expresses jobs with dependencies that span across time, like daily log transformations concluded by weekly aggregations (see the sketch in the P.S. below).

Please find more details at https://github.com/mashin-io/rich-spark/blob/eventum-master/docs/reactive-spark-doc.md

I'd like to discuss how worthwhile this idea might be, especially given the structured streaming API. Looking forward to your feedback.

Regards,
Ahmed Mahran
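P.S. A minimal, self-contained simulation of the cross-time dependency idea (plain Scala, no Spark; all names are illustrative, not the actual API). A daily event triggers a per-day transformation, and a weekly event triggers an aggregation that depends on the seven preceding daily outputs:

    object DailyThenWeeklySketch {

      sealed trait Event
      case class DailyTick(day: Int)   extends Event
      case class WeeklyTick(week: Int) extends Event

      def main(args: Array[String]): Unit = {
        val dailyOutputs = scala.collection.mutable.ArrayBuffer.empty[Long]

        // Simulated event timeline: seven daily ticks, then one weekly tick.
        val timeline: Seq[Event] = (1 to 7).map(DailyTick) :+ WeeklyTick(1)

        timeline.foreach {
          case DailyTick(day) =>
            // Stand-in for a daily Spark job transforming that day's logs.
            val transformed = day.toLong * 100
            dailyOutputs += transformed
            println(s"day $day -> $transformed")

          case WeeklyTick(week) =>
            // The weekly job depends on the last seven daily outputs.
            val aggregate = dailyOutputs.takeRight(7).sum
            println(s"week $week aggregate -> $aggregate")
        }
      }
    }

The point of the sketch is that both jobs hang off the same event timeline, so the weekly job's dependency on the daily outputs is expressed inside Spark rather than in an external scheduler.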