Hi, I have a similar-sounding use case and just yesterday was experimenting with this approach:
Use two separate streams: one for model events, one for data events. Connect the two streams, key the resulting stream, and then use a RichCoFlatMapFunction to ensure that each data event is enriched with the latest model. Also, as soon as a new model arrives, re-emit all previously seen events with this new model. This involves keeping both events and models in state. My emitted enriched events have a command-like syntax (add/remove) so that downstream operators can remove/add as necessary depending on the calculations (so for each model change I would emit an add/remove pair of enriched events).

As I say, I have only been experimenting with this since yesterday; perhaps someone a bit more experienced with Flink might spot some problems with this approach, which I would definitely be interested in hearing.

Kat

On Wed, Mar 15, 2017 at 2:20 PM, Theodore Vasiloudis <theodoros.vasilou...@gmail.com> wrote:
> Hello all,
>
> I've started thinking about online learning in Flink, and one of the issues that has come up in other frameworks is the ability to prioritize "control" events over "data" events in iterations.
>
> To set an example, say we develop an ML model that ingests events in parallel, performs an aggregation to update the model, and then broadcasts the updated model back through an iteration/back edge. Using the above nomenclature, the events being ingested would be "data" events, and the model update would be a "control" event.
>
> I talked about this scenario a bit with a couple of people (Paris and Gianmarco), and one thing we would like to have is the ability to prioritize the ingestion of control events over the data events.
>
> If my understanding is correct, there is currently a buffer/queue of events waiting to be processed for each operator, and each incoming event ends up at the end of that queue.
> If our data source is fast and the model updates are slow, a lot of data events might be buffered/scheduled to be processed before each model update, because of the speed difference between the two streams. But we would like to update the model that is used to process data events as soon as the newest version becomes available.
>
> Is it somehow possible to make the control events "jump" the queue and be processed as soon as they arrive, ahead of the data events?
>
> Regards,
> Theodore
>
> P.S. This is still very much a theoretical problem; I haven't looked at how such a pipeline would be implemented in Flink.
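To make the enrichment scheme from the reply above concrete, here is a framework-free sketch of the core logic, with plain dicts standing in for Flink's keyed ValueState/ListState. The class and method names (ModelEnricher, on_data, on_model) are illustrative assumptions, not Flink API; in Flink, on_data/on_model would correspond to flatMap1/flatMap2 of a RichCoFlatMapFunction on a connected, keyed stream.

```python
# Sketch of the enrichment logic described above (assumed names, not Flink API).
# In Flink this would be a RichCoFlatMapFunction over ConnectedStreams, with
# the latest model in ValueState and seen events in ListState; here plain
# dicts keyed by the stream key stand in for that keyed state.

class ModelEnricher:
    def __init__(self):
        self.model = {}   # key -> latest model (ValueState stand-in)
        self.events = {}  # key -> previously seen data events (ListState stand-in)

    def on_data(self, key, event):
        """flatMap1: buffer the event; if a model exists, emit an 'add' command."""
        self.events.setdefault(key, []).append(event)
        if key in self.model:
            return [("add", event, self.model[key])]
        return []

    def on_model(self, key, model):
        """flatMap2: for each buffered event, retract the old enrichment and
        re-emit with the new model, i.e. the add/remove command pair."""
        out = []
        old = self.model.get(key)
        for event in self.events.get(key, []):
            if old is not None:
                out.append(("remove", event, old))
            out.append(("add", event, model))
        self.model[key] = model
        return out

enricher = ModelEnricher()
enricher.on_data("k", "e1")          # no model yet: nothing emitted
print(enricher.on_model("k", "m1"))  # [('add', 'e1', 'm1')]
print(enricher.on_data("k", "e2"))   # [('add', 'e2', 'm1')]
print(enricher.on_model("k", "m2"))  # remove/add pairs for e1 and e2
```

Note that buffering all previously seen events in state grows without bound; a real job would likely need a retention window or some other state-cleanup policy.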