Hello all,

I've started thinking about online learning in Flink and one of the issues
that has come
up in other frameworks is the ability to prioritize "control" over "data"
events in iterations.

To give an example, say we develop an ML model that ingests events in
parallel, performs an aggregation to update the model, and then
broadcasts the updated model back through an iteration/back edge. Using
the above nomenclature, the events being ingested would be "data"
events, and the model update would be a "control" event.

I talked about this scenario a bit with a couple of people (Paris and
Gianmarco), and one thing we would like to have is the ability to
prioritize the ingestion of control events over data events.

If my understanding is correct, each operator currently has a
buffer/queue of events waiting to be processed, and each incoming event
is appended to the end of that queue.

If our data source is fast and the model updates are slow, a lot of data
events might be buffered/scheduled to be processed before each model
update, because of the speed difference between the two streams. But we
would like the model that is used to process data events to be updated
as soon as the newest version becomes available.

Is it somehow possible to make the control events "jump" the queue, so
that they are processed ahead of any buffered data events as soon as
they arrive?

Regards,
Theodore

P.S. This is still very much a theoretical problem, I haven't looked at how
such a pipeline would
be implemented in Flink.
