I am running into a problem when processing the past 7 days of data from
multiple streams.  I am trying to union the streams based on event
timestamp.

The problem is that there are streams are significant big than other
streams. For example if one stream has 1,000 event/sec and the other stream
has 1,000,000 event/sec.

I am using a PrirotyQueue to sort the event based on event timestamp. Since
the fast(smaller) streams watermarks moves much faster than the
slow(bigger) streams, there are lots of events  from the faster streams
ended up in the Queue waiting for the slower stream to catch up and
eventually ran out of memory.

Is there anyway we can send back pressure on the fast streams so they can
slow down?  or somehow to coordinate the watermarks between all the streams?

I am planning to use an external storage to tracking the low watermarks
between all the streams. so we don't read the event we cannot handle into
the PriorityQueue.

Any better suggestions?

Thanks,
Tao

Reply via email to