Hi Vishal,
your assumptions sound reasonable to me. The community is currently
working on more fine-grained back pressure with credit-based flow
control. It is on the roadmap for 1.5 [1]/[2]. I will loop in Nico, who
might be able to tell you more about the details. Until then, I guess you
have to implement a custom source (or adapt an existing one) to let the
data flow in more realistically.
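For what it's worth, a pacing loop along those lines could look like the sketch below. This is plain Java, not the actual Flink SourceFunction API; the class and method names are made up for illustration. It replays stored events with sleeps proportional to the event-time gaps, so a 7-day topic is consumed at a realistic rate instead of as fast as the fetchers can pull:

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch (hypothetical names, not Flink code): replay records pacing
// emission by event-time gaps. In a real job this loop would live inside
// a custom source's run() method.
public class PacedReplay {

    public record Event(long eventTimeMs, String payload) {}

    // speedup = 1.0 replays in real time; 10.0 replays ten times faster.
    public static void replay(List<Event> events, double speedup,
                              Consumer<Event> emit) {
        Long prevEventTime = null;
        for (Event e : events) {
            if (prevEventTime != null) {
                // Sleep for the (scaled) event-time gap between records.
                long gapMs = (long) ((e.eventTimeMs() - prevEventTime) / speedup);
                if (gapMs > 0) {
                    try {
                        Thread.sleep(gapMs);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
            prevEventTime = e.eventTimeMs();
            emit.accept(e);
        }
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(0, "a"), new Event(100, "b"), new Event(250, "c"));
        // 10x speedup: 250 ms of event time replayed in roughly 25 ms.
        replay(events, 10.0, e -> System.out.println(e.payload()));
    }
}
```

The same idea works per partition, which also keeps the per-partition watermarks from drifting apart as far as they do in an unpaced replay.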
Regards,
Timo
[1]
http://flink.apache.org/news/2017/11/22/release-1.4-and-1.5-timeline.html
[2] https://www.youtube.com/watch?v=scStdhz9FHc
On 1/2/18 at 4:02 PM, Vishal Santoshi wrote:
I ran a simulation with session windows (in two modes) and let it run
for about 12 hours:
1. Replay, with a Kafka topic with a 7-day retention as the source
(earliest)
2. Starting the pipe with the Kafka source at latest
I saw results that differed dramatically.
On replay the pipeline stalled after a good ramp-up, while in the second
case the pipeline hummed along without issues. For the same time period,
significantly more data was consumed in the second case; in the first
case the WM progression stalled with no hint of resolution (the incoming
data on the source topic far outstripped the WM progression). I think I
know the reason, and this is my hypothesis.
In replay mode the number of open windows has no upper bound. While
buffer exhaustion (together with the data in flight relative to the
watermark) is what eventually throttles consumption, it does not really
limit the number of open windows, and in fact creates windows for
futuristic data (future relative to the current WM). So if partition x
has data at watermark time t(x) and partition y at watermark time t(y),
with t(x) << t(y), the overall watermark is t(x), yet nothing
significantly throttles consumption from partition y (nor from x, in
fact). AFAIK the bounded-buffer approach does not give the fine-grained
control one would hope for, which implies there are far more open
windows than the system can handle. That leads to the pathological case
where the buffers fill up (I believe far too late) and throttling kicks
in, but the WM does not advance, and the windows whose firing could
ease the glut cannot proceed. In replay mode the sheer amount of data
means the Fetchers keep pulling at the maximum rate the open-ended
buffer approach allows.
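To make the skew concrete, here is a toy model of the effect (hypothetical names, plain Java, not Flink code): the operator watermark is the minimum of the per-partition watermarks, so every window the fast partition opens beyond that minimum sits there unfireable until the slow partition catches up.

```java
// Toy model of the described skew: overall WM = min of per-partition WMs,
// so windows opened ahead of it by the fast partition cannot fire.
public class WatermarkSkew {

    // Count windows (one per windowStepMs of event time) whose start lies
    // beyond the overall watermark and which therefore stay open.
    public static int unfireableWindows(long wmSlow, long wmFast,
                                        long windowStepMs) {
        long overallWm = Math.min(wmSlow, wmFast);
        int open = 0;
        for (long t = 0; t <= Math.max(wmSlow, wmFast); t += windowStepMs) {
            if (t > overallWm) {
                open++;
            }
        }
        return open;
    }

    public static void main(String[] args) {
        long wmX = 1_000;    // slow partition: t(x)
        long wmY = 500_000;  // fast partition: t(y), with t(x) << t(y)
        System.out.println("overall WM = " + Math.min(wmX, wmY));
        System.out.println("unfireable windows = "
                + unfireableWindows(wmX, wmY, 10_000)); // prints 50
    }
}
```

The unbounded part is that nothing stops wmY from racing further ahead, so the count of open windows grows with the skew, not with the buffer size.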
My question thus is: is there any way to get finer control over back
pressure, where consumption from a source is throttled preemptively
(for example by decreasing the buffers associated with a pipe, or the
size allocated to them), or via sleeps in the Fetcher code, so that
performance can be aligned with real-time consumption characteristics?
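As a strawman for what such preemptive throttling could mean (purely hypothetical, not an existing Flink API), a check like the following in the fetch loop would pause a partition whose local event time runs too far ahead of the global watermark:

```java
// Hypothetical fetcher-side throttle: pause polling a partition whose
// local event time is more than MAX_AHEAD_MS ahead of the global WM.
public class AlignedFetch {

    static final long MAX_AHEAD_MS = 60_000;  // allowed event-time lead

    // Returns how long the fetch loop should sleep before polling this
    // partition again (0 means keep fetching), capped at 1 s per check.
    public static long throttleMs(long partitionEventTime,
                                  long globalWatermark) {
        long ahead = partitionEventTime - globalWatermark;
        return ahead > MAX_AHEAD_MS
                ? Math.min(ahead - MAX_AHEAD_MS, 1_000)
                : 0;
    }

    public static void main(String[] args) {
        System.out.println(throttleMs(500_000, 1_000));  // far ahead -> 1000
        System.out.println(throttleMs(10_000, 9_000));   // within bound -> 0
    }
}
```

That is roughly the shape of throttle the WM-skew scenario above seems to call for: it bounds how far any partition can run ahead of the overall watermark, rather than waiting for network buffers to fill.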
Regards,
Vishal.