Hi Morgan,

Thanks for your reply.

I think the only reliable way to determine this limit is load testing; in the end, that is what load testing is for. Beyond that, I can only suggest testing parts of the system separately to learn their individual limits (e.g. IO, CPU). Ideally, this should be done on a regular basis.

Hope this helps.

Regards,
Roman

On Tue, Feb 25, 2020 at 2:47 PM Morgan Geldenhuys <[email protected]> wrote:

> Hi Roman,
>
> Thank you for the reply.
>
> Yes, I am aware that backpressure can be the result of many factors, and
> yes, this is an oversimplification of something very complex; please bear
> with me. Let's assume that this has been taken into account and has lowered
> the threshold for when this status permanently comes into effect, i.e. HIGH.
>
> Example: The system is running along perfectly fine under normal
> conditions, accessing external sources, and processing at an average of
> 100,000 messages/sec. Let's assume the maximum capacity is around 130,000
> messages/sec before backpressure starts propagating back up the stream.
> Therefore, utilization is at about 0.77 (100K/130K). Great, but at
> present we don't know that 130,000 is the limit.
>
> For this example, or for any job, is there a way of finding this maximum
> capacity (and hence the utilization) from the current throughput, without
> pushing the system to its limit? Possibly by measuring (as you say)
> the saturation of certain buffers (I am looking into this now; however, I
> am not too familiar with Flink internals)? It doesn't have to be extremely
> precise. Any hints would be greatly appreciated.
>
> Regards,
> M.
>
> On 25.02.20 13:34, Khachatryan Roman wrote:
>
> Hi Morgan,
>
> Regarding backpressure, it can be caused by a number of factors, e.g.
> writing to an external system or slow input partitions.
>
> However, if you know that a particular resource is a bottleneck, then it
> makes sense to monitor its saturation.
> This can be done using Flink metrics. Please see the documentation for
> more details:
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html
>
> Regards,
> Roman
>
> On Tue, Feb 25, 2020 at 12:33 PM Morgan Geldenhuys <[email protected]> wrote:
>
>> Hello community,
>>
>> I am fairly new to Flink and have a question concerning utilization. I
>> was hoping someone could help.
>>
>> As I understand it, backpressure is essentially the point at which
>> utilization has reached 100% for a particular streaming pipeline, meaning
>> that the application cannot "keep up" with the messages coming into the
>> system.
>>
>> I was wondering: assuming a fairly stable input throughput, is there a
>> way of determining the average utilization as a percentage? Can we
>> determine from metrics alone how much more capacity each operator has
>> before backpressure kicks in, e.g. that it is at 60% of capacity? Since
>> the maximum throughput of the DSP application is dictated by the
>> slowest part of the pipeline, we would need to identify the slowest
>> operator and then average horizontally.
>>
>> The only method I can see for determining the point at which the
>> system can no longer keep up is to scale the input throughput
>> slowly until the backpressure HIGH alarm is shown, at which point the
>> number of messages/sec is known.
>>
>> Yes, I know this is a gross oversimplification and there are many, many
>> factors that need to be taken into account when dealing with
>> backpressure, but it would be nice to have a general indicator; a rough
>> estimate is fine.
>>
>> Thank you in advance.
>>
>> Regards,
>> M.
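
For illustration, the ramp-up approach discussed in this thread (raise the input rate slowly until backpressure appears, then compute utilization against that limit) could be sketched roughly as below. This is only a sketch under stated assumptions: `is_backpressured` is a hypothetical probe that you would have to implement yourself (for example by driving a load generator and checking the job's backpressure status); it is not a Flink API. The 130,000 messages/sec limit in the simulated probe is the assumed figure from the example above, not a measured value.

```python
from typing import Callable, Optional


def find_max_rate(
    is_backpressured: Callable[[int], bool],  # hypothetical probe, not a Flink API
    start: int,
    step: int,
    limit: int,
) -> Optional[int]:
    """Ramp the input rate from `start` up to `limit` in increments of `step`.

    Returns the highest tested rate that did not trigger backpressure,
    or None if even the starting rate was backpressured.
    """
    last_ok = None
    rate = start
    while rate <= limit:
        if is_backpressured(rate):
            return last_ok
        last_ok = rate
        rate += step
    return last_ok


def utilization(current_rate: float, max_rate: float) -> float:
    """Fraction of the measured maximum capacity currently in use."""
    if max_rate <= 0:
        raise ValueError("max_rate must be positive")
    return current_rate / max_rate


if __name__ == "__main__":
    # Simulated probe: assume backpressure appears above 130,000 messages/sec.
    def simulated_probe(rate: int) -> bool:
        return rate > 130_000

    max_rate = find_max_rate(simulated_probe, start=100_000, step=10_000, limit=200_000)
    print(max_rate)                                   # 130000
    print(round(utilization(100_000, max_rate), 2))   # 0.77
```

The step size bounds the precision of the result: with a step of 10,000 the true limit lies somewhere between the returned rate and the next step above it, which fits the "rough estimate is fine" requirement.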
