Hi Hubert,
The most straightforward cause of backpressure is under-provisioning of
the cluster. An application usually needs gradually more resources over
time. If the user base of your company grows, so does the number of
messages (be it click stream, page impressions, or any kind of
transacti
One other thought: some users experiencing this have found it preferable to
increase the checkpoint timeout to the point where it is effectively
infinite. Checkpoints that can't time out are likely to eventually complete,
which is better than landing in the vicious cycle you described.
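For what it's worth, here is a minimal sketch of what that could look like, assuming the job runs on Apache Flink (the thread doesn't name the framework, so treat that as an assumption). The `execution.checkpointing.timeout` key is Flink's checkpoint timeout setting; the 24 h value is only an illustrative stand-in for "effectively infinite", not a recommendation:

```yaml
# flink-conf.yaml (assumes Apache Flink)
# The default checkpoint timeout is 10 minutes. A very large value keeps
# slow checkpoints under backpressure from expiring before they finish.
# 24 h is an illustrative value only.
execution.checkpointing.timeout: 24 h
```

The same setting can also be applied per job in code via `env.getCheckpointConfig().setCheckpointTimeout(...)`, which takes the timeout in milliseconds.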
David
On We
You should begin by trying to identify the cause of the backpressure,
because the appropriate fix depends on the details.
Possible causes that I have seen include:
- the job is inadequately provisioned
- blocking I/O is being done in a user function
- a huge number of timers are firing simultaneo