Hello,

There is one Flink job which consumes from Kafka topic (TOPIC_IN), simply
flatMap / map data and push to another Kafka topic (TOPIC_OUT).
TOPIC_IN has around 30 partitions, data is more or less sequential per
partition and the job has parallelism 30. So in theory there should be 1:1
mapping between consumer and partition.

But it's often to see big lag in offsets for some partitions. So that
should mean that some of consumers are slower than another (i.e. some
network issues for particular broker host or anything else). So data in
TOPIC_OUT partitions is distributed but not sequential at all.

So when some another flink job consumes from TOPIC_OUT and uses
BoundedOutOfOrdernessTimestampExtractor to generate watermarks, due to
difference in data timestamps, there can be a lot of late data. Maybe
something is missing of course in this setup or there is more good approach
for such flatMap / map jobs.

Setting big WindowedStream#allowedLateness or giving more time for
BoundedOutOfOrdernessTimestampExtractor will increase memory consumption
and probably will cause another issues and anyway there can be late data
which is not good for later windows.

One of the solution is to have some shared place, to synchronize lower
timestamp between consumers and somehow slow down consumption (Thread
sleep, wait, while loop with condition...).

0. Is there any good approach to handle such "Kafka <-  flatMap / map ->
Kafka" tasks? so data in TOPIC_OUT will be sequential as in TOPIC_IN.

1. As far as I see it should be common problem with some slow consumers for
big Kafka topic with a lot of partitions, isn't it? How Flink/Kafka hadle
it?

2. Does somebody know, is there any mechanism in Flink - Kafka,
(backpreassure?), which can tell from child operator (some process function
for example) to specific fast consumers to slow down a bit? Is something
like callback possible in Flink, don't think so, but..?

3. Or is there in Flink already anything which can help to synchronize
minimum timestamps between consumers and?

4. Is there any good approach to slow down consumption in Kafka consumer?
There should be some problems between session timeout and poll I think or
something related to that, but maybe there is already some good solution
for that :)

Will be glad if somebody can give some hints for any of the questions,

Best,
Sasha

Reply via email to