Hi Theo,

I think you're right that there is currently no good built-in solution for your use case. What you would ideally like to have is some operation that can buffer records and "hold back" the watermark according to the timestamps of the records that are in the buffer. This has the benefit that it doesn't require adding a watermark operation in the middle of the graph, which can lead to problems because you then have to take care of good per-partition watermarking yourself, something that is ideally done in the source.

It's certainly possible to write an operator that does such buffering and holds back the watermark: you would have to override processWatermark(), keep track of the timestamps of the buffered records and the maximum watermark seen so far, and emit a modified watermark downstream whenever there are updates or you evict records from the buffer.
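
A rough, untested sketch of what I mean (class and field names are made up, and the actual buffering/eviction logic is elided):

    import java.util.PriorityQueue;

    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.api.watermark.Watermark;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

    public class HoldBackWatermarkOperator<T>
            extends AbstractStreamOperator<T>
            implements OneInputStreamOperator<T, T> {

        // event timestamps of the records currently sitting in the buffer
        private final PriorityQueue<Long> bufferedTimestamps = new PriorityQueue<>();

        // maximum watermark received from upstream so far
        private long maxWatermarkSeen = Long.MIN_VALUE;

        // last watermark we forwarded, to keep watermarks non-decreasing
        private long lastEmitted = Long.MIN_VALUE;

        @Override
        public void processElement(StreamRecord<T> element) throws Exception {
            bufferedTimestamps.add(element.getTimestamp());
            // ... buffer the element; when your logic later evicts and emits it,
            // remove its timestamp from bufferedTimestamps and call maybeEmitWatermark()
        }

        @Override
        public void processWatermark(Watermark mark) throws Exception {
            // remember the upstream watermark but do not forward it unchanged
            maxWatermarkSeen = Math.max(maxWatermarkSeen, mark.getTimestamp());
            maybeEmitWatermark();
        }

        private void maybeEmitWatermark() {
            // the watermark we may emit is capped by the oldest record still buffered
            long oldestBuffered =
                    bufferedTimestamps.isEmpty() ? Long.MAX_VALUE : bufferedTimestamps.peek();
            long candidate = Math.min(maxWatermarkSeen, oldestBuffered - 1);
            if (candidate > lastEmitted) {
                lastEmitted = candidate;
                output.emitWatermark(new Watermark(candidate));
            }
        }
    }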

I'd be open to suggestions if someone wants to add a proper solution/API for this to Flink.

Best,
Aljoscha

On 07.09.20 16:51, Theo Diefenthal wrote:
Hi Aljoscha,

We have a ProcessFunction which does some processing per Kafka partition. It 
basically buffers the incoming data for 1 minute and drops events from the 
stream if a related event arrives within that minute.

In order to buffer the data and store the events over that minute, we perform a 
keyBy(kafkaPartition) and thus have keyed state per Kafka partition in there. 
The process function however messes up the watermarks (due to the buffering). 
Consequently, we have to reassign the watermarks and want to do it the same way 
the Kafka source's assigner would have done. We can only do this if the 
timestamp assigner is aware of the subtask index.
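
Simplified, the buffering looks roughly like this (Event, getTimestamp() and 
isSuperseded() are placeholders for our actual types and the "related event 
arrived" check):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class OneMinuteBuffer extends KeyedProcessFunction<Integer, Event, Event> {

        private transient ListState<Event> buffered;

        @Override
        public void open(Configuration parameters) {
            buffered = getRuntimeContext().getListState(
                    new ListStateDescriptor<>("buffered", Event.class));
        }

        @Override
        public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
            buffered.add(event);
            // decide about this event one minute (event time) after its timestamp
            ctx.timerService().registerEventTimeTimer(event.getTimestamp() + 60_000L);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
            List<Event> stillBuffered = new ArrayList<>();
            for (Event e : buffered.get()) {
                if (e.getTimestamp() + 60_000L <= timestamp) {
                    if (!isSuperseded(e)) {
                        // emitted with its original timestamp, i.e. behind the
                        // watermark that has already passed through this operator
                        out.collect(e);
                    }
                } else {
                    stillBuffered.add(e);
                }
            }
            buffered.update(stillBuffered);
        }

        private boolean isSuperseded(Event e) {
            return false; // placeholder for the real "related event arrived" check
        }
    }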

I solved this issue here by not using "assignTimestampsAndWatermarks" but instead extending 
TimestampsAndWatermarksOperator and putting a "transform" into the DataStream. I don't 
like this solution much, but it solves the needs here.
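
In code, the workaround boils down to something like this (SubtaskAwareWatermarksOperator 
is our own subclass of TimestampsAndWatermarksOperator, not something that ships with 
Flink, and buffered/watermarkStrategy stand for our actual stream and strategy):

    SingleOutputStreamOperator<Event> reassigned = buffered.transform(
            "Reassign watermarks per Kafka partition",
            buffered.getType(),
            new SubtaskAwareWatermarksOperator<>(watermarkStrategy));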

Ideally, I would love to not need state here at all (and buffer in RAM only). I would really love 
to extend the "buffer logic" directly inside the KafkaConsumer so that there is nothing 
to store in a checkpoint (except for the Kafka offsets, which would be stored anyway, just 
different ones). But that's another topic; I will open a thread on the dev mailing list with a 
feature request for it soon. Or do you have any other idea how to buffer the data per Kafka 
partition? We have the nice semantic that one Kafka partition receives data from one of our servers, 
so we have ascending timestamps per partition and can guarantee that related events 
always arrive in the same partition. I think that's a rather common use case in Flink which can 
improve latency a lot, so I would love to have some more features directly from Flink to 
better support "processing per Kafka partition" without the need to shuffle.

Best regards
Theo

----- Original Message -----
From: "Aljoscha Krettek" <aljos...@apache.org>
To: "user" <user@flink.apache.org>
Sent: Monday, September 7, 2020 11:07:35
Subject: Re: How to access state in TimestampAssigner in Flink 1.11?

Hi,

sorry for the inconvenience! I'm sure we can find a solution together.

Why do you need to keep state in the watermark assigner? The Kafka
source will by itself maintain the watermark per partition, so just
specifying a WatermarkStrategy on the source will already correctly
compute the watermark per partition and then combine them.
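
For reference, in 1.11 this looks roughly like the following (Event, 
EventDeserializationSchema and kafkaProps are placeholders for your own types 
and configuration):

    FlinkKafkaConsumer<Event> consumer =
            new FlinkKafkaConsumer<>("topic", new EventDeserializationSchema(), kafkaProps);

    // per-partition watermarking happens inside the source
    consumer.assignTimestampsAndWatermarks(
            WatermarkStrategy
                    .<Event>forMonotonousTimestamps()
                    .withTimestampAssigner((event, previousTimestamp) -> event.getTimestamp()));

    DataStream<Event> stream = env.addSource(consumer);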

Best,
Aljoscha

On 20.08.20 08:08, Till Rohrmann wrote:
Hi Theo,

thanks for reaching out to the community. I am pulling in Aljoscha and Klou
who have worked on the new WatermarkStrategy/WatermarkGenerator abstraction
and might be able to help you with your problem. At the moment, it looks to
me as if there is no way to combine state with the new WatermarkGenerator
abstraction.

Cheers,
Till

On Wed, Aug 19, 2020 at 3:37 PM Theo Diefenthal
<theo.diefent...@scoop-software.de> wrote:

Hi there,

Right now I'm in the process of upgrading our Flink 1.9 jobs to Flink 1.11.

In Flink 1.9, I was able to write an AssignerWithPeriodicWatermarks which
also extended AbstractRichFunction and could thus utilize state and
getRuntimeContext() in there. This worked because the
TimestampsAndWatermarksOperator was an AbstractUdfStreamOperator and passed
my assigner in as the userFunction to that operator.

I used this feature for some "per partition processing" which Flink
somehow isn't ideally suited for at the moment, I guess. We have ascending
timestamps per Kafka partition and do some processing on that. In order to
maintain state per Kafka partition, I now keyBy the Kafka partition in our
stream (not ideal, but better than operator state in terms of rescaling), but
afterwards need to emulate the watermark strategy from the initial Kafka
source, i.e. reassign watermarks the same way the Kafka source did (per
Kafka partition within the operator). Via getRuntimeContext() I am/was able
to identify the Kafka partitions one operator instance was responsible for
and could produce the output watermark accordingly (min over all
responsible partitions).
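
Simplified, the 1.9 assigner looked roughly like this (Event and
getKafkaPartition() are placeholders; the mapping from subtask index to
responsible partitions is elided):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.api.common.functions.AbstractRichFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
    import org.apache.flink.streaming.api.watermark.Watermark;

    public class PerPartitionAssigner extends AbstractRichFunction
            implements AssignerWithPeriodicWatermarks<Event> {

        private transient Map<Integer, Long> maxTimestampPerPartition;

        @Override
        public void open(Configuration parameters) {
            maxTimestampPerPartition = new HashMap<>();
            int subtask = getRuntimeContext().getIndexOfThisSubtask();
            // ... use the subtask index to determine the Kafka partitions this
            // instance is responsible for and pre-populate the map for them
        }

        @Override
        public long extractTimestamp(Event element, long previousElementTimestamp) {
            maxTimestampPerPartition.merge(
                    element.getKafkaPartition(), element.getTimestamp(), Math::max);
            return element.getTimestamp();
        }

        @Override
        public Watermark getCurrentWatermark() {
            // min over all partitions this subtask is responsible for
            long min = maxTimestampPerPartition.values().stream()
                    .mapToLong(Long::longValue).min().orElse(Long.MIN_VALUE);
            return new Watermark(min == Long.MIN_VALUE ? Long.MIN_VALUE : min - 1);
        }
    }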

In Flink 1.11, how can I rebuild this behavior? Do I really need to build
my own TimestampsAndWatermarksOperator which works like the old one? Or is
there a better approach?

Best regards
Theo


