The way that Flink handles session windows is that every new event is
initially assigned to its own session window, and then overlapping sessions
are merged. I imagine this is why you are seeing so many calls
to createAccumulator.

This implementation choice is deeply embedded in the code; I don't think it
can be avoided.

If you can afford to wait until a session ends to emit the session start
event, then you will only be reporting once for each session. Another
solution might be to implement your own windowing using a process function
-- but if you are using event time logic, and if the events can be
processed out of order, I suspect it would be difficult to do much better.

David

On Mon, Sep 5, 2022 at 9:45 PM Kristinn Danielsson via user <
user@flink.apache.org> wrote:

> Hi,
>
>
>
> I'm trying to migrate a KafkaStreams application to Flink (using
> DataStream API).
>
> The application consumes a high traffic (millions of events per second)
> Kafka
>
> topic and collects events into sessions keyed by id. To reduce the load on
>
> subsequent processing steps I want to output one event on session start
> and one
>
> event on session end. So, I set up a pipeline which keys the stream by id,
>
> aggregates the events over a event time session window with a gap of 4
> seconds.
>
> I also implemented a custom trigger to trigger when the first event
>
> arrives in a window.
>
>
>
> When I run this pipeline I somtimes observe that I get multiple calls to
> the
>
> aggregator's "createAccumulator" method for a given session id, and
> therefore I
>
> also get duplicate session start and session end events for the session id.
>
> So it looks to me that the Flink is collecting the events into multiple
> sessions
>
> even if they have the same session id.
>
>
>
> Examples:
>
>
>
>     Input events:
>
>         Event timestamp         Id
>
>         2022-09-06 08:00:00     ABC
>
>         2022-09-06 08:00:01     ABC
>
>         2022-09-06 08:00:02     ABC
>
>         2022-09-06 08:00:03     ABC
>
>         2022-09-06 08:00:04     ABC
>
>         2022-09-06 08:00:05     ABC
>
>
>
>     Problem 1:
>
>         Output events:
>
>             Event time              Id      Type
>
>             2022-09-06 08:00:00     ABC     Start
>
>             2022-09-06 08:00:03     ABC     End
>
>             2022-09-06 08:00:04     ABC     Start
>
>             2022-09-06 08:00:05     ABC     End
>
>     Problem 2:
>
>         Output events:
>
>             Event time              Id      Type
>
>             2022-09-06 08:00:00     ABC     Start
>
>             2022-09-06 08:00:03     ABC     Start
>
>             2022-09-06 08:00:04     ABC     End
>
>             2022-09-06 08:00:05     ABC     End
>
>
>
>     Expected output:
>
>         Event time              Id      Type
>
>         2022-09-06 08:00:00     ABC     Start
>
>         2022-09-06 08:00:05     ABC     End
>
>
>
>
>
> Is this expected behaviour? How can I avoid getting duplicate session
> windows?
>
>
>
> Thanks for your help
>
> Kristinn
>

Reply via email to