Hi, I'm trying to build something in Flink that relies heavily on the Windowing features.
In essence, this is what I want to build: I have clickstream data coming in via Kafka. Each record (click) has a sessionid and a timestamp. I want to create a window for each session, and after 30 minutes idle I want all events for that session (visit) to be written to disk. The effect should be that a specific visit exists in exactly one file.

Since HDFS does not like 'small files', I want to create a (set of) files every 15 minutes that contains several complete visits. So I need to buffer the 'completed visits' and flush them to disk in 15 minute batches.

What I think I need to get this is:
1) A map function that assigns the visit-id (i.e. a new id after 30 minutes idle)
2) A window per visit-id (close the window 30 minutes after the last click)
3) A window per 15 minutes that only contains windows of visits that are complete

Today I've been trying to get this set up and I think I have some parts that are in the right direction.

I have some questions and I'm hoping you guys can help me:

1) I have trouble understanding the way a windowed stream works "exactly". As a consequence I'm having a hard time verifying whether my code does what I understand it should do. I guess what would really help me is a very simple example of how to unit test such a window.
2) Has what I describe above perhaps already been done before? If so, any pointers are really appreciated.
3) Am I working in the right direction for what I'm trying to achieve, or should I use a different API or a different approach?

Thanks

--
Best regards / Met vriendelijke groeten,
Niels Basjes
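
P.S. For reference, the visit-id assignment described in step 1 could be sketched in plain Java roughly as follows. This is only an illustration of the logic, not Flink API: the class name VisitIdAssigner, the method assign and the "sessionid-counter" id format are made up for this example. It also assumes that the clicks of a session arrive in timestamp order and that a gap of exactly 30 minutes already counts as idle.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: derives a visit-id from (sessionid, timestamp). */
public class VisitIdAssigner {
    /** Idle gap after which a new visit starts (30 minutes, in milliseconds). */
    static final long SESSION_GAP_MS = 30 * 60 * 1000L;

    // Timestamp of the most recent click seen per session.
    private final Map<String, Long> lastSeen = new HashMap<>();
    // Number of visits started so far per session.
    private final Map<String, Integer> visitCounter = new HashMap<>();

    /**
     * Returns the visit-id for a click. A new visit-id is issued for the
     * first click of a session and whenever the session has been idle
     * for at least SESSION_GAP_MS. Assumes in-order timestamps per session.
     */
    public String assign(String sessionId, long timestampMs) {
        Long previous = lastSeen.get(sessionId);
        int counter = visitCounter.getOrDefault(sessionId, 0);
        if (previous == null || timestampMs - previous >= SESSION_GAP_MS) {
            counter++;
            visitCounter.put(sessionId, counter);
        }
        lastSeen.put(sessionId, timestampMs);
        return sessionId + "-" + counter;
    }
}
```

In a real job this state would have to live in Flink's keyed state (keyed by sessionid) rather than in local HashMaps, so that it survives failures and works across parallel instances.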