Hello everyone,

I've been struggling for a while to perform a session-window join using pyflink 
/ SQL API. From my understanding it seems that the table API requires an 
equality on the window_start and window_end parameters. Unfortunately, in the 
case of session join (with static gap) it is rarely the case that the window of 
both left and right streams will produce same start and end, leaving the window 
inner-join empty. In order to get around this issue i've tried to cast both my 
left and right tables to a common schema, union them and make a session window 
on the unioned stream, which seems to work for 2 or 3 tables but I'm hitting 
severe performance issues when trying to scale this solution (my use-case has 
15 tables with 10k records per second each on which I would like to perform a 
Session Window).

On the other hand, the looking at the description of the Datastream API of 
Session-Window 
Join<https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/operators/joining/#session-window-join>
 it seems to do exactly what I'm looking for:
When performing a session window join, all elements with the same key that when 
“combined” fulfill the session criteria are joined in pairwise combinations and 
passed on to the JoinFunction or FlatJoinFunction. Again this performs an inner 
join, so if there is a session window that only contains elements from one 
stream, no output will be emitted!
Unfortunately, it seems that WindowJoin is still not supported in the python 
API. After looking around in the code of the pyflink implementation, it seems 
to me that I could manage to propose a base implementation if pointed in the 
right direction by someone with experience on maintaining the pyflink 
repository. Would the community be interested in such support for Joins in 
python API ? If so I would be willing to get started with an implementation 
proposal and open a PR if the community.

Thanks a lot !
Best Regrards,
Hugo Polsinelli

Reply via email to