Hi Jason, The idea behind a more scalable solution is to be able to first partition the data. For example, if we have a stream of orders with some order id contained in both streams, and event time lookups are within a specific order, we can introduce partitioning and thus increase parallelism.
Otherwise, if the join is performed on any element of the stream, we're forced to keep everything under a single bucket. On Sun, May 22, 2022, 04:22 Jason Politis <jpoli...@carrera.io> wrote: > Thank you Yuval! Will look into it first thing Monday morning, much > appreciated. In the case where we aren't able to filter by anything else, > is there anything else we can could potentially look into to help? > > > Thank you > > > Jason Politis > Solutions Architect, Carrera Group > carrera.io > | jpoli...@carrera.io <kpatter...@carrera.io> > <http://us.linkedin.com/in/jasonpolitis> > > > On Fri, May 20, 2022 at 11:46 PM Yuval Itzchakov <yuva...@gmail.com> > wrote: > >> Hi Jason, >> >> When using interval joins, Flink can't parallelize the execution as the >> join key (semantically) is even time, thus all events must fall into the >> same partition for Flink to be able to lookup events from the two streams. >> See the IntervalJoinOperator ( >> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/co/IntervalJoinOperator.java) >> for the details. >> >> If you have any other key(s) that could join both streams, and then >> filter by the time of the event at a later phase, that could speed things >> up. >> >> On Fri, May 20, 2022, 23:53 Jason Politis <jpoli...@carrera.io> wrote: >> >>> Good evening all, >>> >>> We are working on a project where a few queries that are joining based >>> on dates from table A are between dates from table B. Something like: >>> >>> SELECT >>> A.ID, >>> B.NAME >>> FROM >>> A, >>> B >>> WHERE >>> A.DATE BETWEEN B.START_DATE AND B.END_DATE; >>> >>> Both A and B are topics in Kafka with 5 partitions. Doing a simple test >>> of selecting * on each one will yield tasks with a parallelism of 5, which >>> tells me we have parallelism working in flink. >>> >>> BUT, when we attempt the query I pasted above, the BETWEEN clause >>> doesn't parallelize. >>> >>> I'd like to get your expert opinion on this and get your help on how to >>> force this to parallelize. >>> >>> Thank you >>> >>> >>> Jason Politis >>> Solutions Architect, Carrera Group >>> carrera.io >>> | jpoli...@carrera.io <kpatter...@carrera.io> >>> <http://us.linkedin.com/in/jasonpolitis> >>> >>