Hi Jason,

A more scalable solution starts with partitioning the data. For example,
if both streams carry an order id and every event-time lookup stays within
a single order, we can partition by that id and thus increase parallelism.

Otherwise, if any element of one stream can match any element of the other,
we're forced to keep everything in a single bucket.
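
As a rough sketch of what that could look like for your query, assuming
both topics carry a shared key (ORDER_ID below is a hypothetical column, it
is not in your original query), adding an equality condition gives the
planner something to hash-partition on, and the date range is then checked
only within each key:

```sql
-- Sketch only: ORDER_ID is a hypothetical shared key, not a column from the
-- original query. All other names are copied from the query in this thread.
SELECT
  A.ID,
  B.NAME
FROM A
JOIN B
  ON A.ORDER_ID = B.ORDER_ID                        -- equality key to partition on
 AND A.`DATE` BETWEEN B.START_DATE AND B.END_DATE;  -- range check within each key
-- DATE is backtick-quoted because it is on Flink SQL's reserved keyword list.
```

How much this helps depends on how evenly the key is distributed. A handful
of hot keys would still concentrate most of the work on a few subtasks.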

On Sun, May 22, 2022, 04:22 Jason Politis <jpoli...@carrera.io> wrote:

> Thank you Yuval!  Will look into it first thing Monday morning, much
> appreciated.  In the case where we aren't able to filter by anything else,
> is there anything else we could potentially look into to help?
>
>
> Thank you
>
>
> Jason Politis
> Solutions Architect, Carrera Group
> carrera.io
> | jpoli...@carrera.io <kpatter...@carrera.io>
> <http://us.linkedin.com/in/jasonpolitis>
>
>
> On Fri, May 20, 2022 at 11:46 PM Yuval Itzchakov <yuva...@gmail.com>
> wrote:
>
>> Hi Jason,
>>
>> When using interval joins, Flink can't parallelize the execution as the
>> join key (semantically) is event time, thus all events must fall into the
>> same partition for Flink to be able to look up events from the two streams.
>> See the IntervalJoinOperator (
>> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/co/IntervalJoinOperator.java)
>> for the details.
>>
>> If you have any other key(s) that could join both streams, and then
>> filter by the time of the event at a later phase, that could speed things
>> up.
>>
>> On Fri, May 20, 2022, 23:53 Jason Politis <jpoli...@carrera.io> wrote:
>>
>>> Good evening all,
>>>
>>> We are working on a project where a few queries join on dates from
>>> table A that fall between dates from table B.  Something like:
>>>
>>> SELECT
>>> A.ID,
>>> B.NAME
>>> FROM
>>> A,
>>> B
>>> WHERE
>>> A.DATE BETWEEN B.START_DATE AND B.END_DATE;
>>>
>>> Both A and B are topics in Kafka with 5 partitions.  Doing a simple test
>>> of selecting * on each one will yield tasks with a parallelism of 5, which
>>> tells me we have parallelism working in flink.
>>>
>>> BUT, when we attempt the query I pasted above, the BETWEEN clause
>>> doesn't parallelize.
>>>
>>> I'd like to get your expert opinion on this and get your help on how to
>>> force this to parallelize.
>>>
>>> Thank you
>>>
>>>
>>> Jason Politis
>>> Solutions Architect, Carrera Group
>>> carrera.io
>>> | jpoli...@carrera.io <kpatter...@carrera.io>
>>> <http://us.linkedin.com/in/jasonpolitis>
>>>
>>
