*From:* Tommy May
*Sent:* Tuesday, March 7, 2023 3:25 AM
*To:* David Morávek
*Cc:* Ken Krugler; Flink User List <user@flink.apache.org>
*Subject:* Re: Avoid shuffling when reading pre-partitioned data from Kafka
Hi Ken & David,

Thanks for following up. I've responded to your questions below.

> If the number of unique keys isn't huge, I could think of yet another
> helicopter stunt that you could try :)

Unfortunately the number of keys in our case is huge; they're unique per
handful of events.
*From:* David Morávek

Using operator state for a stateful join isn't great, because it's meant to
hold only minimal state related to the operator itself (e.g., partition
tracking).

If your data are already pre-partitioned and the partitioning matches (hash
partitioning on the JAVA representation of the key yielded by the
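
A related, experimental API in this area is Flink's
DataStreamUtils.reinterpretAsKeyedStream, which treats an already-partitioned
stream as keyed without inserting a shuffle. A minimal sketch, assuming the
upstream partitioning really does match what keyBy() with the same KeySelector
would produce (which is exactly the precondition described above):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

public class ReinterpretSketch {
    // Reinterpret a pre-partitioned stream as a KeyedStream without a network
    // shuffle. Only safe when every record already sits on the subtask that
    // keyBy(keySelector) would have routed it to; otherwise keyed state and
    // timers end up split across the wrong key groups.
    public static <T, K> KeyedStream<T, K> asKeyed(
            DataStream<T> prePartitioned, KeySelector<T, K> keySelector) {
        return DataStreamUtils.reinterpretAsKeyedStream(prePartitioned, keySelector);
    }
}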
*From:* Ken Krugler

Hi Tommy,

To use stateful timers you need a keyed stream, which gets tricky when you're
trying to avoid the network traffic caused by the keyBy().

If the number of unique keys isn't huge, I could think of yet another
helicopter stunt that you could try :)

It's possible to calculate a compo
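
The general trick of engineering a key so that keyBy() routes a record back to
the subtask it is already on can be sketched with Flink's key-group assignment
utilities. The helper below and its brute-force search are illustrative
assumptions rather than the actual proposal in this thread: for each subtask
index it finds an integer "salt" that KeyGroupRangeAssignment maps back to that
same subtask, so a record keyed by its local salt is exchanged locally instead
of crossing the network.

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class LocalRoutingKeys {
    // For every subtask index, find an integer whose key-group assignment maps
    // back to that subtask. Keying records by the salt of the subtask they are
    // already on keeps the keyBy() exchange local to the task manager.
    public static int[] computeSalts(int parallelism, int maxParallelism) {
        int[] salts = new int[parallelism];
        for (int subtask = 0; subtask < parallelism; subtask++) {
            int salt = 0;
            // Brute-force search; terminates quickly for typical parallelism
            // and the default maxParallelism of 128.
            while (KeyGroupRangeAssignment.assignKeyToParallelOperator(
                    salt, maxParallelism, parallelism) != subtask) {
                salt++;
            }
            salts[subtask] = salt;
        }
        return salts;
    }
}

The obvious cost, and presumably part of what makes this a "helicopter stunt",
is that Flink then only sees the salt as the key, so per-record keyed state and
timers have to be multiplexed by the real key inside the operator.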
*From:* Tommy May

Hello Ken,

Thanks for the quick response! That is an interesting workaround. In our case
though we are using a CoProcessFunction with stateful timers. Is there a
similar workaround path available in that case? The one possible way I could
find required partitioning data in Kafka in a very specific
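
For context, the setup described here, a keyed two-input join with buffered
state and cleanup timers, typically looks something like the sketch below. The
Event and Joined types, the one-sided buffering, and the 60-second timer are
placeholders for illustration, not details from this thread.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Placeholder event and output types for the sketch.
class Event {
    public String key;
    public long timestamp;
    public String getKey() { return key; }
}

class Joined {
    public final Event left;
    public final Event right;
    public Joined(Event left, Event right) { this.left = left; this.right = right; }
}

public class JoinFunction extends KeyedCoProcessFunction<String, Event, Event, Joined> {

    private transient ValueState<Event> leftBuffer;

    @Override
    public void open(Configuration parameters) {
        leftBuffer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("left", Event.class));
    }

    @Override
    public void processElement1(Event left, Context ctx, Collector<Joined> out) throws Exception {
        leftBuffer.update(left);
        // Timers are only available on a keyed stream, which is the constraint
        // Ken points out above.
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000L);
    }

    @Override
    public void processElement2(Event right, Context ctx, Collector<Joined> out) throws Exception {
        Event left = leftBuffer.value();
        if (left != null) {
            out.collect(new Joined(left, right));
            leftBuffer.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Joined> out) {
        // Drop buffered state if the other side never arrived in time.
        leftBuffer.clear();
    }
}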
*From:* Ken Krugler

Hi Tommy,

I believe there is a way to make this work currently, but with lots of caveats
and constraints. This assumes you want to avoid any network shuffle.

1. Both topics have names that return the same value for
   ((topicName.hashCode() * 31) & 0x7FFFFFFF) % parallelism.
2. Both topics have the
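
For reference, the expression in point 1 corresponds to how the Flink Kafka
consumer decides which source subtask reads which topic partition: a start
index derived from the topic name, plus the partition number, modulo the source
parallelism. A simplified sketch of that assignment (the class and method names
here are illustrative; the real logic lives in the Kafka connector's partition
assigner):

public class PartitionToSubtask {

    // Start index for a topic: the value point 1 asks both topic names to
    // agree on.
    public static int startIndex(String topicName, int parallelism) {
        return ((topicName.hashCode() * 31) & 0x7FFFFFFF) % parallelism;
    }

    // Which source subtask ends up reading a given partition of a topic.
    public static int subtaskFor(String topicName, int partition, int parallelism) {
        return (startIndex(topicName, parallelism) + partition) % parallelism;
    }

    public static void main(String[] args) {
        int parallelism = 8;
        // Hypothetical topic names: when the start indexes agree, partition N
        // of both topics is read by the same source subtask.
        System.out.println(subtaskFor("orders", 0, parallelism));
        System.out.println(subtaskFor("payments", 0, parallelism));
    }
}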
*From:* Tommy May

Hello,

My team has a Flink streaming job that does a stateful join across two
high-throughput Kafka topics. This results in a large amount of data ser/de
and shuffling (about 1gb/s, for context). We're running into a bottleneck on
this shuffling step. We've attempted to optimize our flink configura
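
To make the bottleneck concrete, the shape of the job described above is
roughly the sketch below, reusing the placeholder Event, Joined, and
JoinFunction types sketched above. The two keyBy() calls are where the shuffle
happens: every record is hashed by key, serialized, sent to the subtask that
owns that key group, and deserialized again before the join ever sees it.

import org.apache.flink.streaming.api.datastream.DataStream;

public class JoinTopologySketch {
    // Illustrative topology only; the DataStreams are assumed to come from the
    // two Kafka sources mentioned above.
    public static DataStream<Joined> join(DataStream<Event> left, DataStream<Event> right) {
        return left.keyBy(Event::getKey)                 // network shuffle of side 1
                .connect(right.keyBy(Event::getKey))     // network shuffle of side 2
                .process(new JoinFunction());            // stateful keyed join with timers
    }
}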