It is correct that keyBy and partition operations will distribute
messages over the network
as they distribute the data across all subtasks. For this use-case we
only want to consider
subtasks that are subsequent to our operator, like a local keyBy.
I don't think there is an obvious way to implement it, but I'm currently
theory-crafting a bit
and will get back to you.
On 11.10.2017 14:52, Sanne de Roever wrote:
Hi,
Currently we need 75 Kafka partitions per topic and a parallelism of
75 to meet required performance, increasing the partitions and
parallelism gives diminished returns
Currently the performance is approx. 1500 msg/s per core, having one
pipeline (source, map, sink) deployed as one instance per core.
The Kafka source performance is not an issue. The map is very heavy
(deserialization, validation) on rather complex Avro messages. Object
reuse is enabled.
Ideally we would like to decouple Flink processing parallelism from
Kafka partitions in a following manner:
* Pick a source parallelism
* Per source, be able to pick a parallelism for the following map
* In such a way that some message key determines which -local- map
instance gets a message from a certain visitor
* So that messages with the same visitor key get processed by the
same map and in order for that visitor
* Output the result to Kafka
AFAIK keyBy, partitionCustom will distribute messages over the network
and rescale has no affinity for message identity.
Am I missing something obvious?
Cheers,
Sanne