Hi Gyorgy,
I don't think it is possible to co-locate tasks the way you describe. Beam
has no information about the location of 'splits'. On the other hand, if
batch throughput is the main concern, then reading from Kafka might not
be the optimal choice. Although Kafka provides tiered storage for
offloading historical data, it still somewhat limits scalability (and
thus throughput), because the data has to be read by a broker before
being passed to a consumer. Parallelism is therefore bounded by the
number of Kafka partitions, not by the parallelism of the Flink job. A
more scalable approach could be to persist the data from Kafka to batch
storage (e.g. S3 or GCS) and reprocess it from there.
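As a rough sketch of what the reprocessing side might look like in Beam
Java (the bucket path and file format below are made up for
illustration, not taken from your setup):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Count;

  public class BatchReprocess {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

      // Read historical records previously archived from Kafka to object
      // storage. The path is hypothetical. Parallelism here is driven by
      // file splits, not by the number of Kafka partitions.
      p.apply("ReadArchivedRecords",
              TextIO.read().from("gs://my-bucket/kafka-archive/*.json"))
       .apply("CountRecords", Count.globally());
      // ... your actual transforms would go here instead ...

      p.run().waitUntilFinish();
    }
  }

With a file-based source like this, the Flink runner can split the input
into as many bundles as there are workers, which is the scalability
benefit I had in mind.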
Best,
Jan
On 6/29/24 09:12, Balogh, György wrote:
Hi,
I'm planning a distributed system with multiple Kafka brokers
co-located with Flink workers.
Data processing throughput for historic queries is a main KPI, so I
want to make sure all Flink workers read local data rather than remote data.
I'm defining my pipelines in Beam using Java.
Is this possible? What are the critical config elements to achieve it?
Thank you,
Gyorgy
--
György Balogh
CTO
E [email protected]
M +36 30 270 8342
A HU, 1117 Budapest, Budafoki út 209.
W www.ultinous.com