Hi Gyorgy,
I don't think it is possible to co-locate tasks the way you describe. Beam
has no information about the location of 'splits'. On the other hand, if
batch throughput is the main concern, then reading from Kafka might not
be the optimal choice. Although Kafka provides tiered storage for
offloading historical data, it still somewhat limits scalability (and
thus throughput), because the data has to be read by a broker before
being passed to a consumer. Parallelism is therefore bounded by the
number of Kafka partitions, not by the parallelism of the Flink job. A
more scalable approach could be to persist the data from Kafka to batch
storage (e.g. S3 or GCS) and reprocess it from there.
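As a rough sketch of what the reprocessing side might look like in Beam
Java (the bucket path and file format below are made up for
illustration, not taken from your setup):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Count;

  public class BatchReprocess {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

      // Read historical records previously archived from Kafka to object
      // storage. The path is hypothetical. Parallelism here is driven by
      // file splits, not by the number of Kafka partitions.
      p.apply("ReadArchivedRecords",
              TextIO.read().from("gs://my-bucket/kafka-archive/*.json"))
       .apply("CountRecords", Count.globally());
      // ... your actual transforms would go here instead ...

      p.run().waitUntilFinish();
    }
  }

With a file-based source like this, the Flink runner can split the input
into as many bundles as there are workers, which is the scalability
benefit I had in mind.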
Best,
Jan
On 6/29/24 09:12, Balogh, György wrote:
Hi,
I'm planning a distributed system with multiple Kafka brokers
co-located with Flink workers.
Data processing throughput for historic queries is a main KPI, so I
want to make sure all Flink workers read local data rather than remote data.
I'm defining my pipelines in Beam using Java.
Is this possible? What are the critical config elements to achieve it?
Thank you,
Gyorgy
--
György Balogh
CTO
E [email protected]
M +36 30 270 8342
A HU, 1117 Budapest, Budafoki út 209.
W www.ultinous.com