Hi Jan,
Separating live and historic storage makes sense. I need a historic store
that supports data-local processing with the Beam + Flink stack. Can I
reliably achieve this with HDFS? I can colocate HDFS DataNodes with Flink
workers, but what exactly enforces that Flink tasks read local rather than
remote data?
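To make the picture concrete, the historic side I have in mind looks roughly
like this (Beam Java sketch, not compiled; the HDFS path and the parser DoFn
are made up for illustration):

    Pipeline p = Pipeline.create(options);  // options would carry HadoopFileSystemOptions
    p.apply(TextIO.read().from("hdfs://namenode:8020/metadata/*/part-*"))
     .apply(ParDo.of(new ParseMetadataFn()))  // hypothetical parser for our metadata
     .apply(/* the same PTransforms we run on the live stream */);
    p.run().waitUntilFinish();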
Thank you,
Gyorgy

On Mon, Jul 1, 2024 at 3:42 PM Jan Lukavský <[email protected]> wrote:

> Hi Gyorgy,
>
> comments inline.
> On 7/1/24 15:10, Balogh, György wrote:
>
> Hi Jan,
> Let me add a few more details to show the full picture. We have live
> datastreams (video analysis metadata) and we would like to run both live
> and historic pipelines on the metadata (e.g. live alerts, historic video
> searches).
>
> This should be fine due to Beam's unified model. You can write a
> PTransform that handles a PCollection<...> without needing to worry whether the
> PCollection was created from Kafka or some bounded source.
>
> We planned to use Kafka to store the streaming data and directly run both
> types of queries on top. You are suggesting to consider having Kafka with
> small retention to serve the live queries and store the historic data
> somewhere else which scales better for historic queries? We need to have on
> prem options here. What options should we consider that scale nicely (in
> terms of I/O parallelization) with Beam (e.g. HDFS)?
>
> Yes, I would not say necessarily "small" retention, but probably "limited"
> retention. Running on premise you can choose from HDFS or maybe S3
> compatible MinIO or some other distributed storage, depending on the scale
> and deployment options (e.g. YARN or k8s).
>
> I also happen to work on a system which targets exactly these
> streaming-batch workloads (persisting upserts from stream to batch for
> reprocessing), see [1]. Please feel free to contact me directly if this
> sounds interesting.
>
> Best,
>
>  Jan
>
> [1] https://github.com/O2-Czech-Republic/proxima-platform
>
> Thank you,
> Gyorgy
>
> On Mon, Jul 1, 2024 at 9:21 AM Jan Lukavský <[email protected]> wrote:
>
>> Hi Gyorgy,
>>
>> I don't think it is possible to co-locate tasks as you describe it. Beam
>> has no information about location of 'splits'. On the other hand, if batch
>> throughput is the main concern, then reading from Kafka might not be the
>> optimal choice. Although Kafka provides tiered storage for offloading
>> historical data, it still somewhat limits scalability (and thus
>> throughput), because the data have to be read by a broker and only then
>> passed to a consumer. The parallelism is therefore limited by the number of
>> Kafka partitions rather than by the parallelism of the Flink job. A more scalable
>> approach could be to persist data from Kafka to a batch storage (e.g. S3 or
>> GCS) and reprocess it from there.
>>
>> Best,
>>
>>  Jan
>> On 6/29/24 09:12, Balogh, György wrote:
>>
>> Hi,
>> I'm planning a distributed system with multiple Kafka brokers co-located
>> with Flink workers.
>> Data-processing throughput for historic queries is a key KPI, so I want to
>> make sure all Flink workers read local rather than remote data. I'm defining
>> my pipelines in Beam using Java.
>> Is this possible? What are the critical config elements to achieve this?
>> Thank you,
>> Gyorgy
>>
>

-- 

György Balogh
CTO
E [email protected] <[email protected]>
M +36 30 270 8342 <+36%2030%20270%208342>
A HU, 1117 Budapest, Budafoki út 209.
W www.ultinous.com
