Hi Jessy,
let me try to answer some of your questions.
> 16 Task Managers with 1 task slot and 1 CPU each
Every additional task manager also involves management overhead. So I
would suggest option 1. But in the end you need to perform some
benchmarks yourself. I could also imagine that a mixture could be
beneficial depending on the isolation level of your pipelines.
> there will be a single copy in the HEAP
From the JavaDocs of BroadcastState:
"Each operator instance individually maintains and stores elements in
the broadcast state."
I would assume that the HEAP contains n copies where n is the parallelism.
> I can process only n events/seconds(if the latency of the pipeline is
1s.)
Latency is not necessarily thoughput. This depends on the pipeline. For
example, if the pipeline contains only map functions without any keyBy.
The operators are "chained" together. You can also influence the
chaining [1] for better resource utilization. If you pipeline contains
I/O to external systems, you can use async IO to increase the
throughput. [2]
I would also recommend this section here:
https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/flink-architecture/#task-slots-and-resources
Maybe you can elaborate on the operations in your pipeline.
> Can Flink process multiple events from the same key at the same time?
No, every key is routed to and executed by the same thread. However, you
create an artificial key to spread the load more evenly.
> any blogs regarding the results of Flink's load testing
I would also recommend the FlinkForward YouTube channel. A lot of users
stories including actual numbers and configurations are shown there.
Regards,
Timo
[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/
On 10.12.21 11:46, Jessy Ping wrote:
Hi All,
I have the following questions regarding the sizing of the Flink cluster
doing stateful computation using Datastream API. It will be better if
the community can answer the below questions or doubts.
Suppose we have a pipeline as follows,
*Kafka real time events source1 & Kafka rules source 2 ->
KeyedBroadcastProcessFunction -> Kafka Sink*
As you can see, we will be processing the real-time events from the
Kafka source using the rules broadcasted from the rule source with the
help of keyed broadcast function.
_Questions_
* I have a machine with 16 CPUs and 32 GB Ram. Which configuration is
efficient for achieving the target parallelism of 16?
1. A single task manager with 16 task slots
2. 16 Task Managers with 1 task slot and 1 CPU each.
* If I have a broadcast state in my pipeline and I have a single task
manager with 16 task slots for achieving the target parallelism of
16. Does Flink keep 16 copies of broadcast state in the single task
manager or there will be a single copy in the HEAP for the entire
task slots?
* If a parallelism of n means, I can process only n events/seconds(if
the latency of the pipeline is 1s.). How many requests a single task
slot (containing a single task) can execute at a time ?
* Can Flink process multiple events from the same key at the same time?
* I have found the following blog regarding the Flink cluster size,
https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines
<https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines>.
Do we have some other blogs, testimonials, or books regarding the
sample production setup/configuration of a Flink cluster for
achieving different ranges of throughput ?
* Are there any blogs regarding the results of Flink's load testing
results ?
Thanks
Jessy