Hi, I am currently evaluating PyFlink in comparison to Java and ran a number of tests, mainly comparing identical pipelines with a focus on throughput. It seems to me that PyFlink generally performs worse and hits its throughput limit at a point where Java still has resources to spare (and can easily handle twice the amount of data). After seeing the benchmarks at [0], I also tried larger data sizes, but I could not reproduce any of those findings. The only parameter that seemed to help was 'python.fn-execution.bundle.size', but even there the limits were reached rather quickly.
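For context, a minimal sketch of the kind of tuning I tried, using the Table API (the concrete values are only illustrative, and the thread-mode switch is the one described in the blog post at [0]):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Batch more records per Python bundle before flushing results back to the
# JVM; this was the only knob that noticeably helped in my tests
# (100000 is just an example value, not a recommendation).
t_env.get_config().set("python.fn-execution.bundle.size", "100000")

# Thread mode from [0]: runs Python UDFs inside the JVM process instead of
# a separate Python worker process, avoiding inter-process serialization
# (available since Flink 1.15).
t_env.get_config().set("python.execution-mode", "thread")
```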
I would mainly like to know whether this is expected/normal, or whether there are parameters and resources to adjust that would help bring PyFlink /somewhat/ on par with the pure Java implementation. I appreciate any feedback on this. Thank you in advance.

Best
David

[0]: https://flink.apache.org/2022/05/06/exploring-the-thread-mode-in-pyflink/