Hi Matthis,

Beam Notebooks (InteractiveRunner) support notebook-managed Flink clusters; for details, see: https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development#interactive_flinkrunner_on_notebook-managed_clusters . You may also want to look at the "open source support" section of this blog post: https://cloud.google.com/blog/products/data-analytics/interactive-beam-pipeline-ml-inference-at-scale-in-notebook .
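Roughly, and omitting the Google Cloud project/region options that the docs above describe, a minimal sketch of that setup looks like this (the Create/Map transforms are just placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.interactive import interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.runners.portability.flink_runner import FlinkRunner

# The InteractiveRunner delegates execution to the FlinkRunner; on a
# notebook-managed cluster the Flink master is resolved for you.
p = beam.Pipeline(InteractiveRunner(underlying_runner=FlinkRunner()),
                  options=PipelineOptions())

squares = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)
ib.collect(squares)  # runs the pipeline fragment on the Flink cluster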
If you have your own SDK container, use this configuration:

options.view_as(PortableOptions).environment_config = 'YOUR_REGISTRY/beam_python3.x_sdk:2.4x.0'

and leave environment_type at its default, "DOCKER".

To control the job parallelism, use this parameter (not the one in the test file):

# The parallelism is applied to each step, so if your pipeline has 10 steps, you
# end up having 150 * 10 tasks scheduled that can theoretically be executed in
# parallel by the 320 (upper bound) slots/workers/threads.
options.view_as(FlinkRunnerOptions).parallelism = 150

To tune the throughput/parallelism of your Flink cluster, here are a few example knobs:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/dataproc/dataproc_cluster_manager.py#L63-L69

A consolidated sketch combining these settings follows the quoted message below.

Ning.

On Sun, Dec 4, 2022 at 9:32 AM Matthis Leicht <matthis.lei...@gmail.com> wrote:
> Hello everybody,
>
> Does anyone have experience with setting up a Flink cluster to execute
> Beam pipelines via the Python SDK?
>
> Because rootful containers are not allowed where I want to run the
> cluster, I am trying to do this with podman/podman-compose.
>
> I am trying to make a test pipeline work:
> https://github.com/matleicht/podman_beam_flink
>
> Environment:
> Debian 11 with podman and Python 3.9.2 (with apache-beam==2.38.0 and
> podman-compose) installed.
>
> The setup of the cluster is defined in:
> https://github.com/matleicht/podman_beam_flink/blob/master/docker-compose.yml
> 1x flink-jobmanager (Flink version 1.14)
> 1x flink-taskmanager
> 1x Python SDK harness
>
> I chose to create an SDK container manually because I don't have Docker
> installed, and Flink obviously fails when it tries to create a container
> via Docker.
>
> When I try to run the pipeline (
> https://github.com/matleicht/podman_beam_flink/blob/master/pipeline_test.py)
> the SDK worker gets stuck; the container log shows:
>
> 2022/12/04 16:13:02 Starting worker pool 1: python -m
> apache_beam.runners.worker.worker_pool_main --service_port=50000
> --container_executable=/opt/apache/beam/boot
> Starting worker with command ['/opt/apache/beam/boot', '--id=1-1',
> '--logging_endpoint=localhost:45087',
> '--artifact_endpoint=localhost:35323',
> '--provision_endpoint=localhost:36435',
> '--control_endpoint=localhost:33237']
> 2022/12/04 16:16:31 Failed to obtain provisioning information: failed to
> dial server at localhost:36435
> caused by:
> context deadline exceeded
>
> I suspect that I have an error in the network setup or that some
> configuration is missing for the harness worker, but I could not figure
> out the problem.
>
> Has anyone tried this on podman, or can anyone spot the error in the
> container configuration?
>
> My ultimate goal is to create a streaming pipeline that reads data from a
> Kafka instance, processes it, and writes back to it.
>
> Thanks for your help in advance.
> Best regards
> Matthis
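P.S. To put the pieces above together: assuming your jobmanager's REST endpoint is reachable from the submitting machine, a minimal end-to-end sketch could look like this (the Flink master address, the parallelism value, and the image name are placeholders to adapt):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, PortableOptions
from apache_beam.runners.portability.flink_runner import FlinkRunner, FlinkRunnerOptions

options = PipelineOptions()
flink = options.view_as(FlinkRunnerOptions)
flink.flink_master = 'localhost:8081'  # placeholder: your jobmanager's REST address
flink.parallelism = 150                # applied per step, as explained above
# environment_type stays at its default "DOCKER"; only the image is configured.
options.view_as(PortableOptions).environment_config = (
    'YOUR_REGISTRY/beam_python3.x_sdk:2.4x.0')

with beam.Pipeline(runner=FlinkRunner(), options=options) as p:
    p | beam.Create(['smoke', 'test']) | beam.Map(print)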