Streaming Dataflow relies on a high thread count for performance. Streaming threads spend a large fraction of their time blocked on IO, so getting decent CPU utilization requires many threads; capping the thread count therefore risks performance problems.
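That said, if you do want to experiment with capping it, here is a minimal untested sketch of passing the flag from Python code rather than the command line (the project/region/bucket values are placeholders, and the use_runner_v2 experiment is my assumption for opting into Runner v2 per the docs linked below):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Build options with an explicit cap on worker harness threads.
    # Per this thread, the flag is only honored on Dataflow Runner v2.
    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',                # placeholder
        '--region=us-central1',                # placeholder
        '--temp_location=gs://my-bucket/tmp',  # placeholder
        '--experiments=use_runner_v2',
        '--number_of_worker_harness_threads=12',
    ])

Fewer threads means fewer bundles held in memory at once, which is why this can help with OOMs, at the cost of the IO-bound utilization described above.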
On Fri, Aug 21, 2020 at 8:00 AM Kamil Wasilewski <kamil.wasilew...@polidea.com> wrote:

> No, I'm not. But thanks anyway, I totally missed that option!
>
> It occurs in a simple pipeline that executes CoGroupByKey over two
> PCollections read from a bounded source, 20 million and 2 million
> elements, respectively. One global window. Here's a link to the code,
> it's one of our tests:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/co_group_by_key_test.py
>
> On Thu, Aug 20, 2020 at 6:48 PM Luke Cwik <lc...@google.com> wrote:
>
>> +user <user@beam.apache.org>
>>
>> On Thu, Aug 20, 2020 at 9:47 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> Are you using Dataflow runner v2 [1]?
>>>
>>> If so, then you can use:
>>> --number_of_worker_harness_threads=X
>>>
>>> Do you know where/why the OOM is occurring?
>>>
>>> 1: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2
>>> 2: https://github.com/apache/beam/blob/017936f637b119f0b0c0279a226c9f92a2cf4f15/sdks/python/apache_beam/options/pipeline_options.py#L834
>>>
>>> On Thu, Aug 20, 2020 at 7:33 AM Kamil Wasilewski <kamil.wasilew...@polidea.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> As I stated in the title, is there an equivalent for
>>>> --numberOfWorkerHarnessThreads in the Python SDK? I've got a streaming
>>>> pipeline in Python which suffers from OutOfMemory exceptions (I'm using
>>>> Dataflow). Switching to highmem workers solved the issue, but I wonder
>>>> if I can set a limit on the number of threads used in a single worker
>>>> to decrease memory usage.
>>>>
>>>> Regards,
>>>> Kamil
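For readers finding this thread later: the pipeline shape under discussion boils down to roughly the following (a minimal runnable sketch with toy data, not the linked load test itself):

    import apache_beam as beam

    with beam.Pipeline() as p:
        # Two keyed PCollections standing in for the 20M- and 2M-element
        # bounded sources from the load test.
        big = p | 'CreateBig' >> beam.Create([('k', i) for i in range(20)])
        small = p | 'CreateSmall' >> beam.Create([('k', i) for i in range(2)])

        # CoGroupByKey groups both inputs per key. In a single global
        # window, all values for a key are materialized together, which
        # is where the memory pressure behind the OOMs tends to come from.
        result = (
            {'big': big, 'small': small}
            | 'Join' >> beam.CoGroupByKey()
            | 'Print' >> beam.Map(print))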