Re: Dataflow runner v1 vs. v2 for Java pipelines

Marcin Kuthan Fri, 06 May 2022 07:08:57 -0700

Hi Robert

Thank you for the answer and for creating the Beam issue. The documentation
has also been fixed, very nice!


Do you know when Runner v2 will become the default runner for Java
pipelines? Is it a matter of months or years?

BTW, for testing purposes I successfully deployed one of the Google-provided
Dataflow templates (PubsubToBigQuery), although some hacks are required, as
documented here:
https://github.com/GoogleCloudPlatform/DataflowTemplates/issues/382
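
For anyone hitting the same error, the workaround from the quoted thread
below (passing --experiments=use_runner_v2 instead of relying on
--dataflowServiceOptions=use_runner_v2 alone) could be wired into a launcher
roughly like this. It is only a sketch: the RunnerV2Args class and the
withRunnerV2 helper are hypothetical names made up for illustration, not part
of Beam, and the code only manipulates the argument array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Beam): ensures the experiments flag
// mentioned in this thread is present in the launch arguments.
public class RunnerV2Args {
    // Flag taken verbatim from the thread.
    public static final String RUNNER_V2_EXPERIMENT = "--experiments=use_runner_v2";

    // Returns a copy of args with the flag appended if it is not already there.
    // Assumption: this sketch does not merge with an existing
    // "--experiments=<other values>" argument; it only checks for an exact match.
    public static String[] withRunnerV2(String[] args) {
        List<String> result = new ArrayList<>(Arrays.asList(args));
        if (!result.contains(RUNNER_V2_EXPERIMENT)) {
            result.add(RUNNER_V2_EXPERIMENT);
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Prints the adjusted arguments; a real launcher would pass them on
        // to the pipeline's own options parsing instead.
        System.out.println(String.join(" ", withRunnerV2(args)));
    }
}
```

The adjusted array would then go through the pipeline's usual command-line
options parsing, the same way WordCount handles its arguments.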


On Thu, 5 May 2022 at 18:18, Robert Bradshaw <rober...@google.com> wrote:

> FYI, I filed https://issues.apache.org/jira/browse/BEAM-14421
>
> On Thu, May 5, 2022 at 9:14 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >
> > On Wed, May 4, 2022 at 11:40 PM Marcin Kuthan <marcin.kut...@gmail.com>
> wrote:
> > >
> > > Hi
> > >
> > > I experimented a bit with Dataflow Runner v2 and I do not see a strong
> advantage for pure Java pipelines. I would like to hear the voices from
> more experienced Dataflow Runner v2 users:
> > >
> > > * Do you run your production Java pipelines using Dataflow Runner V2?
> > > * Streaming or batch?
> > > * Custom containers or default ones?
> > > * Have you observed any regression after migration from Dataflow
> Runner V1?
> >
> > Dataflow Runner v2 should mostly be an internal implementation detail,
> > although it does offer additional features that are not available for
> > Dataflow Runner v1 (e.g. fully splittable SDFs, multi-language, custom
> > containers). Eventually Dataflow will run all pipelines on this new
> > backend (e.g. all of Go and most of Python are already in this state)
> > though as you've noticed, you are welcome (and it's in fact
> > encouraged) to try it out sooner.
> >
> > I'd welcome any other feedback people have in trying it out here as well.
> >
> > > Finally, the WordCount example (Beam 2.38) does not work if only the
> following parameter is set "--dataflowServiceOptions=use_runner_v2".
> >
> > Thanks for reporting this! This is unexpected--the two should
> > essentially be aliases (the latter only named such as the name
> > "experiments" was off-putting for mature features). I'll look into it.
> > In the meantime, go ahead and use the experiments flag.
> >
> > > java.lang.RuntimeException: Failed to create a workflow job: Dataflow
> Runner v2 requires a valid FnApi job, Please resubmit your job with a valid
> configuration. Note that if using Templates, you may need to regenerate
> your template with the '--use_runner_v2'.
> > >     at org.apache.beam.runners.dataflow.DataflowRunner.run
> (DataflowRunner.java:1330)
> > >     at org.apache.beam.runners.dataflow.DataflowRunner.run
> (DataflowRunner.java:196)
> > >     at org.apache.beam.sdk.Pipeline.run (Pipeline.java:323)
> > >     at org.apache.beam.sdk.Pipeline.run (Pipeline.java:309)
> > >     at org.apache.beam.examples.WordCount.runWordCount
> (WordCount.java:196)
> > >     at org.apache.beam.examples.WordCount.main (WordCount.java:203)
> > >
> > > I'm able to run the job with "--experiments=use_runner_v2" option and
> it seems that runner v2 is enabled. The steps are different, especially for
> the reader part of the pipeline.
> > > If I specify both options "--dataflowServiceOptions=use_runner_v2" and
> "--experiments=use_runner_v2" the job looks identical to the previous one,
> so the dataflowServiceOptions seems to be redundant. But the official
> documentation says clearly to use the dataflowServiceOptions option to
> enable Dataflow Runner v2:
> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2
> .
> > >
> > > Thanks for sharing your thoughts
> > > Marcin Kuthan
>
