Can you clarify a bit what you mean by being over-aggressive in the
splitRestriction? We can't go any smaller than a single row group,
which is our unit of splittability.
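
To make the question concrete, here is a minimal sketch of what
one-restriction-per-row-group splitting looks like. It is illustrative
only and not our actual IO code: Range is a hypothetical stand-in for
Beam's OffsetRange, and splitPerRowGroup is a made-up name. In a real
DoFn this logic would sit in the @SplitRestriction method and emit each
sub-range to an OutputReceiver instead of returning a list.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupSplitSketch {
    /** Half-open range [from, to) of row-group indices (hypothetical stand-in for OffsetRange). */
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return "[" + from + ", " + to + ")";
        }
    }

    /** Splits a restriction into one sub-range per row group, the finest split possible. */
    static List<Range> splitPerRowGroup(Range restriction) {
        List<Range> out = new ArrayList<>();
        for (long i = restriction.from; i < restriction.to; i++) {
            out.add(new Range(i, i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // A file with 4 row groups yields 4 single-row-group restrictions.
        System.out.println(splitPerRowGroup(new Range(0, 4)));
    }
}
```

Since each output range already covers exactly one row group, I don't
see how the initial split could be made any more aggressive.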

Thanks!
-Claire

On Tue, May 3, 2022 at 9:14 PM Robert Bradshaw <rober...@google.com> wrote:

> On Tue, May 3, 2022 at 10:39 AM Claire McGinty
> <claire.d.mcgi...@gmail.com> wrote:
> >
> > Hi Beam users,
> >
> > I'm looking for input on one of our IOs that we recently migrated to
> SplittableDoFn. When running in Dataflow we saw performance gains in every
> aspect (VCPU hours, total memory time) except for total elapsed time: the
> SplittableDoFn implementation took 1.5x as many minutes as it did
> previously for about 900GB of Parquet files.
> >
> > It seems like the issue is that it isn't scaling up as much as the old
> BoundedSource version. I ran the SplittableDoFn implementation a couple of
> times to be sure, and it reliably scaled up to only 30%-50% of the maximum
> number of workers that the old version reached. Both implementations of
> this IO have the same base level of "splittability" (Parquet row groups),
> so I'm not sure what the issue could be.
> >
> > I saw in an older user@ thread that using Dataflow Runner V2 was
> suggested as a mitigation. I did re-try my job using Dataflow Prime and saw
> significant improvement, but we're not able to migrate our entire fleet to
> V2 at this time.
>
> Note that you can pass use_runner_v2 to use Dataflow Runner V2 if
> there are other Prime features that you're not ready for yet. (It
> would be good to understand what issues you're running into as well,
> if you're able to share.)
>
> > Is there any workaround for Dataflow Runner V1 to improve the scale-up
> for SplittableDoFn sources?
>
> Runner V1 has architectural constraints that keep it from executing
> SplittableDoFns as well as Runner V2 can. Upgrading to Runner V2
> really is the best mitigation. But one possible mitigation might be to
> be over-aggressive in your splitRestriction implementation.
>
