Can you clarify a bit what you mean by being over-aggressive in the splitRestriction? We can't split any finer than our unit of splittability, which is a single row group.

Thanks!
-Claire

On Tue, May 3, 2022 at 9:14 PM Robert Bradshaw <rober...@google.com> wrote:

> On Tue, May 3, 2022 at 10:39 AM Claire McGinty
> <claire.d.mcgi...@gmail.com> wrote:
> >
> > Hi Beam users,
> >
> > I'm looking for input on one of our IOs that we recently migrated to
> > SplittableDoFn. When running in Dataflow we saw performance gains in every
> > aspect (VCPU hours, total memory time) except for total elapsed time: the
> > SplittableDoFn implementation took 1.5x as many minutes as it did
> > previously for about ~900GB of Parquet files.
> >
> > It seems like the issue is that it isn't scaling up as much as the old
> > BoundedSource version. I ran the SplittableDoFn implementation a couple
> > times to be sure, but reliably, it only scaled up to 30%-50% the max number
> > of workers as it used to. Both implementations of this IO have the same
> > base level of "splittability" (Parquet row groups) so I'm not sure what the
> > issue could be.
> >
> > I saw in an older user@ thread, using Dataflow Runner V2 was suggested
> > as a mitigation. I did re-try my job using Dataflow Prime and saw
> > significant improvement; but we're not able to migrate our entire fleet to
> > V2 at this time.
>
> Note that you can pass use_runner_v2 to use Dataflow Runner V2 if
> there are other Prime features that you're not ready for yet. (It
> would be good to understand what issues you're running into as well,
> if you're able to share.)
>
> > Is there any workaround for Dataflow Runner V1 to improve the scale-up
> > for SplittableDoFn sources?
>
> There are architectural constraints with Runner V1 in executing
> SplittableDoFns as well as Runner V2 can do. Upgrading to Runner V2
> really is the best mitigation. But one possible migration might be to
> be over-aggressive in your splitRestriction implementation.
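
To make the question concrete, here is a minimal sketch of one possible reading of "over-aggressive" initial splitting: emitting one restriction per row group up front instead of relying on the runner to split dynamically later. This is an assumption about what was meant, not something confirmed in the thread. The helper name `split_row_groups`, the `groups_per_split` parameter, and the `(start, stop)` tuple representation are hypothetical and are not Beam APIs; in a real SplittableDoFn the same logic would sit in the split-restriction method and yield the SDK's own range type.

```python
from typing import Iterator, Tuple


# Hypothetical helper, not a Beam API. A restriction is modeled as a
# half-open range (start, stop) of row-group indices within one file.
def split_row_groups(
    start: int, stop: int, groups_per_split: int = 1
) -> Iterator[Tuple[int, int]]:
    """Eagerly split [start, stop) into sub-ranges.

    Each sub-range covers at most `groups_per_split` row groups. With the
    default of 1, every row group becomes its own restriction, which is
    the finest split possible when a row group is the smallest unit.
    """
    if groups_per_split < 1:
        raise ValueError("groups_per_split must be >= 1")
    current = start
    while current < stop:
        end = min(current + groups_per_split, stop)
        yield (current, end)
        current = end


if __name__ == "__main__":
    # A file with 5 row groups, split as finely as possible.
    for restriction in split_row_groups(0, 5):
        print(restriction)
```

If the initial split already produces one restriction per row group, as the sketch does with `groups_per_split=1`, there is no finer split left to make, which is the concern raised at the top of this message.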