Reshuffle is perfectly fine to use if the goal is just to redistribute work. It's only deprecated as a "checkpointing" mechanism.
On Fri, Jan 19, 2024 at 9:44 AM Danny McCormick via user <user@beam.apache.org> wrote:
>
> For runners that support Reshuffle, it should be safe to use. Its been
> "deprecated" for 7 years, but is still heavily used/often the recommended way
> to do things like this. I actually just added a PR to undeprecate it earlier
> today. Looks like you're using Dataflow, which also has always supported
> ReShuffle.
>
> > Also I looked at the code, reshuffle seems doing some groupby work
> > internally. But I don't really need groupby
>
> Groupby is basically an implementation detail that creates the desired
> shuffling behavior in many runners (runners can also override transform
> implementations if needed for some primitives like this, but that's another
> can of worms). Basically, in order to prevent fusion you need some operation
> that does this and GroupBy is one option.
>
> Given that you're using DataFlow, I'd also recommend checking out
> https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion
> which describes how to do this in more detail.
>
> Thanks,
> Danny
>
> On Fri, Jan 19, 2024 at 12:36 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>
>> Also I looked at the code, reshuffle seems doing some groupby work
>> internally. But I don't really need groupby
>>
>> On Fri, Jan 19, 2024 at 9:35 AM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>>
>>> ReShuffle is deprecated
>>>
>>> On Fri, Jan 19, 2024 at 8:25 AM XQ Hu via user <user@beam.apache.org> wrote:
>>>>
>>>> I do not think it enforces a reshuffle by just checking the doc here:
>>>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys
>>>>
>>>> Have you tried to just add ReShuffle after PubsubLiteIO?
>>>>
>>>> On Thu, Jan 18, 2024 at 8:54 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>>>>
>>>>> Hey guys,
>>>>>
>>>>> I have a question, does withkeys transformation enforce a reshuffle?
>>>>>
>>>>> My pipeline basically look like this PubsubLiteIO -> ParDo(..) -> ParDo()
>>>>> -> BigqueryIO.write()
>>>>>
>>>>> The problem is PubsubLiteIO -> ParDo(..) -> ParDo() always fused
>>>>> together. But The ParDo is expensive and I want dataflow to have more
>>>>> workers to work on that, what's the best way to do that?
>>>>>
>>>>> Regards,
>>>>>