Reshuffle is perfectly fine to use if the goal is just to redistribute
work. It's only deprecated as a "checkpointing" mechanism.

On Fri, Jan 19, 2024 at 9:44 AM Danny McCormick via user
<user@beam.apache.org> wrote:
>
> For runners that support Reshuffle, it should be safe to use. It's been
> "deprecated" for 7 years, but is still heavily used/often the recommended way
> to do things like this. I actually just added a PR to undeprecate it earlier
> today. Looks like you're using Dataflow, which has also always supported
> Reshuffle.
>
> > Also, I looked at the code, and Reshuffle seems to do some groupby work
> > internally. But I don't really need a groupby.
>
> Groupby is basically an implementation detail that creates the desired
> shuffling behavior in many runners (runners can also override transform
> implementations if needed for some primitives like this, but that's another
> can of worms). Basically, in order to prevent fusion you need some operation
> that forces a shuffle, and GroupBy is one option.
>
> Given that you're using Dataflow, I'd also recommend checking out
> https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion 
> which describes how to do this in more detail.
>
> Thanks,
> Danny
>
> On Fri, Jan 19, 2024 at 12:36 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>
>> Also, I looked at the code, and Reshuffle seems to do some groupby work
>> internally. But I don't really need a groupby.
>>
>> On Fri, Jan 19, 2024 at 9:35 AM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>>
>>> ReShuffle is deprecated
>>>
>>> On Fri, Jan 19, 2024 at 8:25 AM XQ Hu via user <user@beam.apache.org> wrote:
>>>>
>>>> From just checking the doc here, I do not think it enforces a reshuffle:
>>>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys
>>>>
>>>> Have you tried to just add ReShuffle after PubsubLiteIO?
>>>>
>>>> On Thu, Jan 18, 2024 at 8:54 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>>>>
>>>>> Hey guys,
>>>>>
>>>>> I have a question: does the WithKeys transform enforce a reshuffle?
>>>>>
>>>>> My pipeline basically looks like this: PubsubLiteIO -> ParDo(..) -> ParDo()
>>>>> -> BigqueryIO.write()
>>>>>
>>>>> The problem is that PubsubLiteIO -> ParDo(..) -> ParDo() are always fused
>>>>> together. But the ParDo is expensive and I want Dataflow to have more
>>>>> workers working on it. What's the best way to do that?
>>>>>
>>>>> Regards,
>>>>>
