So a high level suggestion from Robert that I want to highlight as a
top-post:

Instead of focusing on just fixing the SDKs and runners Reshuffle, this
could be an opportunity to introduce Redistribute which was proposed in the
long-ago thread. The semantics are identical but it is more clear that it
*only* is a hint about redistributing data and there is no expectation of a
checkpoint.

This new name may also be an opportunity to maintain update compatibility
(though this may actually be leaving unsafe code in user's hands) and/or
separate @RequiresStableInput/checkpointing uses of Reshuffle from
redistribution-only uses of Reshuffle.

Any other thoughts on this one high level bit?

Kenn

On Thu, Oct 5, 2023 at 11:15 AM Kenneth Knowles <k...@apache.org> wrote:

>
> On Wed, Oct 4, 2023 at 7:45 PM Robert Burke <lostl...@apache.org> wrote:
>
>> LGTM.
>>
>> It looks the Go SDK already adheres to these semantics as well for the
>> reference impl(well, reshuffle/redistribute_randomly, _by_key isn't
>> implemented in the Go SDK, and only uses the existing unqualified reshuffle
>> URN [0].
>>
>> The original strategy, and then for every element, the original Window,
>> TS, and Pane are all serialized, shuffled, and then deserialized downstream.
>>
>>
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L65
>>
>>
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L145
>>
>> Prism at the moment vaccuously implements reshuffle by omitting the node,
>> and rewriting the inputs and outputs [1], as it's a local runner with
>> single transform per bundle execution, but I was intending to make it a
>> fusion break regardless.  Ultimately prism's "test" variant will default to
>> executing the SDKs dictated reference implementation for the composite(s),
>> and any "fast" or "prod" variant would simply do the current implementation.
>>
>
> Very nice!
>
> And of course I should have linked out to the existing reshuffle URN in
> the proto.
>
> Kenn
>
>
>
>>
>> Robert Burke
>> Beam Go Busybody
>>
>> [0]:
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/graphx/translate.go#L46C3-L46C50
>> [1]:
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/handlerunner.go#L82
>>
>>
>>
>> On 2023/09/26 15:43:53 Kenneth Knowles wrote:
>> > Hi everyone,
>> >
>> > Recently there was a bug [1] caused by discrepancies between two of
>> > Dataflow's reshuffle implementations. I think the reference
>> implementation
>> > in the Java SDK [2] also does not match. This all led to discussion on
>> the
>> > bug and the pull request [3] about what the actual semantics should be.
>> I
>> > got it wrong, maybe multiple times. So I wrote up a very short document
>> to
>> > finish the discussion:
>> >
>> >     https://s.apache.org/beam-reshuffle
>> >
>> > This is also probably among the simplest imaginable use of
>> > http://s.apache.org/ptransform-design-doc in case you want to see kind
>> of
>> > how I intended it to be used.
>> >
>> > Kenn
>> >
>> > [1] https://github.com/apache/beam/issues/28219
>> > [2]
>> >
>> https://github.com/apache/beam/blob/d52b077ad505c8b50f10ec6a4eb83d385cdaf96a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L84
>> > [3] https://github.com/apache/beam/pull/28272
>> >
>>
>

Reply via email to