Given that this is a hint, I'm not sure redistribute should be a PTransform as opposed to some other way to hint to a runner.
I'm not sure of what the syntax of that would be, but a semantic no-op transform that the runner may or may not do anything with is odd. On Thu, Oct 5, 2023 at 11:30 AM Kenneth Knowles <k...@apache.org> wrote: > So a high level suggestion from Robert that I want to highlight as a > top-post: > > Instead of focusing on just fixing the SDKs and runners Reshuffle, this > could be an opportunity to introduce Redistribute which was proposed in the > long-ago thread. The semantics are identical but it is more clear that it > *only* is a hint about redistributing data and there is no expectation of > a checkpoint. > > This new name may also be an opportunity to maintain update compatibility > (though this may actually be leaving unsafe code in user's hands) and/or > separate @RequiresStableInput/checkpointing uses of Reshuffle from > redistribution-only uses of Reshuffle. > > Any other thoughts on this one high level bit? > > Kenn > > On Thu, Oct 5, 2023 at 11:15 AM Kenneth Knowles <k...@apache.org> wrote: > >> >> On Wed, Oct 4, 2023 at 7:45 PM Robert Burke <lostl...@apache.org> wrote: >> >>> LGTM. >>> >>> It looks the Go SDK already adheres to these semantics as well for the >>> reference impl(well, reshuffle/redistribute_randomly, _by_key isn't >>> implemented in the Go SDK, and only uses the existing unqualified reshuffle >>> URN [0]. >>> >>> The original strategy, and then for every element, the original Window, >>> TS, and Pane are all serialized, shuffled, and then deserialized downstream. >>> >>> >>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L65 >>> >>> >>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L145 >>> >>> Prism at the moment vaccuously implements reshuffle by omitting the >>> node, and rewriting the inputs and outputs [1], as it's a local runner with >>> single transform per bundle execution, but I was intending to make it a >>> fusion break regardless. Ultimately prism's "test" variant will default to >>> executing the SDKs dictated reference implementation for the composite(s), >>> and any "fast" or "prod" variant would simply do the current implementation. >>> >> >> Very nice! >> >> And of course I should have linked out to the existing reshuffle URN in >> the proto. >> >> Kenn >> >> >> >>> >>> Robert Burke >>> Beam Go Busybody >>> >>> [0]: >>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/graphx/translate.go#L46C3-L46C50 >>> [1]: >>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/handlerunner.go#L82 >>> >>> >>> >>> On 2023/09/26 15:43:53 Kenneth Knowles wrote: >>> > Hi everyone, >>> > >>> > Recently there was a bug [1] caused by discrepancies between two of >>> > Dataflow's reshuffle implementations. I think the reference >>> implementation >>> > in the Java SDK [2] also does not match. This all led to discussion on >>> the >>> > bug and the pull request [3] about what the actual semantics should >>> be. I >>> > got it wrong, maybe multiple times. So I wrote up a very short >>> document to >>> > finish the discussion: >>> > >>> > https://s.apache.org/beam-reshuffle >>> > >>> > This is also probably among the simplest imaginable use of >>> > http://s.apache.org/ptransform-design-doc in case you want to see >>> kind of >>> > how I intended it to be used. >>> > >>> > Kenn >>> > >>> > [1] https://github.com/apache/beam/issues/28219 >>> > [2] >>> > >>> https://github.com/apache/beam/blob/d52b077ad505c8b50f10ec6a4eb83d385cdaf96a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L84 >>> > [3] https://github.com/apache/beam/pull/28272 >>> > >>> >>