Re: Evolving a Coder for an added field

Robert Bradshaw Tue, 06 Nov 2018 01:39:01 -0800
Yes, a Coder author should be able to register a URN with a mapping
from (components + payload) -> Coder (and vice versa), and this should
be more lightweight than manually editing the proto files.
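
A minimal sketch of what such a registration could look like, assuming a plain in-memory registry. Nothing here is an existing Beam API: `CoderUrnRegistry`, `SimpleCoder`, `Factory`, `BasicCoder` and the `example:coder:metadata:v1`/`v2` URNs are hypothetical names chosen for illustration. The factory covers the (components + payload) -> Coder direction, and the accessors on the coder cover the reverse.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not a Beam API: a registry keyed by versioned coder URNs. */
class CoderUrnRegistry {

  /** Stand-in for a coder. Exposing urn/components/payload gives the Coder -> URN direction. */
  public interface SimpleCoder {
    String urn();
    List<SimpleCoder> components();
    byte[] payload();
  }

  /** The URN -> Coder direction: builds a coder from component coders and an opaque payload. */
  public interface Factory {
    SimpleCoder create(List<SimpleCoder> components, byte[] payload);
  }

  /** Minimal immutable coder description used for the demo. */
  public static final class BasicCoder implements SimpleCoder {
    private final String urn;
    private final List<SimpleCoder> components;
    private final byte[] payload;

    public BasicCoder(String urn, List<SimpleCoder> components, byte[] payload) {
      this.urn = urn;
      this.components = Collections.unmodifiableList(components);
      this.payload = payload.clone();
    }

    @Override public String urn() { return urn; }
    @Override public List<SimpleCoder> components() { return components; }
    @Override public byte[] payload() { return payload.clone(); }
  }

  private final Map<String, Factory> factories = new HashMap<>();

  /** Registers a factory; a URN is never rebound, so each version stays immutable. */
  public void register(String urn, Factory factory) {
    if (factories.putIfAbsent(urn, factory) != null) {
      throw new IllegalArgumentException("URN already registered: " + urn);
    }
  }

  /** Resolves a URN, rejecting unknown ones so a caller can refuse the pipeline. */
  public SimpleCoder fromUrn(String urn, List<SimpleCoder> components, byte[] payload) {
    Factory factory = factories.get(urn);
    if (factory == null) {
      throw new IllegalArgumentException("Unknown coder URN: " + urn);
    }
    return factory.create(components, payload);
  }

  public static void main(String[] args) {
    CoderUrnRegistry registry = new CoderUrnRegistry();
    // Made-up URNs: one per coder version, e.g. Metadata without and with lastModifiedDate.
    for (String urn : new String[] {"example:coder:metadata:v1", "example:coder:metadata:v2"}) {
      registry.register(urn, (components, payload) -> new BasicCoder(urn, components, payload));
    }
    SimpleCoder coder =
        registry.fromUrn("example:coder:metadata:v2", Collections.emptyList(), new byte[0]);
    System.out.println(coder.urn());
  }
}
```

Refusing to rebind a URN is a deliberate choice in this sketch: a changed encoding gets a new URN rather than silently replacing the old one, which is what would let a runner detect the change and adapt or reject.
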
On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <[email protected]> wrote:
>
> +1
>
> I think that coders should be immutable/versioned. The SDK should know about 
> all the available versions and be able to associate the data (stream or at 
> rest) with the corresponding coder version via URN. We can also look at how
> that is solved elsewhere, for example the Kafka schema registry.
>
> Today we only have a few URNs for standard coders: 
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
>
> I imagine we will need a coder registry where IOs and users can add their 
> versioned coders also?
>
> Thanks,
> Thomas
>
>
> On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> It makes sense to have a more concrete URN including the version.
>>
>> Good idea Robert.
>>
>> Regards
>> JB
>>
>> On 05/11/2018 16:52, Robert Bradshaw wrote:
>> > I think we'll want to allow upgrades across SDK versions. A runner
>> > should be able to recognize when a coder (or any other aspect of the
>> > pipeline) has changed and adapt/reject accordingly. (Until we remove
>> > coders from sources/sinks, there's also possibly the expectation that
>> > one should be able to read data from a source written with that same
>> > coder across versions as well.)
>> >
>> > I think it really comes down to how coders are named. If we decide to
>> > let coders change arbitrarily between versions, probably the URN for
>> > SerializedJavaCoder should have the SDK version number in it. Coders
>> > that are stable across SDKs can have better, more stable URNs defined
>> > and registered.
>> >
>> > I am more OK with changing the registry to infer different coders as
>> > the SDK evolves (which would be detected and manually overwritten with
>> > the old ones, on a case-by-case basis, if they still exist). This
>> > should still be done with caution as it will make upgrading harder.
>> > Highly composite, experimental coders should possibly be designed in
>> > an intrinsically extensible way.
>> >
>> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <[email protected]> 
>> > wrote:
>> >>
>> >> That's really a pita. It's an important and impactful change.
>> >>
>> >> I would go with 1.
>> >>
>> >> For LTS, as already said, I would create an LTS branch and only cherry
>> >> pick some changes. Using master as LTS release branch won't work IMHO.
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
>> >>> For some extra context, this change touches more than FileIO. In
>> >>> reality it will affect updates in any file-based pipeline, because
>> >>> the metadata on each file will now have an extra field for the
>> >>> lastModifiedDate.
>> >>>
>> >>> The PR looks perfect; the only issue is the backwards compatibility
>> >>> Coder question. Knowing that Dataflow is probably the only one
>> >>> affected, I would like to know what we can do.
>> >>>
>> >>> [1] Should we merge and the Coder updatability be tied to SDK versions
>> >>> (which makes sense and is probably more aligned with the LTS
>> >>> discussion)?
>> >>> [2] Should we have a MetadataCoderV2? (Does this imply a repeated
>> >>> Metadata object?) In this case, where is the right place to identify
>> >>> and decide what coder to use?
>> >>>
>> >>> Other ideas... ?
>> >>>
>> >>> Last thing, the link that Luke shared does not seem to work (it looks
>> >>> like a googley-friendly URL), so here is the full URL for those
>> >>> interested in the drain/update proposal:
>> >>>
>> >>> [2] 
>> >>> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <[email protected]> wrote:
>> >>>>
>> >>>> I think the idea is that you would use one coder for paths where you 
>> >>>> don't need this information and would have FileIO provide a separate 
>> >>>> path that uses your updated coder.
>> >>>> Existing users would not be impacted, and users of the new FileIO that
>> >>>> depend on this information would not have been able to update their
>> >>>> pipeline in the first place.
>> >>>>
>> >>>> If the feature in FileIO is experimental, we could choose to break it 
>> >>>> for existing users though since I don't know how feasible my suggestion 
>> >>>> above is.
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <[email protected]> wrote:
>> >>>>>
>> >>>>> Lukasz - Thanks for those links. That's very helpful context.
>> >>>>>
>> >>>>> It sounds like there's no explicit user contract about evolving Coder 
>> >>>>> classes in the Java SDK and users might reasonably assume Coders to be 
>> >>>>> stable between SDK versions. Thus, users of the Dataflow or Flink 
>> >>>>> runners might reasonably expect that they can update the Java SDK 
>> >>>>> version used in their pipeline when performing an update.
>> >>>>>
>> >>>>> Based on that understanding, evolving a class like Metadata might not
>> >>>>> be possible except in a major version bump where it's obvious to users 
>> >>>>> to expect breaking changes and not to expect an "update" operation to 
>> >>>>> work.
>> >>>>>
>> >>>>> It's not clear to me what changing the "name" of a coder would look 
>> >>>>> like or whether that's a tenable solution here. Would that change be 
>> >>>>> able to happen within the SDK itself, or is it something users would 
>> >>>>> need to specify?
>> >>
>> >> --
>> >> Jean-Baptiste Onofré
>> >> [email protected]
>> >> http://blog.nanthrax.net
>> >> Talend - http://www.talend.com
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com