Re: [DISCUSS] FLIP-136: Improve interoperability between DataStream and Table API

Jark Wu Tue, 22 Sep 2020 23:55:52 -0700

Hi Timo,

Sorry for the late reply.
I think it would be great if we can make `withNamesAndPositions` internal
visible. This reduces the complexity of the public API.
It's hard to come up with a perfect solution. So let's move on this FLIP.
I don't have other concerns.


Best,
Jark

On Fri, 18 Sep 2020 at 22:14, Timo Walther <twal...@apache.org> wrote:

> Hi Jark,
>
> the fieldNames map is not intended for users. I would also be fine to
> make it a default scope constructor and access it with some internal
> utility class next to the Row class. The fieldNames map must only be
> used by serializers and converters. A user has no benefit in using it.
>
> For the creation of new rows (without reusing, which only advanced users
> usually do), I don't see a benefit of having:
>
> final Row reuse = new Row(Arrays.asList("myField", "myOtherField"))
> reuse.setField("myField", 12);
> reuse.setField("myOtherField", "This is a test");
>
> The purpose of Row.withName() is too create a Row easily and readable
> without declaring 50+ column names or dealing with indices in this range.
>
> Personally, I would like to make Row an interface and have concrete row
> implementations for different purposes but this would break existing
> programs too much.
>
> What do you think?
>
> Regards,
> Timo
>
>
> On 18.09.20 11:04, Jark Wu wrote:
> > Personally I think the fieldNames Map is confusing and not handy.
> > I just have an idea but not sure what you think.
> > What about adding a new constructor with List field names, this enables
> all
> > name-based setter/getters.
> > Regarding to List -> Map cost for every record, we can suggest users to
> > reuse the Row in the task.
> >
> > new Row(int arity)
> > new Row(List<String> fieldNames)
> >
> > final Row reuse = new Row(Arrays.asList("myField", "myOtherField"))
> > reuse.setField("myField", 12);
> > reuse.setField("myOtherField", "This is a test");
> >
> > My point is that, if we can have a handy constructor for named Row, we
> may
> > not need to distinguish the named-only or positionAndNamed mode.
> > This can avoid (fast-fail) the potential problem when setting an invalid
> > field.
> >
> > We can also come up with a new class for the field names which will
> > construct the Map and be shared among all Row instances.
> >
> > What do you think?
> >
> > Best,
> > Jark
> >
> > On Thu, 17 Sep 2020 at 16:48, Timo Walther <twal...@apache.org> wrote:
> >
> >> Hi everyone,
> >>
> >> thanks for all the feedback. I updated the FLIP again on Thursday to
> >> integrate the feedback I got from Jingsong and Jark offline. In
> >> particular I updated the `Improve dealing with Row in DataStream API`
> >> section another time. We introduced static methods for Row that should
> >> make the semantics clear to users:
> >>
> >> // allows to use index-based setters and getters (equivalent to new
> >> Row(int))
> >> // method exists for completeness
> >> public static withPositions(int length);
> >>
> >> // allows to use name-based setters and getters
> >> public static withNames();
> >>
> >> // allows to use both name-based and position-based setters and getters
> >> public static withNamesAndPositions(Map<String, Integer> fieldNames);
> >>
> >> In any case, non of the existing methods will be deprecated and only
> >> additional functionality will be available through the methods above.
> >>
> >> I started a voting thread on Friday. Please feel free to vote.
> >>
> >> Regards,
> >> Timo
> >>
> >> On 10.09.20 10:21, Danny Chan wrote:
> >>> Thanks for driving this Timo, +1 for voting ~
> >>>
> >>> Best,
> >>> Danny Chan
> >>> 在 2020年9月10日 +0800 PM3:54，Timo Walther <twal...@apache.org>，写道：
> >>>> Thanks everyone for this healthy discussion. I updated the FLIP with
> the
> >>>> outcome. I think the result is one of the last core API refactoring
> and
> >>>> users will be happy to have a consistent changelog support. Thanks for
> >>>> all the contributions.
> >>>>
> >>>> If there are no objections, I would continue with a voting.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Regards,
> >>>> Timo
> >>>>
> >>>>
> >>>> On 09.09.20 14:31, Danny Chan wrote:
> >>>>> Thanks, i'm fine with that.
> >>>>>
> >>>>>
> >>>>> Timo Walther <twal...@apache.org> 于2020年9月9日周三 下午7:02写道：
> >>>>>
> >>>>>> I agree with Jark. It reduces confusion.
> >>>>>>
> >>>>>> The DataStream API doesn't know changelog processing at all. A
> >>>>>> DataStream of Row can be used with both `fromDataStream` and
> >>>>>> `fromChangelogStream`. But only the latter API will interpret it as
> a
> >>>>>> changelog something.
> >>>>>>
> >>>>>> And as I mentioned before, the `toChangelogStream` must work with
> Row
> >>>>>> otherwise users are confused due to duplicate records with a missing
> >>>>>> changeflag.
> >>>>>>
> >>>>>> I will update the FLIP-136 a last time. I hope we can then continue
> >> to a
> >>>>>> vote.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Timo
> >>>>>>
> >>>>>>
> >>>>>> On 09.09.20 10:50, Danny Chan wrote:
> >>>>>>> I think it would bring in much confusion by a different API name
> just
> >>>>>> because the DataStream generic type is different.
> >>>>>>> If there are ChangelogMode that only works for Row, can we have a
> >> type
> >>>>>> check there ?
> >>>>>>>
> >>>>>>> Switch to a new API name does not really solve the problem well,
> >> people
> >>>>>> still need to declare the ChangelogMode explicitly, and there are
> some
> >>>>>> confusions:
> >>>>>>>
> >>>>>>> • Should DataStream of Row type always use #fromChangelogStream ?
> >>>>>>> • Does fromChangelogStream works for only INSERT ChangelogMode ?
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Danny Chan
> >>>>>>> 在 2020年9月9日 +0800 PM4:21，Timo Walther <twal...@apache.org>，写道：
> >>>>>>>> I had this in the inital design, but Jark had concerns at least
> for
> >> the
> >>>>>>>> `toChangelogStream(ChangelogMode)` (see earlier discussion).
> >>>>>>>>
> >>>>>>>> `fromDataStream(dataStream, schema, changelogMode)` would be
> >> possible.
> >>>>>>>>
> >>>>>>>> But in this case I would vote for a symmetric API. If we keep
> >>>>>>>> toChangelogStream we should also have a fromChangelogStream.
> >>>>>>>>
> >>>>>>>> And if we unify `toChangelogStream` and `toDataStream`,
> retractions
> >>>>>>>> cannot be represented for non-Rows and users will experience
> >> duplicate
> >>>>>>>> records with a missing changeflag.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 09.09.20 09:31, Danny Chan wrote:
> >>>>>>>>> “But I think the planner needs to
> >>>>>>>>> know whether the input is insert-only or not.”
> >>>>>>>>>
> >>>>>>>>> Does fromDataStream(dataStream, schema, changelogMode)
> >>>>>>>>>
> >>>>>>>>> solve your concerns ? People can pass around whatever
> ChangelogMode
> >>>>>> they like as an optional param.
> >>>>>>>>> By default: fromDataStream(dataStream, schema), the ChangelogMode
> >> is
> >>>>>> INSERT.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Danny Chan
> >>>>>>>>> 在 2020年9月9日 +0800 PM2:53，dev@flink.apache.org，写道：
> >>>>>>>>>>
> >>>>>>>>>> But I think the planner needs to
> >>>>>>>>>> know whether the input is insert-only or not.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>
>

Re: [DISCUSS] FLIP-136: Improve interoperability between DataStream and Table API

Reply via email to