Hi all, Thanks a lot for the discussion. I'm also in favor of rewriting/redesigning the split/select API instead of removing them. It has been a consensus that the side output API can achieve all the functionalities of the split/select API. The problem is whether we should also support some easy-to-use APIs on top of it. IMO, we should do that as long as the APIs have clear semantic and wide usage scenario. I think split/select API is such a kind of API.
Regards, Dian > 在 2019年6月18日,上午12:30,xingc...@gmail.com 写道: > > Hi all, > > Thanks for sharing your thoughts on this topic. > > First, we must admit that the current implementation for split/select is > flawed. I roughly went through the source codes, the problem may be that for > consecutive select/split(s), the former one will be overridden by the later > one during StreamGraph generation phase. That's why we forbid this > consecutive logic in FLINK-11084. > > Now the question is whether we should guide users to migrate to the new side > output feature or thoroughly rework the broken API with the correct semantics > (instead of just trying to forbid all the "invalid" usages). > > Personally, I prefer the later solution because > > 1. The split/select may have been widely used without touching the broken > part. > 2. Though restricted compared with side output, the semantics for > split/select itself is acceptable since union does not support different data > types either. > 3. We need a complete and easy-to-use transformation set for DataStream API. > Enabling side output for flatMap may not be an ultimate solution. > > To summarize, maybe we should not easily deprecate the split/select public > API. If we come to a consensus on that, how about rewriting it based on side > output? (like the implementation for join on coGroup) > > Any feedback is welcome : ) > > Best, > Xingcan > > -----Original Message----- > From: SHI Xiaogang <shixiaoga...@gmail.com> > Sent: Monday, June 17, 2019 8:08 AM > To: Dawid Wysakowicz <dwysakow...@apache.org> > Cc: dev@flink.apache.org > Subject: Re: About Deprecating split/select for DataStream API > > Hi Dawid, > > Thanks a lot for your example. > > I think most users will expect splitted1 to be empty in the example. > > The unexpected results produced, in my opinion, is due to our problematic > implementation, instead of the confusing semantics. > We can fix the problem if we add a SELECT operator to filter out unexpected > records (Of course, we can find some optimization to improve the efficiency.). > > After all, i prefer to fix the problems to make the results as expected. > What do you think? > > Regards, > Xiaogang > > Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午7:21写道: > >> Yes you are correct. The problem I described applies to the split not >> select as I wrote in the first email. Sorry for that. >> >> I will try to prepare a correct example. Let's have a look at this example: >> >> val splitted1 = ds.split(if (1) then "a") >> >> val splitted2 = ds.split(if (!=1) then "a") >> >> In those cases splitted1.select("a") -> will output all elements, the >> same for splitted2, because the OutputSelector(s) are applied to >> previous operator. The behavior I would assume is that splitted1 >> outputs only "1"s, whereas splitted2 all but "1"s >> >> On the other hand in a call >> >> val splitted1 = ds.split(if ("1" or "2") then >> "a").select("a").split(if ("3") then "b").select("b") >> >> I would assume an intersection of those two splits, so no results. >> What actually happens is that it will be "1", "2" & "3"s. Actually, >> right exceptions should be thrown in those cases not to produce >> confusing results, but this just shows that this API is broken, if we >> need to check for some prohibited configurations during runtime. >> >> Those weird behaviors are in my opinion results of the flawed API, as >> it actually assigns an output selector to the previous operator. In >> other words it modifies previous operator. I think it would be much >> cleaner if this happened inside an operator rather than separately. >> This is what SideOutputs do, as you define them inside the >> ProcessFunction, rather than afterwards. Therefore I am very much in >> favor of using them for those cases. Once again if the problem is that >> they are available only in the ProcessFunction I would prefer enabling >> them e.g. in FlatMap, rather than keeping the split/select. >> >> >> >> On 17/06/2019 09:40, SHI Xiaogang wrote: >>> Hi Dawid, >>> >>> As the select method is only allowed on SplitStreams, it's >>> impossible to construct the example ds.split().select("a", "b").select("c", >>> "d"). >>> >>> Are you meaning ds.split().select("a", "b").split().select("c", "d")? >>> If so, then the tagging in the first split operation should not >>> affect >> the >>> second one. Then >>> splitted.select("a", "b") => empty >>> splitted.select("c", "d") => ds >>> >>> I cannot quite catch your point here. It's appreciated if you can >> provide a >>> more concrete explanation? >>> >>> Regards, >>> Xiaogang Shi >>> >>> >>> >>> >>> Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午3:10写道: >>> >>>> Hi all, >>>> >>>> Thank you for starting the discussion. To start with I have to say >>>> I am not entirely against leaving them. On the other hand I totally >>>> disagree that the semantics are clearly defined. Actually the >>>> design is fundamentally flawed. >>>> >>>> 1. We use String as a selector for elements. This is not the cleanest >>>> design, but I agree it is not the worst. >>>> 2. Users cannot define different types for different splits. >>>> 3. (The actual reason why I think it's actually better to drop the >>>> split/select and introduce a better mechanism) The behavior of a >> split is >>>> to actually add an output selector. We can have just a single >> selector on a >>>> single operator, but the API allows (I would even say >>>> encourages) to >> create >>>> chains of split/select, which leads to undefined behavior. Take >>>> this >> for >>>> example: ds.split().select("a", "b").select("c", "d"). Which >>>> tags >> should be >>>> forwarded? ("a", "b", "c", "d") (union) or () (intersection). In >>>> my >> opinion >>>> the most obvious answer in this case would be the intersection. Let's >>>> modify it slightly though and I would assume a different >>>> behavior >> (the >>>> union) >>>> >>>> splitted = ds.split(); >>>> >>>> splitted.select("a", "b").map() >>>> >>>> splitted.select("c", "d").map() >>>> >>>> Taking the 3rd argument into consideration I would be in favor of >> removing >>>> the current mechanism. I think the side outputs serve the purpose >>>> much better with much cleaner semantics. I get the argument that >>>> users are >> now >>>> forced to use processFunction if they want to use the side outputs. >>>> If >> this >>>> is the main problem how about enabling them e.g. for flatMap as well? >>>> >>>> Best, >>>> >>>> Dawid >>>> On 17/06/2019 08:51, Jark Wu wrote: >>>> >>>> +1 to keep the split/select API. I think if there are some problems >>>> +with >>>> the API, it's better to fix them instead of deprecating them. >>>> And select/split are straightforward and convenient APIs. It's >>>> worth to have them. >>>> >>>> Regards, >>>> Jark >>>> >>>> On Mon, 17 Jun 2019 at 14:46, vino yang <yanghua1...@gmail.com> < >> yanghua1...@gmail.com> wrote: >>>> >>>> >>>> Hi, >>>> >>>> I also think it is valuable and reasonable to keep the split/select >> APIs. >>>> They are very convenient and widely used in our platform. I think >>>> they >> are >>>> also used in other users' jobs. >>>> If the community has doubts about this, IMHO, it would be better to >> start a >>>> user survey. >>>> >>>> Best, >>>> Vino >>>> >>>> SHI Xiaogang <shixiaoga...@gmail.com> <shixiaoga...@gmail.com> >> 于2019年6月17日周一 上午11:55写道: >>>> >>>> >>>> Hi Xingcan, >>>> >>>> Thanks for bringing it up for discusson. >>>> >>>> I agree with you that we should not deprecate the split/select methods. >>>> Their semantics are very clear and they are widely adopted by Flink >>>> >>>> users. >>>> >>>> We should fix these problems instead of simply deprecating the methods. >>>> >>>> Regards, >>>> Xiaogang >>>> >>>> Xingcan Cui <xingc...@gmail.com> <xingc...@gmail.com> 于2019年6月15日周六 >> 下午4:13写道: >>>> >>>> >>>> Hi all, >>>> >>>> Recently, I noticed that the split/select methods in DataStream API >>>> >>>> have >>>> >>>> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA >>>> issue >>>> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> < >> https://issues.apache.org/jira/browse/FLINK-11084>). >>>> >>>> Although the two methods can be replaced by the more powerful side >>>> >>>> output >>>> >>>> feature[1], I still doubt whether we should really remove them in >>>> the future. >>>> >>>> 1. From semantics, the split/select is the reverse operation to the >>>> >>>> union >>>> >>>> transformation. Without them, the DataStream API seems to be >>>> missing a piece. >>>> >>>> 2. From accessibility, the side output only works for process >>>> >>>> functions, >>>> >>>> which means it forces the user to dive into a lower API. >>>> >>>> According to FLINK-11084 < >>>> https://issues.apache.org/jira/browse/FLINK-11084> < >> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some >>>> problems with the current implementation of the two methods. Maybe >>>> we should fix the problems and re-active them again. Or if they >>>> really >>>> >>>> need >>>> >>>> to >>>> >>>> be deprecated, we should at least mark the corresponding >>>> documentation >>>> >>>> for >>>> >>>> that : ) >>>> >>>> What do you think? >>>> >>>> Best, >>>> Xingcan >>>> >>>> [1] >>>> >>>> >>>> >> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side >> _output.html >>>> >>>> < >>>> >>>> >>>> >> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side >> _output.html >>>> >>>> >> >> >