Hi all, I think we are getting closer to a consensus. I think most of us already agree that the current behavior is broken. The remaining difference I see is that I think those problems are caused by the design of the split/select method. The current contract of the split method is that it is actually applied to the previous operation, rather than it creates a new operator. (I don't think this is an implementation detail. This is the contract of the API. It was described in docs, conference talks, workshops etc.).
I agree that we could reimplement the split with side outputs in a way that it would add additional operator in a chain and emit results via side outputs. This would change the core concepts of the split method though, making the "fixed" split method a completely new one with a new behavior. Therefore I am in favor of dropping the old method and introducing a new one. This would be an explicit information for the users that this is something different, so that they can make a conscious choice. Moreover if we reimplement the split method in a way that it introduces additional operator and emits results via side outputs, this is basically equivalent to enabling side outputs in a flatMap method. You can think of the split method as a flatMap method. I see no benefit of having additional split method that would have a limited functionality of such flatMap with side outputs. So to sum up my stance: 1. I am against changing the behavior of current split/select methods. 2. I would be ok with introducing a similar, but new methods, with a new behavior (but would prefer not to do that, we could achieve the same or even more with existing APIs) 3. I would be in favor of enabling side outputs for flatMap method (if it is possible) Best, Dawid On 18/06/2019 03:45, Dian Fu wrote: > Hi all, > > Thanks a lot for the discussion. I'm also in favor of rewriting/redesigning > the split/select API instead of removing them. It has been a consensus that > the side output API can achieve all the functionalities of the split/select > API. The problem is whether we should also support some easy-to-use APIs on > top of it. IMO, we should do that as long as the APIs have clear semantic and > wide usage scenario. I think split/select API is such a kind of API. > > Regards, > Dian > >> 在 2019年6月18日,上午12:30,xingc...@gmail.com 写道: >> >> Hi all, >> >> Thanks for sharing your thoughts on this topic. >> >> First, we must admit that the current implementation for split/select is >> flawed. I roughly went through the source codes, the problem may be that for >> consecutive select/split(s), the former one will be overridden by the later >> one during StreamGraph generation phase. That's why we forbid this >> consecutive logic in FLINK-11084. >> >> Now the question is whether we should guide users to migrate to the new side >> output feature or thoroughly rework the broken API with the correct >> semantics (instead of just trying to forbid all the "invalid" usages). >> >> Personally, I prefer the later solution because >> >> 1. The split/select may have been widely used without touching the broken >> part. >> 2. Though restricted compared with side output, the semantics for >> split/select itself is acceptable since union does not support different >> data types either. >> 3. We need a complete and easy-to-use transformation set for DataStream API. >> Enabling side output for flatMap may not be an ultimate solution. >> >> To summarize, maybe we should not easily deprecate the split/select public >> API. If we come to a consensus on that, how about rewriting it based on side >> output? (like the implementation for join on coGroup) >> >> Any feedback is welcome : ) >> >> Best, >> Xingcan >> >> -----Original Message----- >> From: SHI Xiaogang <shixiaoga...@gmail.com> >> Sent: Monday, June 17, 2019 8:08 AM >> To: Dawid Wysakowicz <dwysakow...@apache.org> >> Cc: dev@flink.apache.org >> Subject: Re: About Deprecating split/select for DataStream API >> >> Hi Dawid, >> >> Thanks a lot for your example. >> >> I think most users will expect splitted1 to be empty in the example. >> >> The unexpected results produced, in my opinion, is due to our problematic >> implementation, instead of the confusing semantics. >> We can fix the problem if we add a SELECT operator to filter out unexpected >> records (Of course, we can find some optimization to improve the >> efficiency.). >> >> After all, i prefer to fix the problems to make the results as expected. >> What do you think? >> >> Regards, >> Xiaogang >> >> Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午7:21写道: >> >>> Yes you are correct. The problem I described applies to the split not >>> select as I wrote in the first email. Sorry for that. >>> >>> I will try to prepare a correct example. Let's have a look at this example: >>> >>> val splitted1 = ds.split(if (1) then "a") >>> >>> val splitted2 = ds.split(if (!=1) then "a") >>> >>> In those cases splitted1.select("a") -> will output all elements, the >>> same for splitted2, because the OutputSelector(s) are applied to >>> previous operator. The behavior I would assume is that splitted1 >>> outputs only "1"s, whereas splitted2 all but "1"s >>> >>> On the other hand in a call >>> >>> val splitted1 = ds.split(if ("1" or "2") then >>> "a").select("a").split(if ("3") then "b").select("b") >>> >>> I would assume an intersection of those two splits, so no results. >>> What actually happens is that it will be "1", "2" & "3"s. Actually, >>> right exceptions should be thrown in those cases not to produce >>> confusing results, but this just shows that this API is broken, if we >>> need to check for some prohibited configurations during runtime. >>> >>> Those weird behaviors are in my opinion results of the flawed API, as >>> it actually assigns an output selector to the previous operator. In >>> other words it modifies previous operator. I think it would be much >>> cleaner if this happened inside an operator rather than separately. >>> This is what SideOutputs do, as you define them inside the >>> ProcessFunction, rather than afterwards. Therefore I am very much in >>> favor of using them for those cases. Once again if the problem is that >>> they are available only in the ProcessFunction I would prefer enabling >>> them e.g. in FlatMap, rather than keeping the split/select. >>> >>> >>> >>> On 17/06/2019 09:40, SHI Xiaogang wrote: >>>> Hi Dawid, >>>> >>>> As the select method is only allowed on SplitStreams, it's >>>> impossible to construct the example ds.split().select("a", >>>> "b").select("c", "d"). >>>> >>>> Are you meaning ds.split().select("a", "b").split().select("c", "d")? >>>> If so, then the tagging in the first split operation should not >>>> affect >>> the >>>> second one. Then >>>> splitted.select("a", "b") => empty >>>> splitted.select("c", "d") => ds >>>> >>>> I cannot quite catch your point here. It's appreciated if you can >>> provide a >>>> more concrete explanation? >>>> >>>> Regards, >>>> Xiaogang Shi >>>> >>>> >>>> >>>> >>>> Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午3:10写道: >>>> >>>>> Hi all, >>>>> >>>>> Thank you for starting the discussion. To start with I have to say >>>>> I am not entirely against leaving them. On the other hand I totally >>>>> disagree that the semantics are clearly defined. Actually the >>>>> design is fundamentally flawed. >>>>> >>>>> 1. We use String as a selector for elements. This is not the cleanest >>>>> design, but I agree it is not the worst. >>>>> 2. Users cannot define different types for different splits. >>>>> 3. (The actual reason why I think it's actually better to drop the >>>>> split/select and introduce a better mechanism) The behavior of a >>> split is >>>>> to actually add an output selector. We can have just a single >>> selector on a >>>>> single operator, but the API allows (I would even say >>>>> encourages) to >>> create >>>>> chains of split/select, which leads to undefined behavior. Take >>>>> this >>> for >>>>> example: ds.split().select("a", "b").select("c", "d"). Which >>>>> tags >>> should be >>>>> forwarded? ("a", "b", "c", "d") (union) or () (intersection). In >>>>> my >>> opinion >>>>> the most obvious answer in this case would be the intersection. Let's >>>>> modify it slightly though and I would assume a different >>>>> behavior >>> (the >>>>> union) >>>>> >>>>> splitted = ds.split(); >>>>> >>>>> splitted.select("a", "b").map() >>>>> >>>>> splitted.select("c", "d").map() >>>>> >>>>> Taking the 3rd argument into consideration I would be in favor of >>> removing >>>>> the current mechanism. I think the side outputs serve the purpose >>>>> much better with much cleaner semantics. I get the argument that >>>>> users are >>> now >>>>> forced to use processFunction if they want to use the side outputs. >>>>> If >>> this >>>>> is the main problem how about enabling them e.g. for flatMap as well? >>>>> >>>>> Best, >>>>> >>>>> Dawid >>>>> On 17/06/2019 08:51, Jark Wu wrote: >>>>> >>>>> +1 to keep the split/select API. I think if there are some problems >>>>> +with >>>>> the API, it's better to fix them instead of deprecating them. >>>>> And select/split are straightforward and convenient APIs. It's >>>>> worth to have them. >>>>> >>>>> Regards, >>>>> Jark >>>>> >>>>> On Mon, 17 Jun 2019 at 14:46, vino yang <yanghua1...@gmail.com> < >>> yanghua1...@gmail.com> wrote: >>>>> >>>>> Hi, >>>>> >>>>> I also think it is valuable and reasonable to keep the split/select >>> APIs. >>>>> They are very convenient and widely used in our platform. I think >>>>> they >>> are >>>>> also used in other users' jobs. >>>>> If the community has doubts about this, IMHO, it would be better to >>> start a >>>>> user survey. >>>>> >>>>> Best, >>>>> Vino >>>>> >>>>> SHI Xiaogang <shixiaoga...@gmail.com> <shixiaoga...@gmail.com> >>> 于2019年6月17日周一 上午11:55写道: >>>>> >>>>> Hi Xingcan, >>>>> >>>>> Thanks for bringing it up for discusson. >>>>> >>>>> I agree with you that we should not deprecate the split/select methods. >>>>> Their semantics are very clear and they are widely adopted by Flink >>>>> >>>>> users. >>>>> >>>>> We should fix these problems instead of simply deprecating the methods. >>>>> >>>>> Regards, >>>>> Xiaogang >>>>> >>>>> Xingcan Cui <xingc...@gmail.com> <xingc...@gmail.com> 于2019年6月15日周六 >>> 下午4:13写道: >>>>> >>>>> Hi all, >>>>> >>>>> Recently, I noticed that the split/select methods in DataStream API >>>>> >>>>> have >>>>> >>>>> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA >>>>> issue >>>>> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> < >>> https://issues.apache.org/jira/browse/FLINK-11084>). >>>>> Although the two methods can be replaced by the more powerful side >>>>> >>>>> output >>>>> >>>>> feature[1], I still doubt whether we should really remove them in >>>>> the future. >>>>> >>>>> 1. From semantics, the split/select is the reverse operation to the >>>>> >>>>> union >>>>> >>>>> transformation. Without them, the DataStream API seems to be >>>>> missing a piece. >>>>> >>>>> 2. From accessibility, the side output only works for process >>>>> >>>>> functions, >>>>> >>>>> which means it forces the user to dive into a lower API. >>>>> >>>>> According to FLINK-11084 < >>>>> https://issues.apache.org/jira/browse/FLINK-11084> < >>> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some >>>>> problems with the current implementation of the two methods. Maybe >>>>> we should fix the problems and re-active them again. Or if they >>>>> really >>>>> >>>>> need >>>>> >>>>> to >>>>> >>>>> be deprecated, we should at least mark the corresponding >>>>> documentation >>>>> >>>>> for >>>>> >>>>> that : ) >>>>> >>>>> What do you think? >>>>> >>>>> Best, >>>>> Xingcan >>>>> >>>>> [1] >>>>> >>>>> >>>>> >>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side >>> _output.html >>>>> < >>>>> >>>>> >>>>> >>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side >>> _output.html >>>>> >>>
signature.asc
Description: OpenPGP digital signature