Re: About Deprecating split/select for DataStream API

Dian Fu Mon, 17 Jun 2019 18:46:25 -0700

Hi all,

Thanks a lot for the discussion. I'm also in favor of rewriting/redesigning the 
split/select API instead of removing them. It has been a consensus that the 
side output API can achieve all the functionalities of the split/select API. 
The problem is whether we should also support some easy-to-use APIs on top of 
it. IMO, we should do that as long as the APIs have clear semantic and wide 
usage scenario. I think split/select API is such a kind of API.


Regards,
Dian

> 在 2019年6月18日，上午12:30，xingc...@gmail.com 写道：
> 
> Hi all,
> 
> Thanks for sharing your thoughts on this topic.
> 
> First, we must admit that the current implementation for split/select is 
> flawed. I roughly went through the source codes, the problem may be that for 
> consecutive select/split(s), the former one will be overridden by the later 
> one during StreamGraph generation phase. That's why we forbid this 
> consecutive logic in FLINK-11084.
> 
> Now the question is whether we should guide users to migrate to the new side 
> output feature or thoroughly rework the broken API with the correct semantics 
> (instead of just trying to forbid all the "invalid" usages). 
> 
> Personally, I prefer the later solution because
> 
> 1. The split/select may have been widely used without touching the broken 
> part.
> 2. Though restricted compared with side output, the semantics for 
> split/select itself is acceptable since union does not support different data 
> types either.
> 3. We need a complete and easy-to-use transformation set for DataStream API. 
> Enabling side output for flatMap may not be an ultimate solution.
> 
> To summarize, maybe we should not easily deprecate the split/select public 
> API. If we come to a consensus on that, how about rewriting it based on side 
> output? (like the implementation for join on coGroup)
> 
> Any feedback is welcome : )
> 
> Best,
> Xingcan
> 
> -----Original Message-----
> From: SHI Xiaogang <shixiaoga...@gmail.com> 
> Sent: Monday, June 17, 2019 8:08 AM
> To: Dawid Wysakowicz <dwysakow...@apache.org>
> Cc: dev@flink.apache.org
> Subject: Re: About Deprecating split/select for DataStream API
> 
> Hi Dawid,
> 
> Thanks a lot for your example.
> 
> I think most users will expect splitted1 to be empty in the example.
> 
> The unexpected results produced, in my opinion, is due to our problematic 
> implementation, instead of the confusing semantics.
> We can fix the problem if we add a SELECT operator to filter out unexpected 
> records (Of course, we can find some optimization to improve the efficiency.).
> 
> After all, i prefer to fix the problems to make the results as expected.
> What do you think?
> 
> Regards,
> Xiaogang
> 
> Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午7:21写道：
> 
>> Yes you are correct. The problem I described applies to the split not 
>> select as I wrote in the first email. Sorry for that.
>> 
>> I will try to prepare a correct example. Let's have a look at this example:
>> 
>>    val splitted1 = ds.split(if (1) then "a")
>> 
>>    val splitted2 = ds.split(if (!=1) then "a")
>> 
>> In those cases splitted1.select("a") -> will output all elements, the 
>> same for splitted2, because the OutputSelector(s) are applied to 
>> previous operator. The behavior I would assume is that splitted1 
>> outputs only "1"s, whereas splitted2 all but "1"s
>> 
>> On the other hand in a call
>> 
>>    val splitted1 = ds.split(if ("1" or "2") then 
>> "a").select("a").split(if ("3") then "b").select("b")
>> 
>> I would assume an intersection of those two splits, so no results. 
>> What actually happens is that it will be "1", "2" & "3"s. Actually, 
>> right exceptions should be thrown in those cases not to produce 
>> confusing results, but this just shows that this API is broken, if we 
>> need to check for some prohibited configurations during runtime.
>> 
>> Those weird behaviors are in my opinion results of the flawed API, as 
>> it actually assigns an output selector to the previous operator. In 
>> other words it modifies previous operator. I think it would be much 
>> cleaner if this happened inside an operator rather than separately. 
>> This is what SideOutputs do, as you define them inside the 
>> ProcessFunction, rather than afterwards. Therefore I am very much in 
>> favor of using them for those cases. Once again if the problem is that 
>> they are available only in the ProcessFunction I would prefer enabling 
>> them e.g. in FlatMap, rather than keeping the split/select.
>> 
>> 
>> 
>> On 17/06/2019 09:40, SHI Xiaogang wrote:
>>> Hi Dawid,
>>> 
>>> As the select method is only allowed on SplitStreams, it's 
>>> impossible to construct the example ds.split().select("a", "b").select("c", 
>>> "d").
>>> 
>>> Are you meaning ds.split().select("a", "b").split().select("c", "d")?
>>> If so, then the tagging in the first split operation should not 
>>> affect
>> the
>>> second one. Then
>>>    splitted.select("a", "b") => empty
>>>    splitted.select("c", "d") => ds
>>> 
>>> I cannot quite catch your point here. It's appreciated if you can
>> provide a
>>> more concrete explanation?
>>> 
>>> Regards,
>>> Xiaogang Shi
>>> 
>>> 
>>> 
>>> 
>>> Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午3:10写道：
>>> 
>>>> Hi all,
>>>> 
>>>> Thank you for starting the discussion. To start with I have to say 
>>>> I am not entirely against leaving them. On the other hand I totally 
>>>> disagree that the semantics are clearly defined. Actually the 
>>>> design is fundamentally flawed.
>>>> 
>>>>   1. We use String as a selector for elements. This is not the cleanest
>>>>   design, but I agree it is not the worst.
>>>>   2. Users cannot define different types for different splits.
>>>>   3. (The actual reason why I think it's actually better to drop the
>>>>   split/select and introduce a better mechanism) The behavior of a
>> split is
>>>>   to actually add an output selector. We can have just a single
>> selector on a
>>>>   single operator, but the API allows (I would even say 
>>>> encourages) to
>> create
>>>>   chains of split/select, which leads to undefined behavior. Take 
>>>> this
>> for
>>>>   example: ds.split().select("a", "b").select("c", "d"). Which 
>>>> tags
>> should be
>>>>   forwarded? ("a", "b", "c", "d") (union) or () (intersection). In 
>>>> my
>> opinion
>>>>   the most obvious answer in this case would be the intersection. Let's
>>>>   modify it slightly though and I would assume a different 
>>>> behavior
>> (the
>>>>   union)
>>>> 
>>>>           splitted = ds.split();
>>>> 
>>>>           splitted.select("a", "b").map()
>>>> 
>>>>           splitted.select("c", "d").map()
>>>> 
>>>> Taking the 3rd argument into consideration I would be in favor of
>> removing
>>>> the current mechanism. I think the side outputs serve the purpose 
>>>> much better with much cleaner semantics. I get the argument that 
>>>> users are
>> now
>>>> forced to use processFunction if they want to use the side outputs. 
>>>> If
>> this
>>>> is the main problem how about enabling them e.g. for flatMap as well?
>>>> 
>>>> Best,
>>>> 
>>>> Dawid
>>>> On 17/06/2019 08:51, Jark Wu wrote:
>>>> 
>>>> +1 to keep the split/select API. I think if there are some problems 
>>>> +with
>>>> the API, it's better to fix them instead of deprecating them.
>>>> And select/split are straightforward and convenient APIs. It's 
>>>> worth to have them.
>>>> 
>>>> Regards,
>>>> Jark
>>>> 
>>>> On Mon, 17 Jun 2019 at 14:46, vino yang <yanghua1...@gmail.com> <
>> yanghua1...@gmail.com> wrote:
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> I also think it is valuable and reasonable to keep the split/select
>> APIs.
>>>> They are very convenient and widely used in our platform. I think 
>>>> they
>> are
>>>> also used in other users' jobs.
>>>> If the community has doubts about this, IMHO, it would be better to
>> start a
>>>> user survey.
>>>> 
>>>> Best,
>>>> Vino
>>>> 
>>>> SHI Xiaogang <shixiaoga...@gmail.com> <shixiaoga...@gmail.com>
>> 于2019年6月17日周一 上午11:55写道：
>>>> 
>>>> 
>>>> Hi Xingcan,
>>>> 
>>>> Thanks for bringing it up for discusson.
>>>> 
>>>> I agree with you that we should not deprecate the split/select methods.
>>>> Their semantics are very clear and they are widely adopted by Flink
>>>> 
>>>> users.
>>>> 
>>>> We should fix these problems instead of simply deprecating the methods.
>>>> 
>>>> Regards,
>>>> Xiaogang
>>>> 
>>>> Xingcan Cui <xingc...@gmail.com> <xingc...@gmail.com> 于2019年6月15日周六
>> 下午4:13写道：
>>>> 
>>>> 
>>>> Hi all,
>>>> 
>>>> Recently, I noticed that the split/select methods in DataStream API
>>>> 
>>>> have
>>>> 
>>>> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA 
>>>> issue
>>>> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <
>> https://issues.apache.org/jira/browse/FLINK-11084>).
>>>> 
>>>> Although the two methods can be replaced by the more powerful side
>>>> 
>>>> output
>>>> 
>>>> feature[1], I still doubt whether we should really remove them in 
>>>> the future.
>>>> 
>>>> 1. From semantics, the split/select is the reverse operation to the
>>>> 
>>>> union
>>>> 
>>>> transformation. Without them, the DataStream API seems to be 
>>>> missing a piece.
>>>> 
>>>> 2. From accessibility, the side output only works for process
>>>> 
>>>> functions,
>>>> 
>>>> which means it forces the user to dive into a lower API.
>>>> 
>>>> According to FLINK-11084 <
>>>> https://issues.apache.org/jira/browse/FLINK-11084> <
>> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
>>>> problems with the current implementation of the two methods. Maybe 
>>>> we should fix the problems and re-active them again. Or if they 
>>>> really
>>>> 
>>>> need
>>>> 
>>>> to
>>>> 
>>>> be deprecated, we should at least mark the corresponding 
>>>> documentation
>>>> 
>>>> for
>>>> 
>>>> that : )
>>>> 
>>>> What do you think?
>>>> 
>>>> Best,
>>>> Xingcan
>>>> 
>>>> [1]
>>>> 
>>>> 
>>>> 
>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>> _output.html
>>>> 
>>>> <
>>>> 
>>>> 
>>>> 
>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>> _output.html
>>>> 
>>>> 
>> 
>> 
>

Re: About Deprecating split/select for DataStream API

Reply via email to