Re: About Deprecating split/select for DataStream API

Dawid Wysakowicz Mon, 17 Jun 2019 23:18:38 -0700

Hi all,

I think we are getting closer to a consensus. I think most of us already
agree that the current behavior is broken. The remaining difference I
see is that I think those problems are caused by the design of the
split/select method. The current contract of the split method is that it
is actually applied to the previous operation, rather than it creates a
new operator. (I don't think this is an implementation detail. This is
the contract of the API. It was described in docs, conference talks,
workshops etc.).


I agree that we could reimplement the split with side outputs in a way
that it would add additional operator in a chain and emit results via
side outputs. This would change the core concepts of the split method
though, making the "fixed" split method a completely new one with a new
behavior. Therefore I am in favor of dropping the old method and
introducing a new one. This would be an explicit information for the
users that this is something different, so that they can make a
conscious choice. Moreover if we reimplement the split method in a way
that it introduces additional operator and emits results via side
outputs, this is basically equivalent to enabling side outputs in a
flatMap method. You can think of the split method as a flatMap method. I
see no benefit of having additional split method that would have a
limited functionality of such flatMap with side outputs.

So to sum up my stance:

1. I am against changing the behavior of current split/select methods.

2. I would be ok with introducing a similar, but new methods, with a new
behavior (but would prefer not to do that, we could achieve the same or
even more with existing APIs)

3. I would be in favor of enabling side outputs for flatMap method (if
it is possible)

Best,

Dawid

On 18/06/2019 03:45, Dian Fu wrote:
> Hi all,
>
> Thanks a lot for the discussion. I'm also in favor of rewriting/redesigning 
> the split/select API instead of removing them. It has been a consensus that 
> the side output API can achieve all the functionalities of the split/select 
> API. The problem is whether we should also support some easy-to-use APIs on 
> top of it. IMO, we should do that as long as the APIs have clear semantic and 
> wide usage scenario. I think split/select API is such a kind of API. 
>
> Regards,
> Dian
>
>> 在 2019年6月18日，上午12:30，xingc...@gmail.com 写道：
>>
>> Hi all,
>>
>> Thanks for sharing your thoughts on this topic.
>>
>> First, we must admit that the current implementation for split/select is 
>> flawed. I roughly went through the source codes, the problem may be that for 
>> consecutive select/split(s), the former one will be overridden by the later 
>> one during StreamGraph generation phase. That's why we forbid this 
>> consecutive logic in FLINK-11084.
>>
>> Now the question is whether we should guide users to migrate to the new side 
>> output feature or thoroughly rework the broken API with the correct 
>> semantics (instead of just trying to forbid all the "invalid" usages). 
>>
>> Personally, I prefer the later solution because
>>
>> 1. The split/select may have been widely used without touching the broken 
>> part.
>> 2. Though restricted compared with side output, the semantics for 
>> split/select itself is acceptable since union does not support different 
>> data types either.
>> 3. We need a complete and easy-to-use transformation set for DataStream API. 
>> Enabling side output for flatMap may not be an ultimate solution.
>>
>> To summarize, maybe we should not easily deprecate the split/select public 
>> API. If we come to a consensus on that, how about rewriting it based on side 
>> output? (like the implementation for join on coGroup)
>>
>> Any feedback is welcome : )
>>
>> Best,
>> Xingcan
>>
>> -----Original Message-----
>> From: SHI Xiaogang <shixiaoga...@gmail.com> 
>> Sent: Monday, June 17, 2019 8:08 AM
>> To: Dawid Wysakowicz <dwysakow...@apache.org>
>> Cc: dev@flink.apache.org
>> Subject: Re: About Deprecating split/select for DataStream API
>>
>> Hi Dawid,
>>
>> Thanks a lot for your example.
>>
>> I think most users will expect splitted1 to be empty in the example.
>>
>> The unexpected results produced, in my opinion, is due to our problematic 
>> implementation, instead of the confusing semantics.
>> We can fix the problem if we add a SELECT operator to filter out unexpected 
>> records (Of course, we can find some optimization to improve the 
>> efficiency.).
>>
>> After all, i prefer to fix the problems to make the results as expected.
>> What do you think?
>>
>> Regards,
>> Xiaogang
>>
>> Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午7:21写道：
>>
>>> Yes you are correct. The problem I described applies to the split not 
>>> select as I wrote in the first email. Sorry for that.
>>>
>>> I will try to prepare a correct example. Let's have a look at this example:
>>>
>>>    val splitted1 = ds.split(if (1) then "a")
>>>
>>>    val splitted2 = ds.split(if (!=1) then "a")
>>>
>>> In those cases splitted1.select("a") -> will output all elements, the 
>>> same for splitted2, because the OutputSelector(s) are applied to 
>>> previous operator. The behavior I would assume is that splitted1 
>>> outputs only "1"s, whereas splitted2 all but "1"s
>>>
>>> On the other hand in a call
>>>
>>>    val splitted1 = ds.split(if ("1" or "2") then 
>>> "a").select("a").split(if ("3") then "b").select("b")
>>>
>>> I would assume an intersection of those two splits, so no results. 
>>> What actually happens is that it will be "1", "2" & "3"s. Actually, 
>>> right exceptions should be thrown in those cases not to produce 
>>> confusing results, but this just shows that this API is broken, if we 
>>> need to check for some prohibited configurations during runtime.
>>>
>>> Those weird behaviors are in my opinion results of the flawed API, as 
>>> it actually assigns an output selector to the previous operator. In 
>>> other words it modifies previous operator. I think it would be much 
>>> cleaner if this happened inside an operator rather than separately. 
>>> This is what SideOutputs do, as you define them inside the 
>>> ProcessFunction, rather than afterwards. Therefore I am very much in 
>>> favor of using them for those cases. Once again if the problem is that 
>>> they are available only in the ProcessFunction I would prefer enabling 
>>> them e.g. in FlatMap, rather than keeping the split/select.
>>>
>>>
>>>
>>> On 17/06/2019 09:40, SHI Xiaogang wrote:
>>>> Hi Dawid,
>>>>
>>>> As the select method is only allowed on SplitStreams, it's 
>>>> impossible to construct the example ds.split().select("a", 
>>>> "b").select("c", "d").
>>>>
>>>> Are you meaning ds.split().select("a", "b").split().select("c", "d")?
>>>> If so, then the tagging in the first split operation should not 
>>>> affect
>>> the
>>>> second one. Then
>>>>    splitted.select("a", "b") => empty
>>>>    splitted.select("c", "d") => ds
>>>>
>>>> I cannot quite catch your point here. It's appreciated if you can
>>> provide a
>>>> more concrete explanation?
>>>>
>>>> Regards,
>>>> Xiaogang Shi
>>>>
>>>>
>>>>
>>>>
>>>> Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午3:10写道：
>>>>
>>>>> Hi all,
>>>>>
>>>>> Thank you for starting the discussion. To start with I have to say 
>>>>> I am not entirely against leaving them. On the other hand I totally 
>>>>> disagree that the semantics are clearly defined. Actually the 
>>>>> design is fundamentally flawed.
>>>>>
>>>>>   1. We use String as a selector for elements. This is not the cleanest
>>>>>   design, but I agree it is not the worst.
>>>>>   2. Users cannot define different types for different splits.
>>>>>   3. (The actual reason why I think it's actually better to drop the
>>>>>   split/select and introduce a better mechanism) The behavior of a
>>> split is
>>>>>   to actually add an output selector. We can have just a single
>>> selector on a
>>>>>   single operator, but the API allows (I would even say 
>>>>> encourages) to
>>> create
>>>>>   chains of split/select, which leads to undefined behavior. Take 
>>>>> this
>>> for
>>>>>   example: ds.split().select("a", "b").select("c", "d"). Which 
>>>>> tags
>>> should be
>>>>>   forwarded? ("a", "b", "c", "d") (union) or () (intersection). In 
>>>>> my
>>> opinion
>>>>>   the most obvious answer in this case would be the intersection. Let's
>>>>>   modify it slightly though and I would assume a different 
>>>>> behavior
>>> (the
>>>>>   union)
>>>>>
>>>>>           splitted = ds.split();
>>>>>
>>>>>           splitted.select("a", "b").map()
>>>>>
>>>>>           splitted.select("c", "d").map()
>>>>>
>>>>> Taking the 3rd argument into consideration I would be in favor of
>>> removing
>>>>> the current mechanism. I think the side outputs serve the purpose 
>>>>> much better with much cleaner semantics. I get the argument that 
>>>>> users are
>>> now
>>>>> forced to use processFunction if they want to use the side outputs. 
>>>>> If
>>> this
>>>>> is the main problem how about enabling them e.g. for flatMap as well?
>>>>>
>>>>> Best,
>>>>>
>>>>> Dawid
>>>>> On 17/06/2019 08:51, Jark Wu wrote:
>>>>>
>>>>> +1 to keep the split/select API. I think if there are some problems 
>>>>> +with
>>>>> the API, it's better to fix them instead of deprecating them.
>>>>> And select/split are straightforward and convenient APIs. It's 
>>>>> worth to have them.
>>>>>
>>>>> Regards,
>>>>> Jark
>>>>>
>>>>> On Mon, 17 Jun 2019 at 14:46, vino yang <yanghua1...@gmail.com> <
>>> yanghua1...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I also think it is valuable and reasonable to keep the split/select
>>> APIs.
>>>>> They are very convenient and widely used in our platform. I think 
>>>>> they
>>> are
>>>>> also used in other users' jobs.
>>>>> If the community has doubts about this, IMHO, it would be better to
>>> start a
>>>>> user survey.
>>>>>
>>>>> Best,
>>>>> Vino
>>>>>
>>>>> SHI Xiaogang <shixiaoga...@gmail.com> <shixiaoga...@gmail.com>
>>> 于2019年6月17日周一 上午11:55写道：
>>>>>
>>>>> Hi Xingcan,
>>>>>
>>>>> Thanks for bringing it up for discusson.
>>>>>
>>>>> I agree with you that we should not deprecate the split/select methods.
>>>>> Their semantics are very clear and they are widely adopted by Flink
>>>>>
>>>>> users.
>>>>>
>>>>> We should fix these problems instead of simply deprecating the methods.
>>>>>
>>>>> Regards,
>>>>> Xiaogang
>>>>>
>>>>> Xingcan Cui <xingc...@gmail.com> <xingc...@gmail.com> 于2019年6月15日周六
>>> 下午4:13写道：
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Recently, I noticed that the split/select methods in DataStream API
>>>>>
>>>>> have
>>>>>
>>>>> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA 
>>>>> issue
>>>>> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <
>>> https://issues.apache.org/jira/browse/FLINK-11084>).
>>>>> Although the two methods can be replaced by the more powerful side
>>>>>
>>>>> output
>>>>>
>>>>> feature[1], I still doubt whether we should really remove them in 
>>>>> the future.
>>>>>
>>>>> 1. From semantics, the split/select is the reverse operation to the
>>>>>
>>>>> union
>>>>>
>>>>> transformation. Without them, the DataStream API seems to be 
>>>>> missing a piece.
>>>>>
>>>>> 2. From accessibility, the side output only works for process
>>>>>
>>>>> functions,
>>>>>
>>>>> which means it forces the user to dive into a lower API.
>>>>>
>>>>> According to FLINK-11084 <
>>>>> https://issues.apache.org/jira/browse/FLINK-11084> <
>>> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
>>>>> problems with the current implementation of the two methods. Maybe 
>>>>> we should fix the problems and re-active them again. Or if they 
>>>>> really
>>>>>
>>>>> need
>>>>>
>>>>> to
>>>>>
>>>>> be deprecated, we should at least mark the corresponding 
>>>>> documentation
>>>>>
>>>>> for
>>>>>
>>>>> that : )
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Best,
>>>>> Xingcan
>>>>>
>>>>> [1]
>>>>>
>>>>>
>>>>>
>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>>> _output.html
>>>>> <
>>>>>
>>>>>
>>>>>
>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>>> _output.html
>>>>>
>>>

signature.asc
Description: OpenPGP digital signature

Re: About Deprecating split/select for DataStream API

Reply via email to